Logo  

CS456 - Systems Programming

Compiling the Abstract Syntax Tree (AST)

main code:

The code in main just setups the prelude to the code generation, including the syscalls header, making sure the library functions printnumnl (print a number with a newline) and exit (call the exit system call) are declared global, then setup the data section with each variable declared as a quad- word with a default value of 0. Then setup the text section and set the _start label and make it global and start the compilation. Finish with a call to exit and that is the program.

  printf("%%include \"syscalls.h\"\n");
  printf("extern printnumnl, exit\n");

  printf("SECTION .data\n");
  for(var_t *v = vars; v; v=v->next) {
    printf("\t%s: dq 0\n", v->name);
  }

  printf("\nSECTION .text\n");

  printf("\nGLOBAL _start\n_start:\n");
  compile(hp, NULL, NULL);
  printf("\ncall\texit\n");

Compiling the statements

To compile the a statement we want to convert it to it's assembly template form. Lets use the while loop as the example. The template for a while loop would be:

.while:
        ; Evaluate the expression, result of which is at the top of the stack
        POP     rax             ; pop it into rax
        CMP     rax, 0          ; compare with 0 (i.e. false)
        JE      .wend           ; Exit loop when expression becomes false

        ; statement code here

        JMP .while          ; start the next loop iteration
.wend:
        ; rest of program


All we then need to do is emit the right assembly code at the right places. One issue is to create unique labels for each statement, this can be done be appending an increasing number to each label. The following code takes a label prefix and appends to it just such a number to generate a unique label:

int labelnum = 0;

char *newlabel(char *prefix)
{
  char buf[K];
  sprintf(buf, "%s%d", prefix, labelnum++);
  return strdup(buf);
}

Now we further consider how the while statement is compiled:

      case ST_WHILE:
        l1 = newlabel(".while");
        l2 = newlabel(".wend");
        printf("%s:\n", l1);
        compile_expr(n->w->expr);
        printf("\tPOP   rax\n");
        printf("\tCMP   rax, 0\n");
        printf("\tJE    %s\n\n", l2);
        compile(n->w->body, l1, l2);
        printf("\tJMP   %s\n", l1);
        printf("%s:\n", l2);
        free(l1); free(l2);
        break;


Here we allocate the two labels needed for our template, stored in the string pointers l1 and l2 and begin emitting the code. Where the expression and statement body go are just calls to the functions to emit the code for those things. In the recursive call to compile the l1 label becomes the contlbl (continue label) for any continue statements (basically a jump back up to the top of the loop.) In a for-loop the continue statement would jump to where the increment expression is located. l2 becomes the brklbl or the label defining the exit point for our statement template. Note in the full listing that the if statement does not use it's own labels to replace contlbl or brklbl as break and continue only apply to the most recent loop parent statement.

void compile(stnode_t *st, char *contlbl, char *brklbl)
{
  char *l1, *l2;
  stnode_t *n;

  for(n = st; n; n=n->next) {
    switch(n->type) {
      case ST_EXPRESSION:
        compile_expr(n->expr);
        // Result must be removed from the stack and discarded:
        printf("\tPOP   rax\n");
        break;
      case ST_PRINT:
        compile_expr(n->expr);
        printf("\tPOP   rax\n");
        printf("\tCALL  printnumnl\n\n");
        break;
      case ST_CONTINUE:
        if (contlbl)
          printf("\tJMP %s\n", contlbl);
        else die("No loop to continue from.\n");
        break;
      case ST_BREAK:
        if (brklbl)
          printf("\tJMP %s\n", brklbl);
        else die("No loop to break from.\n");
        break;
      case ST_WHILE:
        l1 = newlabel(".while");
        l2 = newlabel(".wend");
        printf("%s:\n", l1);
        compile_expr(n->w->expr);
        printf("\tPOP   rax\n");
        printf("\tCMP   rax, 0\n");
        printf("\tJE    %s\n\n", l2);
        compile(n->w->body, l1, l2);
        printf("\tJMP   %s\n", l1);
        printf("%s:\n", l2);
        free(l1); free(l2);
        break;
      case ST_IF:
        l1 = newlabel(".ifend");
        if (n->i->elsebody) l2 = newlabel(".else");

        compile_expr(n->i->expr);
        printf("\tPOP   rax\n");
        printf("\tCMP   rax, 0\n");
        printf("\tJE    %s\n", n->i->elsebody ? l2 : l1);
        compile(n->i->body,contlbl,brklbl);
        if (n->i->elsebody) {
          printf("%s:\n", l2);
          compile(n->i->elsebody,contlbl,brklbl);
        }
        printf("%s:\n", l1);
        free(l1);
        if (n->i->elsebody) free(l2);
        break;
    }
  }
}

Compiling an expression:

When compiling an expression there is some difference to just walking the tree and emitting the code, particularly when handling leaf nodes. Leaf nodes have no left or right to walk and so are "push" only, i.e. they only push values to the stack and perform no pop operations prior to pushing their values. I modify the expression handling code slightly to move the push only operations to the top and perform them separately. Note how the leaf node case statements return from the function rather than continuing.

If the op is not a number or variable then recursively compile the left side of the expression (except in the case of an assignment) then the right side. After the recursive calls we issue two pop instructions, first for the left side of the operation (placed into rbx), then the right side (placed into rax). For an assignment only one pop is required, as only the right side of the operator pushed anything to the stack. Note that I pop the "right" side into rax (normally the "left" side register) for assignments because we also push the result of the assignment onto the stack as well, and I always assume that the result is in rax (thus the final PUSH rax at the bottom.) This negates having to move rbx into rax for the assignment or emitting a special PUSH rbx just for the assignment operation.

Then the operation is performed with the rax and rbx registers representing the left and right side of the expressions respectively. The result of the expression, assumed to be in rax, is then finally pushed onto the stack.

Boolean operators (including the comparison operations) require jumps and so labels are also allocated for these operations and code is emitted for those operations in "template" form. Note how for the Boolean OR or AND operations we compare the left and right side against 0 (false), as false has only one numeric value (i.e. 0), whereas true is anything else. The result of a Boolean (Logical) AND or OR should always just be a 0 or 1, so it is not sufficient to using the Binary AND or OR operations to perform these operations.

void compile_expr(enode_t *e)
{
  char *l1, *l2;

  // Leaf nodes are "push" only:
  switch(e->op) {
    case T_NUMBER:
      printf("\tPUSH    %ld\n", e->value);
      return;
    case T_IDENTIFIER:
      printf("\tMOV rax, [%s]\n", e->var->name);
      printf("\tPUSH    rax\n\n");
      return;
    default:
      break;
  }

  // Don't "evaluate" the lval of an assignment.
  if (e->left && e->op != T_ASSIGN) compile_expr(e->left);
  if (e->right) compile_expr(e->right);

  if (e->op != T_ASSIGN) printf("\tPOP  rbx\n");
  printf("\tPOP rax\n");
  switch(e->op) {
    case T_PLUS:
      printf("\tADD rax, rbx\n");
      break;
    case T_MINUS:
      printf("\tSUB rax, rbx\n");
      break;
    case T_MULT:
      printf("\tIMUL    rbx\n");
      break;
    case T_DIV:
      printf("\tCQO\n");
      printf("\tIDIV    rbx\n");
      break;
    case T_MOD:
      printf("\tCQO\n");
      printf("\tIDIV    rbx\n");
      printf("\tMOV rax, rdx\n");
      break;
    case T_OR:
      l1 = newlabel(".nz");
      l2 = newlabel(".f");
      printf("\tCMP rax, 0\n");
      printf("\tJNE %s\n",l1);
      printf("\tCMP rbx, 0\n");
      printf("\tJNE %s\n",l1);
      printf("\tMOV rax, 0\n");
      printf("\tJMP %s\n",l2);
      printf("%s:\n",l1);
      printf("\tMOV rax, 1\n");
      printf("%s:\n",l2);
      free(l1);
      free(l2);
      break;
    case T_AND:
      l1 = newlabel(".z");
      l2 = newlabel(".f");
      printf("\tCMP rax, 0\n");
      printf("\tJE  %s\n",l1);
      printf("\tCMP rbx, 0\n");
      printf("\tJE  %s\n",l1);
      printf("\tMOV rax, 1\n");
      printf("\tJMP %s\n",l2);
      printf("%s:\n",l1);
      printf("\tMOV rax, 0\n");
      printf("%s:\n",l2);
      free(l1);
      free(l2);
      break;
    case T_LT:
      l1 = newlabel(".lt");
      l2 = newlabel(".lt");
      printf("\tCMP rax, rbx\n");
      printf("\tJL  %s\n", l1);
      printf("\tMOV rax, 0\n");
      printf("\tJMP %s\n", l2);
      printf("%s:\n", l1);
      printf("\tMOV rax, 1\n");
      printf("%s:\n", l2);
      free(l1); free(l2);
      break;
    case T_GT:
      l1 = newlabel(".gt");
      l2 = newlabel(".gt");
      printf("\tCMP rax, rbx\n");
      printf("\tJG  %s\n", l1);
      printf("\tMOV rax, 0\n");
      printf("\tJMP %s\n", l2);
      printf("%s:\n", l1);
      printf("\tMOV rax, 1\n");
      printf("%s:\n", l2);
      free(l1); free(l2);
      break;
    case T_EQUAL:
      l1 = newlabel(".eq");
      l2 = newlabel(".eq");
      printf("\tCMP rax, rbx\n");
      printf("\tJE  %s\n", l1);
      printf("\tMOV rax, 0\n");
      printf("\tJMP %s\n", l2);
      printf("%s:\n", l1);
      printf("\tMOV rax, 1\n");
      printf("%s:\n", l2);
      free(l1); free(l2);
      break;
    case T_ASSIGN:
      printf("\tMOV [%s], rax\n", e->left->var->name);
      break;
    default:
      break;
  }
  printf("\tPUSH    rax\n\n");
}