Logo  

CS456 - Systems Programming

Adding variables to Shunting Yard

Note: in this part we start with the code from the original shunting code, not the code from lesson 8 which demonstrates how the action() function can be used to generate assembly rather than simply interpreting the actions in C.

In this part we want to add variables to our expressions. Our variables will just quad-word sized integers to keep things simple. To add support for variables we first need to update the lexer to add support for identifiers (i.e. a name that identifies a value.) In many languages the lexical rule for identifiers are words that start with a letter or underscore (_) and are then followed by zero or more letters, numbers or underscores.

Changes to the lexer

In lex.c we add support for identifiers with:

  1. Add a T_IDENTIFIER token (in lex.h):
typedef enum {
  T_PLUS, T_MINUS, T_MULT, T_DIV, T_EXP, T_ASSIGN, T_OPAREN, T_CPAREN,
  T_NUMBER, T_IDENTIFIER, T_EOE, T_UNKNOWN
} token_t;

Note we also add an T_ASSIGN token type, for the assignment operator. It is also added to our tables. FYI: the numbers for the precedence of these operators is (mostly) taken from the C operator precedence table where the lowest precedence operator (commas) start at 5 and go up in steps of 5.

//               +,  -,  *,  /, **,  =,  (,  )
int  prec[] = { 60, 60, 65, 65, 70, 10,  0,  0};
int uprec[] = { 75, 75, -1, -1, -1, -1, -1, -1};
int   dir[] = {  0,  0,  0,  0,  0,  1,  0,  0};
int  udir[] = {  1,  1,  0,  0,  0,  0,  0,  0};

  1. Add char *buf parameter to the lex() function: token_t lex(int *value, char *buf)
    • buf will contain the identifier string when the token type is identifier. This could also be used for string and numeric tokens (eliminating the need for the value parameter.)
  2. in the default case of the switch(c) { of the lex() function, add:
    default:
      if (isalpha(c) || c == '_') return identifier(c,buf);
      if (isdigit(c)) {
        ...

Thus words that start with a letter or _ will be considered identifiers.

  1. Add the following function to lex identifiers:
token_t identifier(char c, char *buf) {
  int i = 0;
  buf[i++] = c;
  for(c = get(); isalpha(c) || isdigit(c) || c == '_'; c = get())
    buf[i++] = c;
  unget();
  buf[i] = '\0';
  return T_IDENTIFIER;
}


Changes to the Shunting Yard program

That should complete the changes needed to the lexer. The changes needed to the shunting yard are as follows:

  1. Add a symbol table for variables, we'll just use a linked list and add lookup function to find a symbol or add it if it isn't found:
/**
 * "symbol table" for variables.
 */
struct var {
  char *name;       // The identifiers name
  int64_t val;      // Its value
  struct var *next;
} *vars = NULL;     // The head of the "symbol table"

/**
 * Finds a "symbol", if it doesn't exist, allocate a new one. We return the
 * address of it's value which is pushed onto the number stack in lieu of a
 * value.
 */
int64_t *lookup(char *name) {
  struct var *v;
  for(v=vars; v != NULL; v=v->next)
    if (strcmp(v->name, name) == 0) return &(v->val);
  v = malloc(sizeof(struct var));
  v->name = strdup(name);
  v->val = 0;
  v->next = vars;
  vars = v;

  return &(v->val);
}

A symbol table is mapping of symbolic names to (often unique) values, usually integer values. In our case we'll map our identifier symbols to the address of their values. A proper implementation of a basic symbol table would take some care to implement the search functionality as optimally as possible, so either a balanced binary search tree (BST) or ideally a hash table, but for simplicity (and since we'll have few symbols anyway,) we'll just use a linked list which is easy to implement.

A real implementation would have a nested, multi-level symbol table capable of keeping track of block-local, local and global symbols. In such a implementation a new table would be created when entering a "block" with a pointer to it's "parent" table, the newly created table would then be that blocks symbol table. A variable definition is normally the only way to add a symbol to the symbol table. Some language may by default add an unknown symbol to either the nearest symbol table or the global one. When searching for a symbol in the table, you begin with the table that is associated with the block you are in, and if the symbol is not found, try the parent table, until you reach the top-level global table, and if not found, either add it or signal an error (such as "undefined variable".)

  1. Modify the number stack to be a "value" stack, instead of just a sequence of numbers, it will be a sequence of value structures so we can keep track of the type of value being pushed onto the stack (numeric constant vs. address.)
// Value stack:
struct val {
  int64_t num;  // Num may be a unsigned 64 bit address if type == ADDRESS
  value_t type; // If so, it will need to be cast back into an address.
} vstack[K];
int vsp = 0;

void push_val(int64_t n, value_t type) {
  vstack[vsp++] = (struct val){.num = n, .type = type};
}

struct val pop_val(void) {
  if (vsp <= 0) die("Stack underflow");
  return vstack[--vsp];
}

/**
 * This is a helper function to give us a real value from the val struct. If
 * the type is VALUE, just return v.num as is, otherwise type-cast it back
 * to an address and de-ereference it to get the value at the address.
 */
int64_t gv(struct val v) {
  return (v.type == VALUE)? v.num : *(int64_t *)v.num;
}

  1. Update the main loop to use the new value stack and handle identifiers:
    case T_NUMBER:
      if (flag != NUMBER) die("Syntax error");
      push_val(num, VALUE);
      flag = OPERATOR;
      break;
    case T_IDENTIFIER:
      addr = lookup(buf);
      push_val((int64_t)addr, ADDRESS);
      flag = OPERATOR;
      break;

  1. Finally update action():
void action(void)
{
  struct op op;
  struct val l, r;
  int64_t v;

  // Get the operator:
  op = pop_op();

  // Right side is at the top of the stack, left would be underneath it,
  // unless it's a unary op
  r = pop_val();
  if (op.unary == FALSE) l = pop_val();
  else l = (struct val){.num = 0, .type = VALUE}; // 0 as a "value"

  if (uprec[op.op] == -1 && op.unary) die("Malformed unary expression in action.");
  // Perform the operation:
  switch(op.op) {
    case T_PLUS:  v = gv(l) + gv(r); break;
    case T_MINUS: v = gv(l) - gv(r); break;
    case T_MULT:  v = gv(l) * gv(r); break;
    case T_DIV:   v = gv(l) / gv(r); break;
    case T_EXP:   v = (int64_t)pow((double)gv(l),(double)gv(r)); break;
    case T_ASSIGN:
      if (l.type != ADDRESS) die("left side of assignment is not an lval.\n");
      *(int64_t *)l.num = (v = gv(r));
      break;
    default:
      die("Illegal operation.");
  }
  // Push the result onto the stack:
  push_val(v, VALUE);
}

Note the use of gv() to get the real value of a "value" structure, except in the assignment case, where we want the address of the left value (lval) as we need to store the result of the right side at the address (by casting l.num to an address and then de-reference it to store the value.) We still push the value of the right side of the assignment to the stack so that it can be used by other operations, for example:

a = b = 5

The b = 5 part will be evaluated first (= is right-to-left associative), the result will be 5, which will then be assigned to a.

Other changes

One additional change to the main function is getting input from either stdin or a file and reading lines of input rather than using strings from the command line. Note that we re-start the lexer for each input line. Ideally we would just hand the lexer a file descriptor and it can be read input directly from the input stream.

The die() function also now prints (approximately) where in the input the parsing failed. Ideally it would record the position in the input prior to each token fetch as some tokens can take many characters (identifiers, strings, etc.)