Note: in this part we start with the code from the original shunting code, not the code from lesson 8 which demonstrates how the action() function can be used to generate assembly rather than simply interpreting the actions in C.
action()
In this part we want to add variables to our expressions. Our variables will just quad-word sized integers to keep things simple. To add support for variables we first need to update the lexer to add support for identifiers (i.e. a name that identifies a value.) In many languages the lexical rule for identifiers are words that start with a letter or underscore (_) and are then followed by zero or more letters, numbers or underscores.
In lex.c we add support for identifiers with:
T_IDENTIFIER
lex.h
typedef enum { T_PLUS, T_MINUS, T_MULT, T_DIV, T_EXP, T_ASSIGN, T_OPAREN, T_CPAREN, T_NUMBER, T_IDENTIFIER, T_EOE, T_UNKNOWN } token_t;
Note we also add an T_ASSIGN token type, for the assignment operator. It is also added to our tables. FYI: the numbers for the precedence of these operators is (mostly) taken from the C operator precedence table where the lowest precedence operator (commas) start at 5 and go up in steps of 5.
T_ASSIGN
// +, -, *, /, **, =, (, ) int prec[] = { 60, 60, 65, 65, 70, 10, 0, 0}; int uprec[] = { 75, 75, -1, -1, -1, -1, -1, -1}; int dir[] = { 0, 0, 0, 0, 0, 1, 0, 0}; int udir[] = { 1, 1, 0, 0, 0, 0, 0, 0};
char *buf
lex()
token_t lex(int *value, char *buf)
buf
switch(c) {
default: if (isalpha(c) || c == '_') return identifier(c,buf); if (isdigit(c)) { ...
Thus words that start with a letter or _ will be considered identifiers.
token_t identifier(char c, char *buf) { int i = 0; buf[i++] = c; for(c = get(); isalpha(c) || isdigit(c) || c == '_'; c = get()) buf[i++] = c; unget(); buf[i] = '\0'; return T_IDENTIFIER; }
That should complete the changes needed to the lexer. The changes needed to the shunting yard are as follows:
/** * "symbol table" for variables. */ struct var { char *name; // The identifiers name int64_t val; // Its value struct var *next; } *vars = NULL; // The head of the "symbol table" /** * Finds a "symbol", if it doesn't exist, allocate a new one. We return the * address of it's value which is pushed onto the number stack in lieu of a * value. */ int64_t *lookup(char *name) { struct var *v; for(v=vars; v != NULL; v=v->next) if (strcmp(v->name, name) == 0) return &(v->val); v = malloc(sizeof(struct var)); v->name = strdup(name); v->val = 0; v->next = vars; vars = v; return &(v->val); }
A symbol table is mapping of symbolic names to (often unique) values, usually integer values. In our case we'll map our identifier symbols to the address of their values. A proper implementation of a basic symbol table would take some care to implement the search functionality as optimally as possible, so either a balanced binary search tree (BST) or ideally a hash table, but for simplicity (and since we'll have few symbols anyway,) we'll just use a linked list which is easy to implement.
A real implementation would have a nested, multi-level symbol table capable of keeping track of block-local, local and global symbols. In such a implementation a new table would be created when entering a "block" with a pointer to it's "parent" table, the newly created table would then be that blocks symbol table. A variable definition is normally the only way to add a symbol to the symbol table. Some language may by default add an unknown symbol to either the nearest symbol table or the global one. When searching for a symbol in the table, you begin with the table that is associated with the block you are in, and if the symbol is not found, try the parent table, until you reach the top-level global table, and if not found, either add it or signal an error (such as "undefined variable".)
// Value stack: struct val { int64_t num; // Num may be a unsigned 64 bit address if type == ADDRESS value_t type; // If so, it will need to be cast back into an address. } vstack[K]; int vsp = 0; void push_val(int64_t n, value_t type) { vstack[vsp++] = (struct val){.num = n, .type = type}; } struct val pop_val(void) { if (vsp <= 0) die("Stack underflow"); return vstack[--vsp]; } /** * This is a helper function to give us a real value from the val struct. If * the type is VALUE, just return v.num as is, otherwise type-cast it back * to an address and de-ereference it to get the value at the address. */ int64_t gv(struct val v) { return (v.type == VALUE)? v.num : *(int64_t *)v.num; }
case T_NUMBER: if (flag != NUMBER) die("Syntax error"); push_val(num, VALUE); flag = OPERATOR; break; case T_IDENTIFIER: addr = lookup(buf); push_val((int64_t)addr, ADDRESS); flag = OPERATOR; break;
void action(void) { struct op op; struct val l, r; int64_t v; // Get the operator: op = pop_op(); // Right side is at the top of the stack, left would be underneath it, // unless it's a unary op r = pop_val(); if (op.unary == FALSE) l = pop_val(); else l = (struct val){.num = 0, .type = VALUE}; // 0 as a "value" if (uprec[op.op] == -1 && op.unary) die("Malformed unary expression in action."); // Perform the operation: switch(op.op) { case T_PLUS: v = gv(l) + gv(r); break; case T_MINUS: v = gv(l) - gv(r); break; case T_MULT: v = gv(l) * gv(r); break; case T_DIV: v = gv(l) / gv(r); break; case T_EXP: v = (int64_t)pow((double)gv(l),(double)gv(r)); break; case T_ASSIGN: if (l.type != ADDRESS) die("left side of assignment is not an lval.\n"); *(int64_t *)l.num = (v = gv(r)); break; default: die("Illegal operation."); } // Push the result onto the stack: push_val(v, VALUE); }
Note the use of gv() to get the real value of a "value" structure, except in the assignment case, where we want the address of the left value (lval) as we need to store the result of the right side at the address (by casting l.num to an address and then de-reference it to store the value.) We still push the value of the right side of the assignment to the stack so that it can be used by other operations, for example:
gv()
l.num
a = b = 5
The b = 5 part will be evaluated first (= is right-to-left associative), the result will be 5, which will then be assigned to a.
b = 5
=
5
a
One additional change to the main function is getting input from either stdin or a file and reading lines of input rather than using strings from the command line. Note that we re-start the lexer for each input line. Ideally we would just hand the lexer a file descriptor and it can be read input directly from the input stream.
The die() function also now prints (approximately) where in the input the parsing failed. Ideally it would record the position in the input prior to each token fetch as some tokens can take many characters (identifiers, strings, etc.)
die()