Now that we should have some idea how to form an expression tree, we want to add control structures, such as conditionals and loops to form the beginnings of a language. Most computer languages are composed of a sequence of statements, of which the most common would be:
To define these control structures we should first define their grammar, or syntax, that is the sequence of keywords, tokens, expressions and statements that comprise and then convert that grammar into data structures that represent the control structure.
Most computer grammars are context-free in that the meaning of words in the language do not vary depending on their location with respect to the other words in the language as opposed to context sensitive grammars who's words do change meaning depending on the surrounding grammar.
C is mostly context free (typedefs throw a wrench into the works,) but we'll try to use it's grammar for our tiny language. To define our grammar we'll use a notation called Extended Backus-Naur form (or EBNF) which was born of Backus-Naur form for defining context-free formal grammars, then improved by incorporating concepts from other notation languages such as Wirth notation.
EBNF consists of terminal symbols (letters, numbers, tokens and keywords,) and non-terminal production rules which define the rules which terminal and other non-terminal symbols are combined to form the grammar. In EBNF terminals are quoted strings and non-terminals are unquoted identifiers.
A production rule is a non-terminal on the left side of an equal sign followed by one or more vertical bar (|) separated rules on the right side that define the expansion of the non-terminal with the following notations:
Usage Notation definition = concatenation , termination ; alternation | optional [ ... ] repetition { ... } grouping ( ... ) terminal string " ... " terminal string ' ... ' comment (* ... *)
=
,
;
|
[
]
{
}
(
)
"
'
(*
*)
Example:
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ; integer = [ "-" ], { digit } ;
Our languages grammar:
expression = ... statement = expression ";" | "if" "(" expression ")" statement [ "else" statement ] | "while" "(" expression ")" statement | "print" "(" expression ")" ";" | "break" ";" | "continue ";" | "{" { statement } "}" ;
We know how to parse expressions and create an expression tree that can be walked during run-time to produce a result, all that we need now is to create a tree of statements, known as an Abstract Syntax Tree (AST) that would be walked in a similar manner. First we define the structure for a statement:
// Our statement types: typedef enum { ST_EXPRESSION, ST_WHILE, ST_IF, ST_PRINT, ST_BREAK, ST_CONTINUE } stmt_t; struct statement { stmt_t type; union { enode_t *expr; // Used for expression and "print" statements struct while_st *w; // Structure for a while-statement struct if_st *i; // Structure for a if/if-else statement }; struct statement *next; };
A block statement is just a sequence of statements (i.e. follow the next link until it is NULL). Here we also use an anonymous union to preserve space since only one of the pointers inside the union will be used depending on the statement type. Here are the structures for while and if statements:
struct while_st { enode_t *expr; stnode_t *body; }; struct if_st { enode_t *expr; // If elsebody == NULL then there is no else clause stnode_t *body, *elsebody; };
To properly parse a program we need to be able to preserve the state of the current token between several different functions and only advance to the next token when we are at a place where we can handle that token. We also will need to add our keywords (words in the language that are reserved for the language alone, so as to eliminate ambiguity with identifiers.) We'll also modify the lexer to take input from a FILE stream.
First to add our keywords, we need a token value for each:
typedef enum { T_PLUS, T_MINUS, T_MULT, T_DIV, T_MOD, T_OR, T_AND, T_LT, T_GT, T_EQUAL, T_ASSIGN, T_OPAREN, T_CPAREN, T_NUMBER, T_IDENTIFIER, T_IF, T_ELSE, T_WHILE, T_PRINT, T_BREAK, T_CONTINUE, T_OCBRACE, T_CCBRACE, T_EOS, T_EOI, T_UNKNOWN } token_t;
The above adds support for each keyword and {}'s and ; (End of Statement (T_EOS)) symbols. Next we define what our current token should be:
{}
// Current token with buffer space or value: struct cur_tok { token_t tok; char buf[K]; int64_t value; };
Next we modify the startlex, get, unget and next functions:
startlex
get
unget
next
FILE *input; struct cur_tok _cur; int _cstack[100]; int _csp = 0; void startlex(FILE *fp) { input = fp; // Immediately get the first token: _cur.tok = lex(&_cur.value, _cur.buf); } // Gets the next character, returns EOF (-1) on end of input, pops from the // unget character stack first if it's not empty. int get(void) { if (_csp) return _cstack[--_csp]; return fgetc(input); } // Unget a character, pushes it to the unget character stack: void unget(int c) { if (c != EOF) _cstack[_csp++] = c; } // If the next character matches c, then move past it and return TRUE, // otherwise put it back and return FALSE bool next(int c) { int ch = get(); if (ch == c) return TRUE; unget(ch); return FALSE; }
Adding support for {, } and ; is straightforward, but adding keyword support is done in the identifier routine:
identifier
token_t identifier(int c, char *buf) { int p = 0; buf[p++] = c; for(c=get(); isalpha(c) || isdigit(c) || c == '_'; c = get()) buf[p++] = c; buf[p] = '\0'; unget(c); if (strcmp(buf, "if") == 0) return T_IF; if (strcmp(buf, "else") == 0) return T_ELSE; if (strcmp(buf, "while") == 0) return T_WHILE; if (strcmp(buf, "print") == 0) return T_PRINT; if (strcmp(buf, "break") == 0) return T_BREAK; if (strcmp(buf, "continue") == 0) return T_CONTINUE; return T_IDENTIFIER; }
If more keywords were to be added it would be better to use a binary search on a sorted list of keywords to check if the given identifier in buf is a keyword or not.