CS456 - Systems Programming

Tokenization (lexical analysis)

Tokenization is the process of converting strings into "tokens", which are often numbers. A token usually then refers to a specific word or symbol in a language.

You can use #defines or better yet, enumerated types for the token values:

typedef enum { T_WORD1, T_WORD2, ... } token_t;

typedef defines an alias for a type or structure.

typedef real-type alias

enum's are symbolic names for integer values, starting at 0. T_WORD1 would be 0, T_WORD2 would be 1, etc. The names can be given specific values, i.e.:

enum {T_WORD1, T_WORD2=5, T_WORD3 }

T_WORD1 == 0, T_WORD2 == 5, T_WORD3 == 6

token_t can now be used as the type for a variable:

token_t token;

typedef enum { TRUE=1, FALSE=0 } bool;
bool flag = TRUE;

Why not strtok()?

strtok() is not good as a tokenizer, because it:

modifies the source string,
has severe limitations on the kinds of "tokens" it can handle
does not return an integer token value. Integers representing keywords and tokens are more easily handled in if statements and switches.

Tokenizing an expression:

Take an arithmetic expression composed of: +, -, *, **, /, ()'s and numbers. We wish to make tokens for each "word" and ignore all white-space:

  typedef enum {
    T_NUMBER, T_PLUS, T_MINUS, T_MULT, T_DIV, T_EXP, T_OPAREN, TCPAREN, T_EOE,
    T_UNKNOWN
  } token_t;

T_EOE will represent the end of the expression (or end of input.)
T_UNKNOWN for any unknown symbol.

When making our own tokenizer. It should:

not modify the input,
ignore all white-space and
return token values representing the token type.

Input is to be taken from a string:

char *input;
int pos = 0;

// Gets the next character, return -1 on end of input
int get(void) {
  if (input[pos] == '\0') return -1;
  return input[pos++];
}

// Unget a character:
void unget(int c)
{
  if (pos && c != -1) pos--;
}

// If the next character is c, then move past it and return TRUE, otherwise
// do nothing and return FALSE
bool next(int c) {
  if (input[pos] == c) {
    pos++;
    return TRUE;
  }
  return FALSE;
}

/**
 * Returns the next token. Fills value in with the value of a number for
 * T_NUMBER tokens.
 */
token_t lex(int *value) {
  int c, n;

  // Skip over any white-space:
  do {
    c = get();
  } while (isspace(c));

  switch(c) {
    case -1: return T_EOE;
    case '(': return T_OPAREN;
    case ')': return T_CPAREN;
    case '+': return T_PLUS;
    case '-': return T_MINUS;
    case '*':
      if (next('*')) return T_EXP;
      return T_MULT;
    case '/': return T_DIV;
    default:
      if (isdigit(c)) {
        n=0;
        do {
          n *= 10;
          n += (c-'0');
        } while (isdigit(c = get()));
        *value = n;
        unget(c);
        return T_NUMBER;
      }
      // optionally return the back character back to the input.
      unget(c);
      return T_UNKNOWN;
  }
  // NOT REACHED
  return T_UNKNOWN;
}

More advanced example handling: `<`, `<=`, `<<`, `<<=`:

    case '<':
      if (next('=')) return T_LE;       // <=
      if (next('<')) {                  // << or <<=
        if (next('=')) return T_SHL_ASSIGN;
        return T_SHL;
      }
      return T_LT;                      // Just <

Exercise:

Create a tokenizer for <, <<, <<-, <<<, >, >>, &>, &>>, |, ||, |&, &, &&, quoted strings and unquoted words. (These are bash shell symbols)

Making a simple calculator:

There are three kinds of arithmetic notation, concerning the location of the operators with respect to their operands.

infix: The operator appears in between the operands, i.e: 2 + 2
prefix: (polish): The operator appears before the operands, i.e.: + 2 2
- This is identical to typical function call prefix, i.e.: add(2,2)
postfix: (reverse-polish): The operator appears after the operands, i.e.: 2 2 +

A postfix or reverse polish notation (RPN) calculator is relatively easy to create. It requires only a stack for the numbers. When a number is encountered it is pushed on the stack. When an operator is encountered, it pops two values off the stack, performs the operation and pushes the result back on the stack. After all input is processed, the top value (which should be the only value) is pop'ed off the stack and printed as the result.

Examples
infix	`2 + (3 * 4) + 5`
postfix	`2 3 4 * + 5 +`
with ()'s:	`((2 (3 4 *) +) 5 +)`

Using the tokenizer created above to create a RPN calculator.

rpn.c

int numstack[K];
int sp = 0;

void push(int n) {
  numstack[sp++] = n;
}

int pop(void) {
  if (sp) return numstack[--sp];
  // should throw an error here (stack underflow).
  return 0;
}

int main() {
  ...
  while ((tok = lex(&val) != T_EOE) {
    switch(tok) {
      case T_NUMBER: push(val); break;
      case T_PLUS: push(pop() + pop()); break;
      case T_MULT: push(pop() * pop()); break;
      // Order of the operands matters for - and /.
      case T_MINUS:
        // Right side of any RPN expression is the top value on the stack.
        r = pop();
        l = pop();
        push(l - r);
        break;
      case T_DIV:
        r = pop();
        l = pop();
        push(l / r);
        break;
    }
  }
  printf("%d\n", pop());
}