Brendan Ang

Search

Search IconIcon to open search

Scanning

Last updated Apr 24, 2023 Edit Source

# Scanning

The first job of an interpreter/compiler is to scan the raw source code as characters and group them into something meaningful.

# Lexemes

A lexeme is smallest sequence of characters which represents something. var language = "lox"; The lexemes here are

# Token type

We can categorize tokens from a raw lexeme by comparing strings, but that is slow. Looking through individual characters should be delegated to the Scanner. The parser on the other hand, just needs to know which kind of lexeme it represents. E.g.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
enum TokenType {
  // Single-character tokens.
  LEFT_PAREN, RIGHT_PAREN, LEFT_BRACE, RIGHT_BRACE,
  COMMA, DOT, MINUS, PLUS, SEMICOLON, SLASH, STAR,

  // One or two character tokens.
  BANG, BANG_EQUAL,
  EQUAL, EQUAL_EQUAL,
  GREATER, GREATER_EQUAL,
  LESS, LESS_EQUAL,

  // Literals.
  IDENTIFIER, STRING, NUMBER,

  // Keywords.
  AND, CLASS, ELSE, FALSE, FUN, FOR, IF, NIL, OR,
  PRINT, RETURN, SUPER, THIS, TRUE, VAR, WHILE,

  EOF
}

# Regex as an alternative

Lexical grammar: the rules for how a programming language groups characters into lexemes.

Regular Language: if the lexical grammar can be defined by regular expressions.

# Scanner algorithm

Use 2 offset variables start and current to index into the string. Recognising lexemes can be done with simple match statements.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
private void scanToken() {
    char c = advance();
    switch (c) {
      case '(': addToken(LEFT_PAREN); break;
      case ')': addToken(RIGHT_PAREN); break;
      case '{': addToken(LEFT_BRACE); break;
      case '}': addToken(RIGHT_BRACE); break;
      case ',': addToken(COMMA); break;
      case '.': addToken(DOT); break;
      case '-': addToken(MINUS); break;
      case '+': addToken(PLUS); break;
      case ';': addToken(SEMICOLON); break;
      case '*': addToken(STAR); break; 
    }
  }
1
2
3
  private char advance() {
	return source.charAt(current++);
  }
1
2
3
4
private void addToken(TokenType type, Object literal) {
    String text = source.substring(start, current);
    tokens.add(new Token(type, text, literal, line));
  }

# Longer Lexemes

To handle scanning longer lexemes, we use a lookahead. After detecting the start of a lexeme, we pass control over to some lexeme specific code that consumes characters until it reaches the end.

# Literals

Strategy is similar to longer lexemes. For strings, we start consuming when we see a “, for numbers, we start when we see a digit.

# Reserved words and Identifiers

Maximal munch principle: when two lexical grammar rules can both match a chunk of code that the scanner is looking at, whichever one matches the most characters wins.

nil_identifier is not matched as nil <= is matched as <= and not < followed by =