Scanning

Last updated Apr 24, 2023

# Scanning

The first job of an interpreter/compiler is to scan the raw source code as characters and group them into something meaningful.

# Lexemes

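A lexeme is the smallest sequence of characters that still represents something. Take var language = "lox"; as an example. The lexemes here are

var
language
=
"lox"
;

In this grouping process, we can gather other useful information about each lexeme, such as its type and the line it appears on. A lexeme bundled with that information is a token.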

# Token type

We could categorize a token by comparing its raw lexeme against known strings, but that is slow. Looking through individual characters should be delegated to the scanner. The parser, on the other hand, just needs to know which kind of lexeme a token represents, so the scanner records that kind as an enum value. E.g.

enum TokenType {
  // Single-character tokens.
  LEFT_PAREN, RIGHT_PAREN, LEFT_BRACE, RIGHT_BRACE,
  COMMA, DOT, MINUS, PLUS, SEMICOLON, SLASH, STAR,

  // One or two character tokens.
  BANG, BANG_EQUAL,
  EQUAL, EQUAL_EQUAL,
  GREATER, GREATER_EQUAL,
  LESS, LESS_EQUAL,

  // Literals.
  IDENTIFIER, STRING, NUMBER,

  // Keywords.
  AND, CLASS, ELSE, FALSE, FUN, FOR, IF, NIL, OR,
  PRINT, RETURN, SUPER, THIS, TRUE, VAR, WHILE,

  EOF
}
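
To connect the enum to the rest of the scanner, here is a minimal sketch of a Token class that bundles a lexeme with its type. The field names and the toString() format are assumptions chosen to match the new Token(type, text, literal, line) call that appears in addToken(); the enum is trimmed to a few entries, and TokenSketch is only a hypothetical wrapper so the example compiles on its own.

```java
// Hypothetical wrapper class so the sketch is self-contained.
class TokenSketch {
  // Trimmed copy of the enum above, just enough for the example.
  enum TokenType { VAR, IDENTIFIER, EQUAL, STRING, SEMICOLON, EOF }

  // Assumed shape: mirrors the Token(type, text, literal, line) constructor call.
  static final class Token {
    final TokenType type;  // which kind of lexeme this is
    final String lexeme;   // the raw text taken from the source
    final Object literal;  // runtime value for literals, otherwise null
    final int line;        // source line, useful for error messages

    Token(TokenType type, String lexeme, Object literal, int line) {
      this.type = type;
      this.lexeme = lexeme;
      this.literal = literal;
      this.line = line;
    }

    @Override
    public String toString() {
      return type + " " + lexeme + " " + literal;
    }
  }

  public static void main(String[] args) {
    Token t = new Token(TokenType.STRING, "\"lox\"", "lox", 1);
    System.out.println(t); // prints: STRING "lox" lox
  }
}
```

The parser can then branch on token.type, a cheap enum comparison, instead of re-inspecting the lexeme text.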

# Regex as an alternative

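Lexical grammar: the rules for how a programming language groups characters into lexemes.

Regular language: a language whose lexical grammar can be defined by regular expressions. When that holds, the scanner can in principle be written as, or generated from, a set of regexes instead of hand-written character matching.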
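
As a rough illustration of the regex route, the sketch below scans a small subset of the lexemes with java.util.regex. It is not the approach these notes follow: the class name RegexScannerSketch, the group names, and the exact patterns are all assumptions made for the example, and keywords such as var are simply reported as identifiers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-driven scanner for a small subset of lexemes.
class RegexScannerSketch {
  // One named group per kind of lexeme; alternatives are tried left to right.
  private static final Pattern LEXEME = Pattern.compile(
      "(?<STRING>\"[^\"]*\")"
      + "|(?<NUMBER>\\d+(?:\\.\\d+)?)"
      + "|(?<IDENTIFIER>[A-Za-z_][A-Za-z0-9_]*)"
      + "|(?<PUNCT>[(){},.;=+*-])"
      + "|(?<SPACE>\\s+)");

  private static final String[] KINDS = {"STRING", "NUMBER", "IDENTIFIER", "PUNCT"};

  static List<String> scan(String source) {
    List<String> out = new ArrayList<>();
    Matcher m = LEXEME.matcher(source);
    int pos = 0;
    while (pos < source.length()) {
      m.region(pos, source.length());
      // lookingAt() only succeeds if a lexeme starts exactly at pos.
      if (!m.lookingAt()) {
        throw new IllegalArgumentException("Unexpected character at offset " + pos);
      }
      pos = m.end();
      if (m.group("SPACE") != null) continue; // whitespace produces no lexeme
      for (String kind : KINDS) {
        if (m.group(kind) != null) out.add(kind + " " + m.group(kind));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    for (String lexeme : scan("var language = \"lox\";")) {
      System.out.println(lexeme);
    }
  }
}
```

Run on var language = "lox"; this yields five entries, one per lexeme listed earlier. A fuller version would need a keyword table and the remaining operators.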

# Scanner algorithm

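Use two offset variables, start and current, to index into the source string: start points at the first character of the lexeme being scanned, and current points at the character currently being considered. Recognising single-character lexemes can be done with a simple switch statement on that character.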

private void scanToken() {
  char c = advance();
  switch (c) {
    case '(': addToken(LEFT_PAREN); break;
    case ')': addToken(RIGHT_PAREN); break;
    case '{': addToken(LEFT_BRACE); break;
    case '}': addToken(RIGHT_BRACE); break;
    case ',': addToken(COMMA); break;
    case '.': addToken(DOT); break;
    case '-': addToken(MINUS); break;
    case '+': addToken(PLUS); break;
    case ';': addToken(SEMICOLON); break;
    case '*': addToken(STAR); break;
  }
}

advance() consumes the next character in the source file and returns it.

private char advance() {
  return source.charAt(current++);
}

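addToken() grabs the text of the current lexeme, the substring from start to current, and creates a new token of the given type. scanToken() calls it with only a type, which assumes a one-argument overload that passes null as the literal.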

private void addToken(TokenType type, Object literal) {
  String text = source.substring(start, current);
  tokens.add(new Token(type, text, literal, line));
}
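
The pieces above can be put together roughly as follows. This is a sketch under assumptions rather than the full scanner: the scanTokens() loop, isAtEnd(), the one-argument addToken() overload, the whitespace handling, and the exception thrown for unknown characters are filled in here so that the example runs on its own, and MiniScanner is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, self-contained scanner for single-character lexemes only.
class MiniScanner {
  enum TokenType { LEFT_PAREN, RIGHT_PAREN, LEFT_BRACE, RIGHT_BRACE,
    COMMA, DOT, MINUS, PLUS, SEMICOLON, STAR, EOF }

  static final class Token {
    final TokenType type; final String lexeme; final Object literal; final int line;
    Token(TokenType type, String lexeme, Object literal, int line) {
      this.type = type; this.lexeme = lexeme; this.literal = literal; this.line = line;
    }
  }

  private final String source;
  private final List<Token> tokens = new ArrayList<>();
  private int start = 0;   // first character of the lexeme being scanned
  private int current = 0; // next character to be consumed
  private int line = 1;

  MiniScanner(String source) { this.source = source; }

  List<Token> scanTokens() {
    while (!isAtEnd()) {
      start = current; // every iteration begins a new lexeme
      scanToken();
    }
    tokens.add(new Token(TokenType.EOF, "", null, line));
    return tokens;
  }

  private boolean isAtEnd() { return current >= source.length(); }

  private void scanToken() {
    char c = advance();
    switch (c) {
      case '(': addToken(TokenType.LEFT_PAREN); break;
      case ')': addToken(TokenType.RIGHT_PAREN); break;
      case '{': addToken(TokenType.LEFT_BRACE); break;
      case '}': addToken(TokenType.RIGHT_BRACE); break;
      case ',': addToken(TokenType.COMMA); break;
      case '.': addToken(TokenType.DOT); break;
      case '-': addToken(TokenType.MINUS); break;
      case '+': addToken(TokenType.PLUS); break;
      case ';': addToken(TokenType.SEMICOLON); break;
      case '*': addToken(TokenType.STAR); break;
      case '\n': line++; break;               // assumption: track lines here
      case ' ': case '\r': case '\t': break;  // assumption: skip whitespace
      default: // assumption: fail loudly instead of silently dropping input
        throw new IllegalArgumentException("Unexpected character '" + c + "' on line " + line);
    }
  }

  private char advance() { return source.charAt(current++); }

  // Assumed overload for tokens that carry no literal value.
  private void addToken(TokenType type) { addToken(type, null); }

  private void addToken(TokenType type, Object literal) {
    String text = source.substring(start, current);
    tokens.add(new Token(type, text, literal, line));
  }

  public static void main(String[] args) {
    for (Token t : new MiniScanner("({*}\n;").scanTokens()) {
      System.out.println(t.type + " '" + t.lexeme + "' line " + t.line);
    }
  }
}
```

Because start is reset at the top of each loop iteration and advance() moves current forward, source.substring(start, current) always covers exactly the lexeme just consumed.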

Brendan Ang
