# Scanning
The first job of an interpreter/compiler is to scan the raw source code as characters and group them into something meaningful.
# Lexemes
A lexeme is the smallest sequence of characters that represents something meaningful. For example:

`var language = "lox";`

The lexemes here are
- `var`
- `language`
- `=`
- `"lox"`
- `;`

In this grouping process, we can gather other useful information.
# Token type
We could categorize tokens by comparing raw lexeme strings, but that is slow. Character-level work belongs in the scanner; the parser just needs to know which kind of lexeme each token represents. So when the scanner recognizes a lexeme, it also records a token type on the token.
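The reference implementation is in Java; here is a minimal Python sketch of a token-type enumeration and a token record. The set of types is abbreviated and the field names are illustrative:

```python
from dataclasses import dataclass
from enum import Enum, auto

class TokenType(Enum):
    # A few representative types; the full set for Lox is larger.
    VAR = auto()
    IDENTIFIER = auto()
    EQUAL = auto()
    STRING = auto()
    SEMICOLON = auto()
    EOF = auto()

@dataclass
class Token:
    type: TokenType   # which kind of lexeme this is
    lexeme: str       # the raw source text
    literal: object   # parsed value for literals, else None
    line: int         # source line, for error reporting

# The parser can now test token.type instead of comparing strings.
tok = Token(TokenType.VAR, "var", None, 1)
```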
# Regex as an alternative
Lexical grammar: the rules for how a programming language groups characters into lexemes.
Regular language: a language whose lexical grammar can be defined by regular expressions.
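For instance, the usual identifier and number rules can each be written as a regular expression. The patterns below are my own paraphrase of those rules:

```python
import re

# Each lexical grammar rule expressed as a regular expression.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
NUMBER = re.compile(r"[0-9]+(\.[0-9]+)?")

assert IDENTIFIER.fullmatch("language")
assert NUMBER.fullmatch("3.14")
assert not NUMBER.fullmatch("3.")  # digits are required after the dot
```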
# Scanner algorithm
Use two offset variables, `start` and `current`, to index into the string: `start` marks the first character of the lexeme being scanned, and `current` points at the character currently being considered.
Recognising lexemes can be done with simple match statements.
`advance()` consumes the next character in the source file.
`addToken()` grabs the text representing the current lexeme and creates a new token of the corresponding token type.
# Longer Lexemes
To handle scanning longer lexemes, we use lookahead: after detecting the start of a lexeme, we hand control to lexeme-specific code that consumes characters until it reaches the end of the lexeme.
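One-character lookahead can be sketched like this: a `peek()` helper looks at the next character without consuming it, which lets a `//` line comment be skipped by consuming up to, but not past, the newline. This is a standalone paraphrase, not the book's code:

```python
def skip_line_comment(source, current):
    """Given that the '//' has just been consumed, advance to end of line."""
    def peek():
        # Look at the next character without consuming it.
        return source[current] if current < len(source) else "\0"

    while peek() != "\n" and current < len(source):
        current += 1  # lookahead said it is safe to consume
    return current

src = "// a comment\nvar"
assert src[skip_line_comment(src, 2)] == "\n"  # stops at the newline
```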
# Literals
The strategy is similar to longer lexemes: for strings, we start consuming when we see a `"`; for numbers, we start when we see a digit.
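Both literal scanners can be sketched as standalone helpers that return the literal value and the index just past it. The book does this inside the scanner class; these free functions are a simplification (no escape sequences, no line counting):

```python
def scan_string(source, start):
    # start points at the opening quote; consume until the closing quote.
    current = start + 1
    while current < len(source) and source[current] != '"':
        current += 1
    if current >= len(source):
        raise SyntaxError("Unterminated string.")
    value = source[start + 1:current]  # trim the surrounding quotes
    return value, current + 1          # skip past the closing quote

def scan_number(source, start):
    current = start
    while current < len(source) and source[current].isdigit():
        current += 1
    # Only consume a '.' if a digit follows it (one character of lookahead).
    if (current + 1 < len(source) and source[current] == "."
            and source[current + 1].isdigit()):
        current += 1
        while current < len(source) and source[current].isdigit():
            current += 1
    return float(source[start:current]), current

assert scan_string('var s = "lox";', 8) == ("lox", 13)
assert scan_number("3.14;", 0) == (3.14, 4)
```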
# Reserved words and Identifiers
Maximal munch principle: when two lexical grammar rules can both match a chunk of code that the scanner is looking at, whichever one matches the most characters wins.
- `nil_identifier` is matched as one identifier, not as the keyword `nil`.
- `<=` is matched as `<=`, not as `<` followed by `=`.
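Maximal munch can be sketched as follows: munch the longest run of identifier characters first, and only then check whether the whole lexeme is a reserved word; likewise, try the two-character operator before its one-character prefix. The keyword set and type names here are abbreviated and illustrative:

```python
KEYWORDS = {"var", "nil", "and", "or"}  # abbreviated; Lox has more

def scan_identifier(source, start):
    current = start
    while current < len(source) and (source[current].isalnum()
                                     or source[current] == "_"):
        current += 1
    text = source[start:current]
    # Only after munching the longest match do we decide keyword vs identifier.
    kind = "KEYWORD" if text in KEYWORDS else "IDENTIFIER"
    return kind, text

def scan_operator(source, start):
    # Two-character operators win over their one-character prefixes.
    if source.startswith("<=", start):
        return "LESS_EQUAL", start + 2
    if source.startswith("<", start):
        return "LESS", start + 1
    raise SyntaxError("Expected '<' or '<='.")

assert scan_identifier("nil_identifier", 0) == ("IDENTIFIER", "nil_identifier")
assert scan_identifier("nil;", 0) == ("KEYWORD", "nil")
assert scan_operator("<=", 0) == ("LESS_EQUAL", 2)
```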