Introduction to Lexical Analysis
Outline
Informal sketch of lexical analysis
- Identifies tokens in input stream
Issues in lexical analysis
Lookahead
Ambiguities
Specifying lexers
- Regular Expressions
Lexical Analysis
The goal of lexical analysis is to partition an input string into substrings where each substring is a token.
Example:
if (i == j) z = 0; else z = 1;
is a string of characters:
if (i == j)\n\tz = 0;else\n\tz = 1;
A lexical analyzer is also called a lexer or a scanner
Tokens
A token corresponds to a set of strings
These sets depend on the programming language
Examples:
Identifiers: strings of letters or digits starting with a letter
Integer: a non-empty string of digits
Keyword (reserved word): “if”, “else”, \(\ldots\)
Whitespace: a non-empty sequence of spaces, newlines, and tabs
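The idea that each token is a set of strings can be sketched in Python. The class names follow the list above, but the exact patterns and the helper `classify` are illustrative assumptions, not part of any particular language definition. In particular, the sketch assumes that identifiers start with a letter and that keywords take priority over identifiers.

```python
import re

# Hypothetical token classes; each pattern describes one set of strings.
# Dict order is the priority order, so "if" is a Keyword, not an Identifier.
TOKEN_CLASSES = {
    "Keyword": r"if|else",
    "Identifier": r"[A-Za-z][A-Za-z0-9]*",
    "Integer": r"[0-9]+",
    "Whitespace": r"[ \n\t]+",
}

def classify(s):
    """Return the first token class whose set contains s, or None."""
    for name, pattern in TOKEN_CLASSES.items():
        if re.fullmatch(pattern, s):
            return name
    return None
```

With these assumptions, `classify("x1")` gives `"Identifier"`, while `classify("1x")` gives `None` because no class contains that string.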
What are Tokens used for?
Classify program substrings according to role
The output of lexical analysis is a stream of tokens
The input to the parser is a stream of tokens
The parser relies on token distinctions; for example, an identifier is treated differently from a keyword
Designing a Lexical Analyzer: Step 1
Define a finite set of tokens
Tokens describe all items of interest
Choice of tokens depends on language
Example: recall
if (i == j)\n\tz = 0;else\n\tz = 1;
Useful tokens:
Integer, Keyword, Relation, Identifier, Whitespace, (, ), =, ;
Designing a Lexical Analyzer: Step 2
Describe which strings belong to each token
Recall:
Identifiers: strings of letters or digits starting with a letter
Integer: a non-empty string of digits
Keyword (reserved word): “if”, “else”, \(\ldots\)
Whitespace: a non-empty sequence of spaces, newlines, and tabs
Lexical Analyzer: Implementation
The implementation of a lexical analyzer must do two things:
Recognize substrings corresponding to tokens
Return the value or lexeme of the token; the lexeme is the substring
Example
Example: recall
if (i == j)\n\tz = 0;else\n\tz = 1;
Token-lexeme groupings:
Identifier: i, j, z
Keyword: if, else
Relation: ==
Integer: 0, 1
Single characters: (, ), =, ;
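The two jobs of the implementation (recognize the substring, return the token together with its lexeme) can be sketched as a small Python scanner for the running example. The names `TOKEN_SPEC`, `MASTER`, and `tokenize` are hypothetical. The sketch also assumes that identifiers start with a letter, that listing "Keyword" before "Identifier" and "Relation" before the single characters is an acceptable way to choose between competing matches, and that whitespace is dropped.

```python
import re

# Hypothetical token specification for the running example.
# Earlier entries win when several patterns match at the same position.
TOKEN_SPEC = [
    ("Whitespace", r"[ \n\t]+"),
    ("Keyword", r"(?:if|else)\b"),          # \b keeps "ifx" from splitting
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Integer", r"[0-9]+"),
    ("Relation", r"=="),
    ("Single", r"[()=;]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Scan left to right and return a list of (token, lexeme) pairs."""
    pos = 0
    tokens = []
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if m.lastgroup != "Whitespace":     # discard uninteresting tokens
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

tokens = tokenize("if (i == j)\n\tz = 0;else\n\tz = 1;")
```

Here `tokens` starts with `("Keyword", "if")`, `("Single", "(")`, `("Identifier", "i")`, `("Relation", "==")`, and contains no whitespace entries.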
Why do Lexical Analysis?
Simplify parsing
The lexer usually discards “uninteresting” tokens, for example, whitespace and comments
Converts data early
Separate the logic to read source files
Potentially an issue on multiple platforms
Can optimize reading source files independently of the parser
Difficulties
Lexical analysis can be difficult depending on the source language
Example: in FORTRAN, whitespace is insignificant
VAR1 is the same as VA R1
Consider DO 5 I = 1,25 (a DO loop) versus DO 5 I = 1.25 (an assignment to the variable DO5I)
Reading left-to-right, we cannot determine whether DO5I is a variable or the start of a DO statement until after “,” is reached
Important points:
The goal is to partition the string reading left-to-right, recognizing one token at a time
“Lookahead” may be required to decide where the token boundaries are
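The FORTRAN case can be sketched as a lookahead check in Python. The function `classify_do` is a hypothetical helper, and the rule it applies is a deliberate simplification: it only looks for a comma after the first "=", and ignores parentheses, string literals, and every other detail of real FORTRAN.

```python
def classify_do(statement):
    """Guess whether a statement beginning with DO is a loop or an assignment.

    Simplified assumption: a comma to the right of the first '=' means
    the statement is a DO loop; otherwise it is an assignment.
    """
    compact = statement.replace(" ", "")   # whitespace is insignificant
    if not compact.startswith("DO") or "=" not in compact:
        return "other"
    rhs = compact.split("=", 1)[1]         # everything the scanner must look ahead over
    return "do-loop" if "," in rhs else "assignment"
```

The point of the sketch is that the decision about the first characters, DO, depends on text that appears much later in the line.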
Review
The goal of lexical analysis is to:
Partition the input string into lexemes (the smallest program units that are individually meaningful)
Identify the token of each lexeme
Left-to-right scan where sometimes lookahead is required
Next
We still need
A way to describe the lexemes of each token
A way to resolve ambiguities
Is if two variables i and f, or one keyword?
Is == two equal signs or one operator?
Regular Languages
There are several formalisms for specifying tokens
Regular languages are the most popular
Simple and useful theory
Easy to understand
Efficient implementations
Languages
- Definition. Let \(\Sigma\) be a set of characters. A language over \(\Sigma\) is a set of strings of characters drawn from \(\Sigma\). \(\Sigma\) is called the alphabet.
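As a small illustration of the definition, an alphabet and a language can be modeled in Python as sets. The alphabet `SIGMA`, the language `TWO_BIT_STRINGS`, and the helper `over_alphabet` are hypothetical names chosen for this sketch.

```python
SIGMA = {"0", "1"}                            # a hypothetical alphabet

def over_alphabet(s, sigma):
    """True if every character of s is drawn from sigma."""
    return all(ch in sigma for ch in s)

# One of many possible languages over SIGMA: all strings of length two.
TWO_BIT_STRINGS = {"00", "01", "10", "11"}
```

Every member of `TWO_BIT_STRINGS` passes `over_alphabet`, but a string such as "0" also passes without being in the language: being over the alphabet is necessary for membership, not sufficient.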
Examples of Languages
Natural language
Alphabet: English characters
Language: English sentences
Note: not every string of English characters is an English sentence
Programming language
Alphabet: ASCII
Language: C programs
Note: The ASCII character set is different from the English character set
Regular Expressions
The lexical structure of most programming languages can be specified with regular expressions.
Languages are sets of strings - we need some notation for specifying which sets we want, that is, which strings are in the set.
A regular expression (RE) is a notation for a regular language
If \(A\) is a regular expression, then we write \(L(A)\) to refer to the language denoted by \(A\).
Fundamental Regular Expressions
\(A\) | \(L(A)\) | Notes |
---|---|---|
a | {a} | singleton set for each symbol ‘a’ in the alphabet \(\Sigma\) |
\(\epsilon\) | {\(\epsilon\)} | empty string |
\(\varnothing\) | { } | empty language |
- These are the basic building blocks of regular expressions.
Operations on Regular Expressions
\(A\) | \(L(A)\) | Notes |
---|---|---|
\(rs\) | \(L(r) L(s)\) | concatenation – \(r\) followed by \(s\) |
\(r | s\) | \(L(r) \cup L(s)\) | combination (union) – \(r\) or \(s\) |
\(r*\) | \(L(r)*\) | zero or more occurrences of \(r\) (Kleene closure) |
Precedence: \(*\) (highest), concatenation, \(|\) (lowest)
Parentheses can be used to group REs as needed
We abbreviate ‘i’ ‘f’ as ‘if’ (concatenation)
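The three operations can be sketched directly as operations on Python sets of strings. The function names are hypothetical. Because the Kleene closure of a non-empty language is usually infinite, `star` takes an assumed bound `max_reps` and only builds the strings made from at most that many pieces.

```python
def concat(A, B):
    """L(rs): every string of A followed by every string of B."""
    return {a + b for a in A for b in B}

def union(A, B):
    """L(r | s): strings in either language."""
    return A | B

def star(A, max_reps):
    """Finite approximation of L(r*): at most max_reps pieces from A."""
    result = {""}                 # zero occurrences gives the empty string
    layer = {""}
    for _ in range(max_reps):
        layer = concat(layer, A)  # one more occurrence
        result |= layer
    return result

bit = union({"0"}, {"1"})
pairs = concat(bit, bit)          # the language of (0 | 1)(0 | 1)
zeros = star({"0"}, 3)            # the first few strings of 0*
```

Under these definitions `pairs` is the four-element set of two-bit strings, and `zeros` holds the empty string together with "0", "00", and "000".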
Examples
\(L\)(if \(|\) then \(|\) else) = {“if”, “then”, “else”}
\(L((0 | 1) (0 | 1))\) = {“00”, “01”, “10”, “11”}
\(L(0*)\) = {“”, “0”, “00”, “000”, \(\ldots\)}
\(L((1 | 0)(1 | 0)*)\) = set of binary numbers with possible leading zeros
Abbreviations
Abbreviation | Meaning | Notes |
---|---|---|
\(r+\) | \((rr*)\) | one or more occurrences |
\(r?\) | \((r | \epsilon)\) | zero or one occurrence |
\([a-z]\) | \((a | b | \ldots | z)\) | one character in given range |
\([abxyz]\) | \((a | b | x | y | z)\) | one of the given characters |
[^abc] | \(\overline{[abc]}\) | any character except the given characters |
- The basic operations generate all possible regular expressions, but common abbreviations are used for convenience.
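The abbreviations can be checked against their expansions with Python's `re` module, whose syntax happens to include all of them. This is a spot check on a handful of sample strings, not a proof of equivalence. The helper `same_language` and the sample list are assumptions made for the sketch.

```python
import re

def same_language(p, q, samples):
    """True if patterns p and q accept exactly the same sample strings."""
    return all(
        bool(re.fullmatch(p, s)) == bool(re.fullmatch(q, s)) for s in samples
    )

samples = ["", "a", "aa", "aaa", "b", "ab", "x", "z", "q"]
checks = {
    "r+ vs rr*": same_language(r"a+", r"aa*", samples),
    "r? vs (r | empty)": same_language(r"a?", r"(a|)", samples),
    "[a-c] vs (a | b | c)": same_language(r"[a-c]", r"(a|b|c)", samples),
    "[abxyz] vs (a | b | x | y | z)": same_language(r"[abxyz]", r"(a|b|x|y|z)", samples),
}
# [^abc] accepts any single character other than a, b, or c.
complement_ok = (
    re.fullmatch(r"[^abc]", "d") is not None
    and re.fullmatch(r"[^abc]", "a") is None
)
```

Each entry in `checks` compares an abbreviation with its expansion from the table. In `re`, the empty alternative in `(a|)` plays the role of \(\epsilon\).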