Introduction to Lexical Analysis

CSC 310 - Programming Languages

Outline

Informal sketch of lexical analysis
- Identifies tokens in input stream
Issues in lexical analysis
- Lookahead
- Ambiguities
Specifying lexers
- Regular Expressions

Lexical Analysis

The goal of lexical analysis is to partition an input string into substrings where each substring is a token.

Example:

if (i == j)
    z = 0;
else
    z = 1;

is a string of characters:

if (i == j)\n\tz = 0;else\n\tz = 1;

A lexical analyzer is called a lexer or a scanner

Tokens

A token corresponds to a set of strings
These sets depend on the programming language
Examples:
- Identifiers: strings of letters or digits starting with a digit
- Integer: a non-empty string of digits
- Keyword (reserved word): “if”, “else”, \(\ldots\)
- Whitespace: a non-empty sequence of spaces, newlines, and tabs

What are Tokens used for?

Classify program substrings according to role
The output of lexical analysis is a stream of tokens
The input to the parser is a stream of tokens
The parser relies on token distinctions, for example, an identifier is treated differently than a keyword

Designing an Lexical Analyzer: Step 1

Define a finite set of tokens
- Tokens describe all items of interest
- Choice of tokens depends on language
Example: recall
```
if (i == j)\n\tz = 0;else\n\tz = 1;
```
Useful tokens:
Integer, Keyword, Relation, Identifier, Whitespace, (, ), =, ;

Designing an Lexical Analyzer: Step 2

Describe which strings belong to each token
Recall:
- Identifiers: strings of letters or digits starting with a digit
- Integer: a non-empty string of digits
- Keyword (reserved word): “if”, “else”, \(\ldots\)
- Whitespace: a non-empty sequence of spaces, newlines, and tabs

Lexical Analyzer: Implementation

The implementation of a lexical analyzer must do two things:
1. Recognize substrings corresponding to tokens
2. Return the value or lexeme of the token; the lexeme is the substring

Example

Example: recall
```
if (i == j)\n\tz = 0;\nelse\tz = 1;
```
Token-lexeme groupings:
- Identifier: i, j, z
- Keyword: if, else
- Relation: ==
- Integer: 0, 1
- Single characters: (, ), =, ;

Why do Lexical Analysis?

Simplify parsing
- The lexer usually discards “uninteresting” tokens, for example, whitespace and comments
- Converts data early
Separate the logic to read source files
- Potentially an issue on multiple platforms
- Can optimize reading source files independently of the parser

Difficulties

Lexical analysis can be difficult depending on the source language
Example: in FORTRAN whitespace is insignificant
- VAR1 is the same as VA R1
- Consider DO 5 I = 1,25 versus DO 5 I = 1.25
- Reading left-to-right, we cannot determine if DO5I is a variable or DO statement until after “,” is reached
Important points:
- The goal is to partition the string reading left-to-right, recognizing one token at a time
- “Lookahead” may be required to decide where the token boundaries are

Review

The goal of lexical analysis is to:
- Partition the input string into lexemes (the smallest program units that individually meaningful)
- Identify the token of each lexeme
Left-to-right scan where sometimes lookahead is required

We still need
- A way to describe the lexemes of each token
- A way to resolve ambiguities
  - Is if two variables i and f or one keyword?
  - Is == two equal signs or one operator?

Regular Languages

There are several formalisms for specifying tokens
Regular languages are the most popular
- Simple and useful theory
- Easy to understand
- Efficient implementations

Languages

Definition. Let \(\Sigma\) be a set of characters. A language over \(\Sigma\) is a set of strings of characters drawn from \(\Sigma\). \(\Sigma\) is called the alphabet.

Examples of Languages

Natural language
- Alphabet: English characters
- Language: English sentences
- Note: not every string of English characters is an English sentence
Programming language
- Alphabet: ASCII
- Language: C programs
- Note: The ASCII character set is different from the English character set

Regular Expressions

The lexical structure of most programming languages can be specified with regular expressions.
Languages are sets of strings - we need some notation for specifying which sets we want, that is, which strings are in the set.
A regular expression (RE) is a notation for a regular language
If \(A\) is a regular expression, then we write \(L(A)\) to refer to the language denoted by \(A\).

Fundamental Regular Expressions

\(A\)	\(L(A)\)	Notes
a	{a}	singleton set for each symbol ‘a’ in the alphabet \(\Sigma\)
\(\epsilon\)	{\(\epsilon\)}	empty string
\(\varnothing\)	{ }	empty language

These are the basic building blocks of regular expressions.

Operations on Regular Expressions

\(A\)	\(L(A)\)	Notes
\(rs\)	\(L(r) L(s)\)	concatenation – \(r\) followed by \(s\)
\(r \| s\)	\(L(r) \cup L(s)\)	combination (union) – \(r\) or \(s\)
\(r*\)	\(L(r)*\)	zero or more occurrences of \(r\) (Kleene closure)

Precedence: \(*\) (highest), concatenation, \(|\) (lowest)
Parenthesis can be used to group REs as needed
We abbreviate ‘i’ ‘f’ as ‘if’ (concatenation)

Examples

\(L\)(if \(|\) then \(|\) else) = {“if”, “then”, “else”}
\(L((0 | 1) (0 | 1))\) = {“00”, “01”, “10”, “11”}
\(L(0*)\) = {"“,”0“,”00“,”000", \(\ldots\) }
\(L((1 | 0)(1 | 0)*)\) = set of binary numbers with possible leading zeros

Abbreviations

Abbreviation	Meaning	Notes
\(r+\)	\((rr*)\)	one or more occurrences
\(r?\)	\((r \| \epsilon)\)	zero or one occurrence
\([a-z]\)	\((a \| b \| \ldots \| z)\)	one character in given range
\([abxyz]\)	\((a \| b \| x \| y \| z)\)	one of the given characters
\([\)^\(abc]\)	\(\overline{[abc]}\)	any character except the given characters

The basic operations generate all possible regular expressions, but common abbreviations are used for convenience.

Introduction to Lexical Analysis

CSC 310 - Programming Languages

Outline

Lexical Analysis

Tokens

What are Tokens used for?

Designing an Lexical Analyzer: Step 1

Designing an Lexical Analyzer: Step 2

Lexical Analyzer: Implementation

Example

Why do Lexical Analysis?

Difficulties

Review

Next

Regular Languages

Languages

Examples of Languages

Regular Expressions

Fundamental Regular Expressions

Operations on Regular Expressions

Examples

Abbreviations