Introduction to Parsing
Outline
Regular languages revisited
Parser overview
Context-free grammars (CFGs)
Derivations
Ambiguity
Syntax errors
Languages and Automata
Formal languages are important in computer science, especially in programming languages.
Regular languages are the weakest formal languages that are widely used
We also need to study context-free languages
Limitations of Regular Languages
Intuition: A finite automaton that runs long enough must repeat states
A finite automaton cannot remember the number of times it has visited a particular state
A finite automaton has finite memory, so:
it can only store which state it is currently in, and
cannot count, except up to a finite limit.
For example, the language of balanced parentheses is not regular: \(\{(^{i} )^{i} \; \vert \; i \geq 0 \}\)
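A minimal Python sketch (illustrative, not part of any compiler) makes the counting argument concrete: deciding membership in \(\{(^{i} )^{i}\}\) means comparing two unbounded counts, which no fixed set of states can do.

```python
def in_lang(s: str) -> bool:
    """Membership test for { (^i )^i : i >= 0 }.

    The test compares two unbounded counts -- something a finite
    automaton, with only finitely many states, cannot do.
    """
    i = s.count("(")
    return s == "(" * i + ")" * i
```

Here `in_lang("(())")` is true while `in_lang("(()")` is false; a DFA could only get this right up to some fixed nesting depth.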
The Role of the Parser
The parsing phase of a compiler can be thought of as a function:
Input: sequence of tokens from the lexer
Output: parse tree of the program
Not all sequences of tokens are programs, so a parser must distinguish between valid and invalid sequences of tokens
So, we need
a language for describing valid sequences of tokens, and
a method for distinguishing valid from invalid sequences of tokens.
Context-Free Grammars
Many programming language constructs have a recursive structure
For example, a statement has one of the forms:
if condition then statement else statement, or
while condition do statement, or
\(\ldots\)
Context-free grammars (CFGs) are a natural notation for this recursive structure
Context-Free Grammars
A context-free grammar consists of
A set of terminals \(T\)
A set of non-terminals \(N\)
A non-terminal start symbol \(S\)
A set of productions
Assuming that \(X \in N\), productions are of the form
\(X \rightarrow \epsilon\), or
\(X \rightarrow Y_1 Y_2 \ldots Y_n\) where \(Y_i \in N \cup T\)
Notational Conventions
In these lecture notes
Non-terminals are written in uppercase
Terminals are written in lowercase
The start symbol is the left-hand side of the first production
CFG Example
A fragment of a simple language \[\begin{aligned} STMT & \rightarrow if \; COND \; then \; STMT \; else \; STMT\\ STMT & \rightarrow while \; COND \; do \; STMT\\ STMT & \rightarrow \; id \; = \; int \end{aligned}\]
Notational abbreviation \[\begin{aligned} STMT & \rightarrow if \; COND \; then \; STMT \; else \; STMT\\ & \quad \vert \; while \; COND \; do \; STMT\\ & \quad \vert \; id \; = \; int \end{aligned}\]
CFG Example
- Classic CFG example: simple arithmetic expressions \[\begin{aligned} E & \rightarrow E * E\\ & \quad \vert \; E + E\\ & \quad \vert \; (E)\\ & \quad \vert \; id \end{aligned}\]
The Language of a CFG
Productions can be read as replacement rules
\(X \rightarrow Y_1 \ldots Y_n\) means that \(X\) can be replaced by \(Y_1 \ldots Y_n\)
\(X \rightarrow \epsilon\) means that \(X\) can be erased (replaced with the empty string)
The Language of a CFG: Key Idea
Begin with a string consisting of the start symbol \(S\)
Replace any non-terminal \(X\) in the string by a right-hand side of some production \(X \rightarrow Y_1 \ldots Y_n\)
Repeat step 2 until there are no non-terminals in the string
The Language of a CFG
- Let \(G\) be a context-free grammar with start symbol \(S\). Then the language of \(G\) (\(L(G)\)) is: \[\{a_1 \ldots a_n \; \vert \; S \overset{*}{\rightarrow} a_1 \ldots a_n \;\text{and every}\; a_i \in T \}\] where \[X_1 \ldots X_n \overset{*}{\rightarrow} Y_1 \ldots Y_m\] denotes \[X_1 \ldots X_n \rightarrow \ldots \rightarrow Y_1 \ldots Y_m\]
Terminals
A terminal has no rules for replacing it, hence the name terminal
Once a terminal is generated, it is permanent
Terminals ought to be the tokens of the language
Parentheses Example
Strings of balanced parentheses \(\{(^{i} )^{i} \; \vert \; i \geq 0 \}\)
Grammar \[\begin{aligned} S & \rightarrow (S)\\ & \quad \vert \; \epsilon \end{aligned}\]
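Going in the other direction — checking whether a given string can be generated by this grammar — admits a direct recursive-descent recognizer, one function per non-terminal. A hedged sketch (function names are ours):

```python
def parse_s(tokens, pos=0):
    """Match S -> ( S ) | epsilon starting at `pos`; return the
    position just past what was matched."""
    if pos < len(tokens) and tokens[pos] == "(":
        pos = parse_s(tokens, pos + 1)            # the nested S
        if pos < len(tokens) and tokens[pos] == ")":
            return pos + 1
        raise SyntaxError("expected ')'")
    return pos                                    # S -> epsilon

def recognize(s: str) -> bool:
    """A string is in the language iff S matches all of it."""
    try:
        return parse_s(list(s)) == len(s)
    except SyntaxError:
        return False
```

Note that `recognize("()()")` is false: the grammar generates only fully nested strings \((^{i})^{i}\), not arbitrary balanced sequences.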
Example
- A fragment of a simple language \[\begin{aligned} STMT & \rightarrow if \; COND \; then \; STMT \; else \; STMT\\ & \quad \vert \; while \; COND \; do \; STMT\\ & \quad \vert \; id \; = \; int\\ COND & \rightarrow (id == id)\\ & \quad \vert \; (id != id) \end{aligned}\]
Example Continued
Some elements of the language
id = int
if (id == id) then id = int else id = int
while (id != id) do id = int
while (id == id) do while (id != id) do id = int
Arithmetic Example
Simple arithmetic expressions: \[E \rightarrow E + E \; \vert \; E * E \; \vert \; (E) \; \vert \; id\]
Some elements of the language
id
(id)
(id) * id
id + id
Notes
The idea of a CFG is a big step
But,
Membership in a language is boolean; we also need the parse tree of the input
Must handle errors gracefully
Need an implementation of CFGs
Form of the grammar is important
Many grammars generate the same language
Parsing tools are sensitive to the grammar
Derivations and Parse Trees
A derivation is a sequence of productions \[S \rightarrow \ldots \rightarrow \ldots \rightarrow \ldots\]
A derivation can be depicted as a tree
The start symbol is the tree’s root
For a production \(X \rightarrow Y_1 \ldots Y_n\) add children \(Y_1 \ldots Y_n\) to node \(X\)
Derivation Example
Simple arithmetic expressions: \[E \rightarrow E + E \; \vert \; E * E \; \vert \; (E) \; \vert \; id\]
String \[id * id + id\]
Derivation Example
\[\begin{aligned} & E\\ \rightarrow & E + E\\ \rightarrow & E * E + E\\ \rightarrow & id * E + E\\ \rightarrow & id * id + E\\ \rightarrow & id * id + id \end{aligned}\]
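The derivation above can be replayed mechanically. In the sketch below, the grammar encoding and the production indices are ours, chosen so that the recorded sentential forms match the steps on this slide.

```python
# Productions of E -> E + E | E * E | ( E ) | id, indexed 0..3.
E_GRAMMAR = {"E": [["E", "+", "E"], ["E", "*", "E"], ["(", "E", ")"], ["id"]]}

def leftmost_steps(choices, grammar=E_GRAMMAR, start="E"):
    """Record every sentential form of a left-most derivation."""
    form = [start]
    steps = [" ".join(form)]
    for c in choices:
        # always expand the leftmost non-terminal
        i = next(k for k, sym in enumerate(form) if sym in grammar)
        form[i:i + 1] = grammar[form[i]][c]
        steps.append(" ".join(form))
    return steps
```

Calling `leftmost_steps([0, 1, 3, 3, 3])` reproduces the six sentential forms above, ending in `"id * id + id"`.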
Notes on Derivations
A parse tree has:
terminals at the leaves, and
non-terminals at the interior nodes
Reading the leaves left to right (an in-order traversal) yields the original input
The parse tree shows the association of the operations, the input string does not
Left-most and Right-most Derivations
The previous example was a left-most derivation
- At each step, replace the left-most non-terminal
There is an equivalent notion of a right-most derivation
- At each step, replace the right-most non-terminal
Right-most Derivation Example
\[\begin{aligned} & E\\ \rightarrow & E + E\\ \rightarrow & E + id\\ \rightarrow & E * E + id\\ \rightarrow & E * id + id\\ \rightarrow & id * id + id \end{aligned}\]
Derivations and Parse Trees
Note that right-most and left-most derivations have the same parse tree
The difference is the order in which branches are added
Summary of Derivations
We are not only interested in whether \(s \in L(G)\), we also need a parse tree for the input string \(s\)
A derivation defines a parse tree, but one parse tree may have many derivations
Left-most and right-most derivations are important in the parser implementation
Ambiguity
Grammar \[E \rightarrow E + E \; \vert \; E * E \; \vert \; (E) \; \vert \; id\]
The string \(id * id + id\) has two parse trees:
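The two trees can be written down as nested `(operator, left, right)` tuples. A toy evaluator (ours, purely for illustration) in which every `id` has the value 2 shows that the choice of tree changes the meaning of the program:

```python
# The two parse trees for id * id + id:
plus_at_root = ("+", ("*", "id", "id"), "id")   # reading: (id * id) + id
star_at_root = ("*", "id", ("+", "id", "id"))   # reading: id * (id + id)

def evaluate(tree, id_value):
    """Toy semantics: every id denotes the same value."""
    if tree == "id":
        return id_value
    op, left, right = tree
    l, r = evaluate(left, id_value), evaluate(right, id_value)
    return l + r if op == "+" else l * r
```

With `id_value = 2` the first tree evaluates to 6 and the second to 8, so the same token string would mean two different things.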
Ambiguity
A grammar is ambiguous if it has more than one parse tree for some string
Ambiguity leaves the meaning of some programs ill-defined
Ambiguity is common in programming languages
Dealing with Ambiguity
There are several ways to handle ambiguity
The most direct method is to rewrite the grammar unambiguously
Example: enforcing precedence in the previous grammar \[\begin{aligned} E & \rightarrow T + E\\ & \quad \vert \; T\\ T & \rightarrow id * T\\ & \quad \vert \; id\\ & \quad \vert \; (E)\\ \end{aligned}\]
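A recursive-descent sketch of this rewritten grammar (function and token names are ours) shows how the new rules enforce precedence: the only tree the parser can build for `id * id + id` has `+` at the root, with `id * id` grouped below it.

```python
def parse(tokens):
    """Parse  E -> T + E | T ,  T -> id * T | id | ( E ).
    Returns the parse tree as nested (op, left, right) tuples."""
    tree, pos = parse_e(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree

def parse_e(tokens, pos):
    left, pos = parse_t(tokens, pos)
    if pos < len(tokens) and tokens[pos] == "+":     # E -> T + E
        right, pos = parse_e(tokens, pos + 1)
        return ("+", left, right), pos
    return left, pos                                  # E -> T

def parse_t(tokens, pos):
    if tokens[pos] == "(":                            # T -> ( E )
        inner, pos = parse_e(tokens, pos + 1)
        if tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return inner, pos + 1
    if tokens[pos] == "id":
        if pos + 1 < len(tokens) and tokens[pos + 1] == "*":
            rest, pos2 = parse_t(tokens, pos + 2)     # T -> id * T
            return ("*", "id", rest), pos2
        return "id", pos + 1                          # T -> id
    raise SyntaxError("expected id or '('")
```

`parse(["id", "*", "id", "+", "id"])` yields `("+", ("*", "id", "id"), "id")`: the grammar itself forces `*` to bind tighter than `+`.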
Ambiguity: The Dangling Else
Consider the following grammar \[\begin{aligned} S & \rightarrow if \; C \; then \; S\\ & \quad \vert \; if \; C \; then \; S \; else \; S\\ & \quad \vert \; OTHER\\ \end{aligned}\]
This grammar is ambiguous: the statement “\(if \; C_1 \; then \; if \; C_2 \; then \; S_3 \; else \;S_4\)” has two parse trees
The Dangling Else: a Fix
We want “else” to match the closest unmatched “then”
We can describe this in the grammar \[\begin{aligned} S & \rightarrow MIF\\ & \quad \vert \; UIF\\ MIF & \rightarrow if \; C \; then \; MIF \; else \; MIF\\ & \quad \vert \; OTHER\\ UIF & \rightarrow if \; C \; then \; S\\ & \quad \vert \; if \; C \; then \; MIF \; else \; UIF\\ \end{aligned}\]
Ambiguity
No general techniques for handling ambiguity
Impossible to automatically convert an ambiguous grammar to an unambiguous one
Used with care, ambiguity can simplify the grammar
Sometimes allows more natural definitions
but, we need disambiguation mechanisms
Precedence and Associativity Declarations
Instead of rewriting the grammar
use the more natural (ambiguous) grammar
along with disambiguating declarations
Most tools allow precedence and associativity declarations to disambiguate grammars
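One way such declarations are realized is precedence climbing: the parser keeps the ambiguous rule \(E \rightarrow E \; op \; E \;\vert\; id\) and consults a table of precedence levels and associativities, analogous in spirit to `%left` declarations in yacc-style tools. The table and names below are illustrative assumptions, not any particular tool's API:

```python
# Higher number = binds tighter; both operators declared left-associative.
PRECEDENCE = {"+": (1, "left"), "*": (2, "left")}

def parse_expr(tokens, pos=0, min_prec=1):
    """Precedence climbing over the ambiguous grammar E -> E op E | id."""
    left = tokens[pos]                      # an id token
    pos += 1
    while pos < len(tokens) and tokens[pos] in PRECEDENCE:
        op = tokens[pos]
        prec, assoc = PRECEDENCE[op]
        if prec < min_prec:
            break                           # operator belongs to a caller
        # left-associative: the recursive call may only take tighter operators
        next_min = prec + 1 if assoc == "left" else prec
        right, pos = parse_expr(tokens, pos + 1, next_min)
        left = (op, left, right)
    return left, pos
```

With `*` declared at a higher level than `+`, `parse_expr(["id", "*", "id", "+", "id"])` builds `("+", ("*", "id", "id"), "id")` — the same disambiguation as the grammar rewrite, but without touching the grammar.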
Error Handling
The purpose of the compiler is to
detect invalid programs
translate valid programs
Many kinds of possible errors
Error Kind   | Detected by
Lexical      | Lexer
Syntax       | Parser
Semantic     | Type Checker
Correctness  | Tester / User
Syntax Error Handling
Error handler should
report errors accurately and clearly
recover from an error quickly
not slow down the compilation of valid programs
Good error handling is typically difficult to achieve
Approaches to Syntax Error Recovery
From simple to complex
panic mode
error productions
automatic local or global correction
Not all are supported by all parser generator tools
Syntax Error Recovery: Panic Mode
Simplest, most popular method
When an error is detected:
discard tokens until one with a clear role is found
continue from there
Such tokens are called synchronizing tokens and are typically the statement or expression terminators
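A minimal sketch of panic-mode recovery (illustrative, not taken from any particular compiler): the driver parses statements one at a time and, on an error, discards tokens up to and including the next synchronizing token, so that one bad statement does not hide errors in the statements after it.

```python
SYNC_TOKENS = {";"}   # assumed statement terminator

def parse_program(tokens, parse_stmt):
    """`parse_stmt(tokens, pos)` returns the next position on success
    or raises SyntaxError. Returns the list of (position, message) errors."""
    errors, pos = [], 0
    while pos < len(tokens):
        try:
            pos = parse_stmt(tokens, pos)
        except SyntaxError as e:
            errors.append((pos, str(e)))
            # panic mode: discard tokens until a synchronizing token,
            # then continue parsing just after it
            while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
                pos += 1
            pos += 1
    return errors

def toy_stmt(tokens, pos):
    """Toy statement form for demonstration:  id = id ;"""
    if tokens[pos:pos + 4] == ["id", "=", "id", ";"]:
        return pos + 4
    raise SyntaxError("expected 'id = id ;'")
```

Given the token stream `id = id ; id id ; id = id ;`, the driver reports one error for the malformed middle statement and still parses the third statement normally.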
Syntax Error Recovery: Error Productions
Idea: specify known common mistakes in the grammar
Essentially promotes common errors to alternative syntax
Example
Common mistake: write “5 x” instead of “5 * x”
Fix: add the production “\(E \rightarrow \ldots \; \vert \; E E\)”
Disadvantage: this complicates the grammar
Syntax Error Recovery: Past and Present
Past
Slow recompilation cycle (even once a day)
Find as many errors in one cycle as possible
Researchers could not let go of the topic
Present
Quick recompilation cycle
Users tend to correct one error per cycle
Complex error recovery is needed less
Panic-mode seems good enough in practice