Regular languages revisited
Parser overview
Context-free grammars (CFGs)
Derivations
Ambiguity
Syntax errors
Formal languages are important in computer science, especially in programming languages.
Regular languages are the weakest formal languages that are widely used
We also need to study context-free languages
Intuition: A finite automaton that runs long enough must repeat states
A finite automaton cannot remember the number of times it has visited a particular state
A finite automaton has finite memory, so:
it can only store which state it is currently in, and
cannot count, except up to a finite limit.
Example: the language of balanced parentheses \(\{(^{i} )^{i} \; \vert \; i \geq 0 \}\) is not regular
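Recognizing this language requires counting opening parentheses without bound, which a finite automaton cannot do. A minimal sketch of such a recognizer (the function and variable names are mine, not from the notes):

```python
def balanced(s: str) -> bool:
    """Check membership in {(^i )^i | i >= 0} using an unbounded counter.

    A DFA cannot do this: `depth` can grow without bound,
    but a DFA has only finitely many states.
    """
    depth = 0
    seen_close = False
    for ch in s:
        if ch == "(":
            if seen_close:       # a '(' after a ')' breaks the (^i )^i shape
                return False
            depth += 1
        elif ch == ")":
            seen_close = True
            depth -= 1
            if depth < 0:        # more ')' than '(' so far
                return False
        else:
            return False
    return depth == 0
```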
The parsing phase of a compiler can be thought of as a function:
Input: sequence of tokens from the lexer
Output: parse tree of the program
Not all sequences of tokens are programs, so a parser must distinguish between valid and invalid sequences of tokens
So, we need
a language for describing valid sequences of tokens, and
a method for distinguishing valid from invalid sequences of tokens.
Many programming language constructs have a recursive structure
Example: a statement is of the form
if condition then statement else statement, or
while condition do statement, or
\(\ldots\)
Context-free grammars (CFGs) are a natural notation for this recursive structure
A context-free grammar consists of
A set of terminals \(T\)
A set of non-terminals \(N\)
A non-terminal start symbol \(S\)
A set of productions
Assuming that \(X \in N\), productions are of the form
\(X \rightarrow \epsilon\), or
\(X \rightarrow Y_1 Y_2 \ldots Y_n\) where \(Y_i \in N \cup T\)
In these lecture notes
Non-terminals are written in uppercase
Terminals are written in lowercase
The start symbol is the left-hand side of the first production
A fragment of a simple language \[\begin{aligned} STMT & \rightarrow if \; COND \; then \; STMT \; else \; STMT\\ STMT & \rightarrow while \; COND \; do \; STMT\\ STMT & \rightarrow \; id \; = \; int \end{aligned}\]
Notational abbreviation \[\begin{aligned} STMT & \rightarrow if \; COND \; then \; STMT \; else \; STMT\\ & \quad \vert \; while \; COND \; do \; STMT\\ & \quad \vert \; id \; = \; int \end{aligned}\]
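As an illustration, the four components of a CFG and this fragment can be written down as plain data (the encoding below is an assumption, not part of the notes):

```python
# A CFG as plain data: terminals, non-terminals, start symbol, productions.
# Each production's right-hand side is a list of symbols.
cfg = {
    "terminals": {"if", "then", "else", "while", "do", "id", "=", "int"},
    "nonterminals": {"STMT", "COND"},
    "start": "STMT",
    "productions": {
        "STMT": [
            ["if", "COND", "then", "STMT", "else", "STMT"],
            ["while", "COND", "do", "STMT"],
            ["id", "=", "int"],
        ],
        # COND's productions are omitted, as in the fragment above
    },
}
```

Every right-hand-side symbol is either a terminal or a non-terminal, matching the production form \(X \rightarrow Y_1 \ldots Y_n\) with \(Y_i \in N \cup T\).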
Productions can be read as replacement rules
\(X \rightarrow Y_1 \ldots Y_n\) means that \(X\) can be replaced by \(Y_1 \ldots Y_n\)
\(X \rightarrow \epsilon\) means that \(X\) can be erased (replaced with the empty string)
Begin with a string consisting of the start symbol \(S\)
Replace any non-terminal \(X\) in the string by a right-hand side of some production \(X \rightarrow Y_1 \ldots Y_n\)
Repeat step 2 until there are no non-terminals in the string
A terminal has no rules for replacing it, hence the name terminal
Once a terminal is generated, it is permanent
Terminals ought to be the tokens of the language
Strings of balanced parentheses \(\{(^{i} )^{i} \; \vert \; i \geq 0 \}\)
Grammar \[\begin{aligned} S & \rightarrow (S)\\ & \quad \vert \; \epsilon \end{aligned}\]
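The replacement procedure above can be run mechanically on this grammar to generate elements of the language; a sketch, assuming a dict encoding of productions (not from the notes):

```python
import random

# S -> (S) | epsilon; uppercase keys are non-terminals
GRAMMAR = {"S": [["(", "S", ")"], []]}

def derive(grammar, start="S", rng=random):
    """Start from the start symbol and keep replacing the left-most
    non-terminal by a randomly chosen right-hand side until only
    terminals remain."""
    string = [start]
    while any(sym in grammar for sym in string):
        i = next(i for i, sym in enumerate(string) if sym in grammar)
        rhs = rng.choice(grammar[string[i]])
        string[i:i + 1] = rhs
    return "".join(string)
```

Each run yields a string of the form \((^{i} )^{i}\) for some \(i \geq 0\).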
Some elements of the language
id = int
if (id == id) then id = int else id = int
while (id != id) do id = int
while (id == id) do while (id != id) do id = int
Simple arithmetic expressions: \[E \rightarrow E + E \; \vert \; E * E \; \vert \; (E) \; \vert \; id\]
Some elements of the language
id
(id)
(id) * id
id + id
The idea of a CFG is a big step
But,
Membership in a language is boolean; we also need the parse tree of the input
Must handle errors gracefully
Need an implementation of CFGs
Form of the grammar is important
Many grammars generate the same language
Parsing tools are sensitive to the grammar
A derivation is a sequence of productions \[S \rightarrow \ldots \rightarrow \ldots \rightarrow \ldots\]
A derivation can be depicted as a tree
The start symbol is the tree’s root
For a production \(X \rightarrow Y_1 \ldots Y_n\) add children \(Y_1 \ldots Y_n\) to node \(X\)
Simple arithmetic expressions: \[E \rightarrow E + E \; \vert \; E * E \; \vert \; (E) \; \vert \; id\]
String \[id * id + id\]
\[\begin{aligned} & E\\ \rightarrow & E + E\\ \rightarrow & E * E + E\\ \rightarrow & id * E + E\\ \rightarrow & id * id + E\\ \rightarrow & id * id + id \end{aligned}\]
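This left-most derivation can be replayed mechanically by always expanding the left-most non-terminal; a sketch with the production choices hard-coded (the encoding is mine, not from the notes):

```python
# The ambiguous expression grammar E -> E + E | E * E | (E) | id
PRODS = {
    "E": [["E", "+", "E"], ["E", "*", "E"], ["(", "E", ")"], ["id"]],
}

def leftmost_derivation(choices):
    """Expand the left-most non-terminal at each step, picking the
    production given by the next index in `choices`."""
    string = ["E"]
    steps = [" ".join(string)]
    for c in choices:
        i = next(i for i, s in enumerate(string) if s in PRODS)
        string[i:i + 1] = PRODS[string[i]][c]
        steps.append(" ".join(string))
    return steps
```

The choices `[0, 1, 3, 3, 3]` reproduce the derivation above, ending in `id * id + id`.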
A parse tree has:
terminals at the leaves, and
non-terminals at the interior nodes
An in-order traversal of the leaves yields the original input
The parse tree shows the association of the operations, the input string does not
The previous example was a left-most derivation
There is an equivalent notion of a right-most derivation
\[\begin{aligned} & E\\ \rightarrow & E + E\\ \rightarrow & E + id\\ \rightarrow & E * E + id\\ \rightarrow & E * id + id\\ \rightarrow & id * id + id \end{aligned}\]
Note that right-most and left-most derivations have the same parse tree
The difference is the order in which branches are added
We are not only interested in whether a string \(s \in L(G)\), we also need a parse tree for \(s\)
A derivation defines a parse tree, but one parse tree may have many derivations
Left-most and right-most derivations are important in the parser implementation
Grammar \[E \rightarrow E + E \; \vert \; E * E \; \vert \; (E) \; \vert \; id\]
The string \(id * id + id\) has two parse trees: one groups the multiplication first, the other groups the addition first
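The two parse trees can be written as nested tuples (an illustrative encoding, not from the notes); note that the in-order leaf traversal of both is the same input string, so the string alone does not determine the association:

```python
tree1 = ("+", ("*", "id", "id"), "id")   # (id * id) + id
tree2 = ("*", "id", ("+", "id", "id"))   # id * (id + id)

def leaves(t):
    """In-order traversal of a tuple-encoded tree, recovering the input."""
    if isinstance(t, str):
        return [t]
    op, left, right = t
    return leaves(left) + [op] + leaves(right)
```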
A grammar is ambiguous if it has more than one parse tree for some string
Ambiguity leaves the meaning of some programs ill-defined
Ambiguity is common in programming languages
There are several ways to handle ambiguity
The most direct method is to rewrite the grammar unambiguously
Example: enforcing precedence in the previous grammar \[\begin{aligned} E & \rightarrow T + E\\ & \quad \vert \; T\\ T & \rightarrow id * T\\ & \quad \vert \; id\\ & \quad \vert \; (E)\\ \end{aligned}\]
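One way to see that this rewritten grammar makes \(*\) bind tighter than \(+\) is a small recursive-descent sketch of it (an illustration; the notes do not prescribe an implementation):

```python
# Tokens: "id", "+", "*", "(", ")".  Each function returns (tree, next index).

def parse_E(tokens, i=0):
    """E -> T + E | T."""
    left, i = parse_T(tokens, i)
    if i < len(tokens) and tokens[i] == "+":
        right, i = parse_E(tokens, i + 1)
        return ("+", left, right), i
    return left, i

def parse_T(tokens, i):
    """T -> id * T | id | (E)."""
    if tokens[i] == "(":
        tree, i = parse_E(tokens, i + 1)
        assert tokens[i] == ")", "expected ')'"
        return tree, i + 1
    assert tokens[i] == "id", "expected 'id'"
    left = "id"
    i += 1
    if i < len(tokens) and tokens[i] == "*":
        right, i = parse_T(tokens, i + 1)
        return ("*", left, right), i
    return left, i
```

On `id * id + id` this yields the tree for \((id * id) + id\). Note that, exactly as in the grammar, a parenthesized expression cannot be the left operand of \(*\).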
Consider the following grammar \[\begin{aligned} S & \rightarrow if \; C \; then \; S\\ & \quad \vert \; if \; C \; then \; S \; else \; S\\ & \quad \vert \; OTHER\\ \end{aligned}\]
This grammar is ambiguous: the expression “\(if \; C_1 \; then \; if \; C_2 \; then \; S_3 \; else \;S_4\)” has two parse trees
We want “else” to match the closest unmatched “then”
We can describe this in the grammar \[\begin{aligned} S & \rightarrow MIF\\ & \quad \vert \; UIF\\ MIF & \rightarrow if \; C \; then \; MIF \; else \; MIF\\ & \quad \vert \; OTHER\\ UIF & \rightarrow if \; C \; then \; S\\ & \quad \vert \; if \; C \; then \; MIF \; else \; UIF\\ \end{aligned}\]
No general techniques for handling ambiguity
Impossible to automatically convert an ambiguous grammar to an unambiguous one
Used with care, ambiguity can simplify the grammar
Sometimes allows more natural definitions
but, we need disambiguation mechanisms
Instead of rewriting the grammar
use the more natural (ambiguous) grammar
along with disambiguating declarations
Most tools allow precedence and associativity declarations to disambiguate grammars
The purpose of the compiler is to
detect invalid programs
translate valid programs
Many kinds of possible errors
Error Kind | Detected by
---|---
Lexical | Lexer
Syntax | Parser
Semantic | Type Checker
Correctness | Tester/User
Error handler should
report errors accurately and clearly
recover from an error quickly
not slow down the compilation of valid programs
Good error handling is typically difficult to achieve
From simple to complex
panic mode
error productions
automatic local or global correction
Not all are supported by all parser generator tools
Simplest, most popular method
When an error is detected:
discard tokens until one with a clear role is found
continue from there
Such tokens are called synchronizing tokens and are typically the statement or expression terminators
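A minimal sketch of the discard step, assuming hypothetical synchronizing tokens `";"` and `"}"` (statement terminators; the token set is an assumption):

```python
SYNC = {";", "}"}   # hypothetical synchronizing tokens

def skip_to_sync(tokens, i):
    """Panic-mode recovery: discard tokens until a synchronizing
    token is found, then resume parsing just after it."""
    while i < len(tokens) and tokens[i] not in SYNC:
        i += 1
    return i + 1 if i < len(tokens) else i
```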
Idea: specify known common mistakes in the grammar
Essentially promotes common errors to alternative syntax
Example
Common mistake: write “5 x” instead of “5 * x”
Fix: add the production “\(E \rightarrow \ldots \; \vert \; E E\)”
Disadvantage: this complicates the grammar
Past
Slow recompilation cycle (sometimes only once per day)
Find as many errors in one cycle as possible
Researchers could not let go of the topic
Present
Quick recompilation cycle
Users tend to correct one error per cycle
Complex error recovery is needed less
Panic-mode seems good enough in practice