Top-Down Parsing
Outline
Implementation of parsers
Two main approaches:
Top-down
Bottom-up
This lecture: Top-Down
- Easier to understand and program manually
Next time: Bottom-Up
- More powerful and used by most parser generators
Introduction to Top-Down Parsing
Terminals are seen in order of appearance in the token stream: \(t_2, t_5, t_6, t_8, t_9\)
The parse tree is constructed: from top to bottom and from left to right
Recursive Descent Parsing
Consider the grammar \[\begin{aligned} E & \rightarrow T + E \; \vert \; T\\ T & \rightarrow int \; \vert \; int * T \; \vert \; (E)\\ \end{aligned}\]
and token stream: \(int(5) * int(2)\)
Start with the top-level non-terminal \(E\)
Try the rules for \(E\) in order
Recursive Descent Parsing
Try \(E_0 \rightarrow T_1 + E_2\)
Try \(T_1 \rightarrow (E_3)\)
- The left parenthesis does not match the token \(int(5)\)
Try \(T_1 \rightarrow int\)
- Matches, but \(+\) after \(T_1\) does not match \(*\)
Try \(T_1 \rightarrow int * T_2\)
Matches and consumes two tokens
Try \(T_2 \rightarrow int\): matches, but \(+\) after \(T_1\) does not match end-of-input
Try \(T_2 \rightarrow int * T_3\): \(int\) matches, but \(*\) does not match end-of-input
Has exhausted the choices for \(T_2\)
- Backtrack to choice for \(E_0\)
Recursive Descent Parsing
Try \(E_0 \rightarrow T_1\)
Follow same steps as before for \(T_1\)
and succeed with \(T_1 \rightarrow int(5) * T_2\) and \(T_2 \rightarrow int(2)\)
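with the following parse tree: \(E_0\) has the single child \(T_1\); \(T_1\) has the children \(int(5)\), \(*\), and \(T_2\); \(T_2\) has the single child \(int(2)\)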
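The backtracking search traced above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function names (`parse_E`, `parse_T`, `accepts`) and the representation of tokens as plain strings are assumptions made here. Each function is a generator that yields every input position at which the non-terminal could end, so backtracking happens implicitly when the caller asks for the next alternative.

```python
def parse_E(toks, i):
    """Yield every position where an E starting at toks[i] can end."""
    # E -> T + E
    for j in parse_T(toks, i):
        if j < len(toks) and toks[j] == "+":
            yield from parse_E(toks, j + 1)
    # E -> T
    yield from parse_T(toks, i)


def parse_T(toks, i):
    """Yield every position where a T starting at toks[i] can end."""
    # T -> int
    if i < len(toks) and toks[i] == "int":
        yield i + 1
    # T -> int * T
    if i + 1 < len(toks) and toks[i] == "int" and toks[i + 1] == "*":
        yield from parse_T(toks, i + 2)
    # T -> ( E )
    if i < len(toks) and toks[i] == "(":
        for j in parse_E(toks, i + 1):
            if j < len(toks) and toks[j] == ")":
                yield j + 1


def accepts(toks):
    """True if some sequence of choices consumes the whole token list."""
    return any(j == len(toks) for j in parse_E(toks, 0))
```

For the token stream of the example, `accepts(["int", "*", "int"])` tries \(E \rightarrow T + E\) first, fails to find a `+`, and then succeeds through \(E \rightarrow T\).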
Recursive Descent Parsing: Notes
Easy to implement by hand
Somewhat inefficient due to backtracking
Does not always work \(\ldots\)
When Recursive Descent Does Not Work
Consider a production \(S \rightarrow S a\)
And the following pseudo-code implementation
bool S1() { return S() && term(a); }
bool S()  { return S1(); }
The function call \(S()\) recurses forever: \(S\) calls \(S1\), which calls \(S\) again without consuming any input
A left-recursive grammar has a non-terminal \(S\) with a derivation \(S \overset{+}{\rightarrow} S \alpha\) for some \(\alpha\)
Recursive descent does not work in such cases
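The failure can be observed directly. The sketch below transcribes the production \(S \rightarrow S a\) into Python; the helper names `term` and `hits_recursion_limit` and the position-passing style are assumptions made for this illustration. Python has no unbounded recursion, so the infinite descent shows up as a `RecursionError`.

```python
def term(tokens, pos, expected):
    """True if the token at pos is the expected terminal."""
    return pos < len(tokens) and tokens[pos] == expected


def S(tokens, pos):
    # S -> S a: the very first step calls S again at the same position,
    # so no input is ever consumed before recursing.
    end = S(tokens, pos)
    if end is not None and term(tokens, end, "a"):
        return end + 1
    return None


def hits_recursion_limit(tokens):
    """Run the left-recursive parser and report whether it blew the stack."""
    try:
        S(tokens, 0)
    except RecursionError:
        return True
    return False
```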
Elimination of Left Recursion
Consider the left-recursive grammar \[S \rightarrow S \alpha \; \vert \; \beta\]
\(S\) generates all strings starting with a \(\beta\) and followed by any number of \(\alpha\)s
This grammar can be rewritten using right-recursion \[\begin{aligned} S & \rightarrow \beta S'\\ S' & \rightarrow \alpha S' \; \vert \; \epsilon\\ \end{aligned}\]
Elimination of Left-Recursion
In general \[S \rightarrow S \alpha_1 \; \vert \; \ldots \; \vert \; S \alpha_n \; \vert \; \beta_1 \; \vert \; \ldots \; \vert \; \beta_m\]
All strings derived from \(S\) start with one of \(\beta_1, \ldots, \beta_m\) and continue with several instances of \(\alpha_1, \ldots, \alpha_n\)
Rewrite as \[\begin{aligned} S & \rightarrow \beta_1 S' \; \vert \; \ldots \; \vert \; \beta_m S'\\ S' & \rightarrow \alpha_1 S' \; \vert \; \ldots \; \vert \; \alpha_n S' \; \vert \; \epsilon\\ \end{aligned}\]
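The rewrite rule for immediate left recursion is mechanical, as the following sketch suggests. The function name and the grammar encoding (each right-hand side is a list of symbols, with the empty list standing for \(\epsilon\)) are assumptions chosen here. The sketch also assumes every \(\alpha_i\) is non-empty and does not handle general (indirect) left recursion.

```python
def eliminate_immediate_left_recursion(nt, productions):
    """Rewrite S -> S a1 | ... | S an | b1 | ... | bm without left recursion.

    productions: list of right-hand sides, each a list of symbols.
    Returns a dict mapping non-terminals to their new right-hand sides.
    """
    new_nt = nt + "'"
    # alphas: the tails of the left-recursive productions S -> S alpha
    alphas = [rhs[1:] for rhs in productions if rhs and rhs[0] == nt]
    # betas: the productions that do not start with S
    betas = [rhs for rhs in productions if not rhs or rhs[0] != nt]
    if not alphas:
        return {nt: productions}  # nothing to do
    return {
        nt: [beta + [new_nt] for beta in betas],
        new_nt: [alpha + [new_nt] for alpha in alphas] + [[]],  # [] is epsilon
    }
```

For example, calling it on `"S"` with `[["S", "a"], ["b"]]` yields \(S \rightarrow b \; S'\) and \(S' \rightarrow a \; S' \; \vert \; \epsilon\).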
General Left Recursion
The grammar \[\begin{aligned} S & \rightarrow A \alpha \; \vert \; \delta\\ A & \rightarrow S \beta\\ \end{aligned}\]
is also left-recursive because \[S \overset{+}{\rightarrow} S \beta \alpha\]
This left-recursion can also be eliminated (see a compilers text for a general algorithm)
Summary of Recursive Descent
Simple and general parsing strategy
left-recursion must be eliminated first
\(\ldots\) but that can be done automatically
Unpopular because of backtracking (thought to be too inefficient)
In practice, backtracking is eliminated by restricting the grammar
Predictive Parsers
Like recursive descent, but the parser can “predict” which production to use by looking at the next few tokens and does not need to backtrack
Predictive parsers accept \(LL(k)\) grammars
The first \(L\) means left-to-right scan of input
The second \(L\) means leftmost derivation
\(k\) means predict based on \(k\) tokens of lookahead
In practice, \(LL(1)\) is used
\(LL(1)\) Languages
In recursive descent, there may be multiple production choices for each non-terminal and input token
\(LL(1)\) means that there is at most one production for each non-terminal and input token
Can be specified via 2D tables
one dimension for the current non-terminal to expand
one dimension for the next token
a table entry contains one production
Predictive Parsing and Left Factoring
Recall the grammar for arithmetic expressions \[\begin{aligned} E & \rightarrow T + E \; \vert \; T\\ T & \rightarrow (E) \; \vert \; int \; \vert \; int * T\\ \end{aligned}\]
Difficult to predict because:
For \(T\) two productions start with \(int\)
For \(E\) it is not clear how to predict
A grammar must be left-factored before it is used for predictive parsing
Left-Factoring Example
Recall the grammar for arithmetic expressions \[\begin{aligned} E & \rightarrow T + E \; \vert \; T\\ T & \rightarrow (E) \; \vert \; int \; \vert \; int * T\\ \end{aligned}\]
Factor out common prefixes of productions \[\begin{aligned} E & \rightarrow T \; X\\ X & \rightarrow + E \; \vert \; \epsilon\\ T & \rightarrow (E) \; \vert \; int \; Y\\ Y & \rightarrow * T \; \vert \; \epsilon \end{aligned}\]
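With the left-factored grammar, one token of lookahead is enough to pick a production, so a recursive descent parser needs no backtracking. The sketch below is one way to write it; the names `parse`, `peek`, `eat`, and `ParseError` are assumptions for this illustration, and tokens are again plain strings with `"$"` appended as an end marker.

```python
class ParseError(Exception):
    pass


def parse(tokens):
    """Predictive recursive descent for E -> T X, X -> + E | eps,
    T -> ( E ) | int Y, Y -> * T | eps. Returns True or raises ParseError."""
    toks = list(tokens) + ["$"]
    pos = 0

    def peek():
        return toks[pos]

    def eat(expected):
        nonlocal pos
        if toks[pos] != expected:
            raise ParseError(f"expected {expected!r}, found {toks[pos]!r}")
        pos += 1

    def E():  # E -> T X
        T()
        X()

    def X():  # X -> + E | epsilon
        if peek() == "+":
            eat("+")
            E()
        # any other lookahead: choose epsilon and consume nothing

    def T():  # T -> ( E ) | int Y
        if peek() == "(":
            eat("(")
            E()
            eat(")")
        else:
            eat("int")
            Y()

    def Y():  # Y -> * T | epsilon
        if peek() == "*":
            eat("*")
            T()

    E()
    eat("$")  # the whole input must have been consumed
    return True
```

Each function inspects only `peek()` before committing to a production, which is exactly the one-token prediction described above.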
\(LL(1)\) Parsing Table Example
Left-factored grammar \[\begin{aligned} E & \rightarrow T \; X\\ X & \rightarrow + E \; \vert \; \epsilon\\ T & \rightarrow (E) \; \vert \; int \; Y\\ Y & \rightarrow * T \; \vert \; \epsilon \end{aligned}\]
The \(LL(1)\) parsing table:
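Non-terminal | \(int\) | \(*\) | \(+\) | \((\) | \()\) | \(\$\) |
---|---|---|---|---|---|---|
\(E\) | \(T \; X\) | | | \(T \; X\) | | |
\(X\) | | | \(+E\) | | \(\epsilon\) | \(\epsilon\) |
\(T\) | \(int \; Y\) | | | \((E)\) | | |
\(Y\) | | \(*T\) | \(\epsilon\) | | \(\epsilon\) | \(\epsilon\) |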
\(LL(1)\) Parsing Table Example
Consider the \([E, int]\) entry
If the current non-terminal is \(E\) and the next input is \(int\), then use production \(E \rightarrow T \; X\)
This production can generate an \(int\) in the first position
Consider the \([Y, +]\) entry
If the current non-terminal is \(Y\) and the next input is \(+\), then eliminate \(Y\) by using \(Y \rightarrow \epsilon\)
\(Y\) can be followed by \(+\) only in a derivation in which \(Y \rightarrow \epsilon\)
Blank entries indicate error situations
Consider the \([E,*]\) entry
There is no way to derive a string starting with \(*\) from non-terminal \(E\)
Using Parsing Tables
Method similar to recursive descent, except
For each non-terminal \(S\)
we look at the next token \(a\)
and choose the production shown at \([S,a]\)
We use a stack to keep track of pending non-terminals
We reject when we encounter an error state
We accept when the stack has been emptied and we have reached end-of-input
\(LL(1)\) Parsing Algorithm
initialize stack = <S, $> and next (a pointer to the first input token)
repeat
case stack of
<X, rest> : if T[X, *next] = Y1 ... Yn
then stack := <Y1 ... Yn rest>
else error()
<t, rest> : if t == *next++
then stack := <rest>
else error()
until stack == <>
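The pseudo-code above translates almost line for line into Python. In this sketch the table is hard-coded from the \(LL(1)\) parsing table example, keyed by (non-terminal, token) pairs; the names `TABLE`, `NONTERMINALS`, and `ll1_parse` are assumptions for this illustration, and an empty list encodes \(\epsilon\).

```python
# Parsing table for the left-factored expression grammar.
# Missing keys are the blank (error) entries; [] is an epsilon production.
TABLE = {
    ("E", "int"): ["T", "X"], ("E", "("): ["T", "X"],
    ("X", "+"): ["+", "E"], ("X", ")"): [], ("X", "$"): [],
    ("T", "int"): ["int", "Y"], ("T", "("): ["(", "E", ")"],
    ("Y", "*"): ["*", "T"], ("Y", "+"): [], ("Y", ")"): [], ("Y", "$"): [],
}
NONTERMINALS = {"E", "X", "T", "Y"}


def ll1_parse(tokens, start="E"):
    """Return True if tokens are accepted, False on any error entry or mismatch."""
    inp = list(tokens) + ["$"]
    stack = ["$", start]  # the top of the stack is the end of the list
    nxt = 0               # index of the next unread input token
    while stack:
        top = stack.pop()
        if top in NONTERMINALS:
            rhs = TABLE.get((top, inp[nxt]))
            if rhs is None:
                return False             # blank table entry
            stack.extend(reversed(rhs))  # push Y1 ... Yn with Y1 on top
        else:
            if top != inp[nxt]:
                return False             # terminal mismatch
            nxt += 1
    return nxt == len(inp)
```

Pushing the right-hand side in reverse keeps the leftmost symbol on top of the stack, which is what makes the parse a leftmost derivation.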
\(LL(1)\) Parsing Example
Stack | Input | Action |
---|---|---|
\(E \; \$\) | \(int * int \; \$\) | \(T \; X\) |
\(T \; X \; \$\) | \(int * int \; \$\) | \(int \; Y\) |
\(int \; Y \; \; X \; \$\) | \(int * int \; \$\) | terminal |
\(Y \; \; X \; \$\) | \(* int \; \$\) | \(* \; T\) |
\(* \; T \; X \; \$\) | \(* int \; \$\) | terminal |
\(T \; X \; \$\) | \(int \; \$\) | \(int \; Y\) |
\(int \; Y \; X \; \$\) | \(int \; \$\) | terminal |
\(Y \; X \; \$\) | \(\$\) | \(\epsilon\) |
\(X \; \$\) | \(\$\) | \(\epsilon\) |
\(\$\) | \(\$\) | ACCEPT |
Constructing Parsing Tables
\(LL(1)\) languages are those defined by a parsing table for the \(LL(1)\) algorithm
No table entry can be multiply defined
We want to generate parsing tables from context-free grammars
Constructing Parsing Tables
If \(A \rightarrow \alpha\), where in the row of \(A\) do we place \(\alpha\)?
In the column of \(t\) where \(t\) can start a string derived from \(\alpha\)
\(\alpha \overset{*}{\rightarrow} t \; \beta\)
we say that \(t \in First(\alpha)\)
In the column of \(t\) if \(\alpha\) is \(\epsilon\) (or \(\alpha \overset{*}{\rightarrow} \epsilon\)) and \(t\) can follow an \(A\)
\(S \overset{*}{\rightarrow} \beta \; A \; t \; \delta\)
we say \(t \in Follow(A)\)
Computing First Sets
Definition: \(First(X) = \{ t \; \vert \; X \overset{*}{\rightarrow} t \alpha \} \cup \{ \epsilon \; \vert \; X \overset{*}{\rightarrow} \epsilon \}\)
Algorithm sketch
\(First(t) = \{ t \}\)
\(\epsilon \in First(X) \; \text{if} \; X \rightarrow \epsilon\) is a production
\(\epsilon \in First(X) \; \text{if} \; X \rightarrow A_1 \ldots A_n \; \text{and} \; \epsilon \in First(A_i)\) for each \(1 \leq i \leq n\)
\(First(\alpha) \subseteq First(X) \; \text{if} \; X \rightarrow A_1 \ldots A_n \alpha \; \text{and} \; \epsilon \in First(A_i)\) for each \(1 \leq i \leq n\)
First Sets Example
Recall the grammar \[\begin{aligned} E & \rightarrow T \; X\\ X & \rightarrow + E \; \vert \; \epsilon\\ T & \rightarrow (E) \; \vert \; int \; Y\\ Y & \rightarrow * T \; \vert \; \epsilon \end{aligned}\]
First sets
\(First(\;(\;) = \{\;(\;\}\)
\(First(\;)\;) = \{\;)\;\}\)
\(First(+) = \{+\}\)
\(First(*) = \{*\}\)
\(First(int) = \{int\}\)
\(First(T) = \{int,(\}\)
\(First(E) = \{int, (\}\)
\(First(X) = \{+,\epsilon\}\)
\(First(Y) = \{*, \epsilon\}\)
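The algorithm sketch for First sets can be run as a fixed-point iteration: keep applying the rules until no set grows. The encoding below (a dict from non-terminal to a list of right-hand sides, `[]` for \(\epsilon\), any symbol that is not a key treated as a terminal) and the names `first_of_string` and `compute_first` are assumptions made for this illustration.

```python
EPS = "ε"
GRAMMAR = {
    "E": [["T", "X"]],
    "X": [["+", "E"], []],
    "T": [["(", "E", ")"], ["int", "Y"]],
    "Y": [["*", "T"], []],
}


def first_of_string(symbols, first):
    """First set of a sequence of symbols, given the current First sets."""
    out = set()
    for sym in symbols:
        f = first[sym] if sym in first else {sym}  # First(t) = {t}
        out |= f - {EPS}
        if EPS not in f:
            return out  # this symbol cannot vanish, so stop here
    out.add(EPS)        # every symbol (or none) can derive epsilon
    return out


def compute_first(grammar):
    """Iterate the First-set rules until nothing changes."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, rhss in grammar.items():
            for rhs in rhss:
                new = first_of_string(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first
```

On `GRAMMAR`, `compute_first` reproduces the sets listed above for \(E\), \(X\), \(T\), and \(Y\).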
Computing Follow Sets
Definition: \(Follow(X) = \{ t \; \vert \; S \overset{*}{\rightarrow} \beta \; X \; t \; \delta \}\)
Intuition
If \(X \rightarrow A \; B\), then \(First(B) - \{\epsilon\} \subseteq Follow(A)\) and \(Follow(X) \subseteq Follow(B)\)
Also, if \(B \overset{*}{\rightarrow} \epsilon\), then \(Follow(X) \subseteq Follow(A)\)
If \(S\) is the start symbol, then \(\$ \in Follow(S)\)
Algorithm sketch
\(\$ \in Follow(S)\)
\(First(\beta) - \{\epsilon\} \subseteq Follow(X)\) for each production \(A \rightarrow \alpha \; X \; \beta\)
\(Follow(A) \subseteq Follow(X)\) for each production \(A \rightarrow \alpha \; X \; \beta\) where \(\epsilon \in First(\beta)\)
Follow Sets Example
Recall the grammar \[\begin{aligned} E & \rightarrow T \; X\\ X & \rightarrow + E \; \vert \; \epsilon\\ T & \rightarrow (E) \; \vert \; int \; Y\\ Y & \rightarrow * T \; \vert \; \epsilon \end{aligned}\]
Follow sets
\(Follow(\;(\;) = \{int, (\}\)
\(Follow(\;)\;) = \{+,),\$\}\)
\(Follow(+) = \{int,(\}\)
\(Follow(*) = \{int,(\}\)
\(Follow(int) = \{*,+,),\$\}\)
\(Follow(T) = \{+, ), \$\}\)
\(Follow(E) = \{),\$\}\)
\(Follow(X) = \{\$, )\}\)
\(Follow(Y) = \{+, ), \$\}\)
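The Follow-set rules can also be iterated to a fixed point. To keep the sketch short, the First sets of the non-terminals are hard-coded from the First sets example rather than recomputed, and only the Follow sets of non-terminals are tracked (the table construction needs nothing more). The names `FIRST`, `first_of_string`, and `compute_follow` are assumptions for this illustration.

```python
EPS = "ε"
GRAMMAR = {
    "E": [["T", "X"]],
    "X": [["+", "E"], []],
    "T": [["(", "E", ")"], ["int", "Y"]],
    "Y": [["*", "T"], []],
}
# First sets of the non-terminals, copied from the First sets example.
FIRST = {"E": {"int", "("}, "T": {"int", "("}, "X": {"+", EPS}, "Y": {"*", EPS}}


def first_of_string(symbols):
    """First set of a sequence of symbols; terminals have First(t) = {t}."""
    out = set()
    for sym in symbols:
        f = FIRST.get(sym, {sym})
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)
    return out


def compute_follow(grammar, start):
    """Iterate the Follow-set rules until nothing changes."""
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for a, rhss in grammar.items():
            for rhs in rhss:                     # production A -> alpha X beta
                for i, x in enumerate(rhs):
                    if x not in grammar:
                        continue                 # skip terminals
                    fb = first_of_string(rhs[i + 1:])
                    new = fb - {EPS}             # First(beta) - {eps}
                    if EPS in fb:
                        new |= follow[a]         # beta can vanish
                    if not new <= follow[x]:
                        follow[x] |= new
                        changed = True
    return follow
```

Calling `compute_follow(GRAMMAR, "E")` gives the same Follow sets for \(E\), \(X\), \(T\), and \(Y\) as the example.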
Constructing \(LL(1)\) Parsing Tables
Construct a parsing table \(T\) for context-free grammar \(G\)
For each production \(A \rightarrow \alpha\) in \(G\) do:
For each terminal \(t \in First(\alpha)\) do \(T[A,t] = \alpha\)
If \(\epsilon \in First(\alpha)\), then for each \(t \in Follow(A)\) do \(T[A,t] = \alpha\)
If \(\epsilon \in First(\alpha)\) and \(\$ \in Follow(A)\) do \(T[A,\$] = \alpha\)
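The construction can be sketched as follows, with the First and Follow sets of the running example hard-coded from the earlier slides. The names `build_table` and `first_of_string` and the dict-based table are assumptions for this illustration. Because `"$"` is stored directly in the Follow sets, the last two rules collapse into a single step, and a doubly defined entry is reported as an error.

```python
EPS = "ε"
GRAMMAR = {
    "E": [["T", "X"]],
    "X": [["+", "E"], []],
    "T": [["(", "E", ")"], ["int", "Y"]],
    "Y": [["*", "T"], []],
}
FIRST = {"E": {"int", "("}, "T": {"int", "("}, "X": {"+", EPS}, "Y": {"*", EPS}}
FOLLOW = {"E": {")", "$"}, "X": {")", "$"}, "T": {"+", ")", "$"}, "Y": {"+", ")", "$"}}


def first_of_string(symbols, first):
    """First set of a sequence of symbols; terminals have First(t) = {t}."""
    out = set()
    for sym in symbols:
        f = first.get(sym, {sym})
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)
    return out


def build_table(grammar, first, follow):
    """Return {(A, t): alpha}; raise ValueError if an entry is defined twice."""
    table = {}
    for a, rhss in grammar.items():
        for rhs in rhss:
            fa = first_of_string(rhs, first)
            targets = fa - {EPS}        # each terminal t in First(alpha)
            if EPS in fa:
                targets |= follow[a]    # each t in Follow(A), including "$"
            for t in targets:
                if (a, t) in table:
                    raise ValueError(f"conflict at [{a}, {t}]: not LL(1)")
                table[(a, t)] = rhs
    return table
```

Running `build_table(GRAMMAR, FIRST, FOLLOW)` yields the eleven non-blank entries of the parsing table shown earlier, while a grammar that is not left-factored, such as \(T \rightarrow int \; \vert \; int * T\), triggers the conflict error.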
Notes on \(LL(1)\) Parsing Tables
If any entry is multiply defined, then \(G\) is not \(LL(1)\). This happens in particular:
If \(G\) is ambiguous
If \(G\) is left recursive
If \(G\) is not left factored
And in other cases as well
Most programming language grammars are not \(LL(1)\)
There are tools that build \(LL(1)\) tables
For some grammars, predictive parsing is a simple parsing strategy