Given a language \(L(G)\), a parser consumes a sequence of tokens \(s\) and produces a parse tree
Issues:
How do we recognize that \(s \in L(G)\)?
A parse tree of \(s\) describes how \(s \in L(G)\)
Ambiguity: more than one parse tree for some string \(s\)
Error: no parse tree for some string \(s\)
How do we construct the parse tree?
So far, a parser traces the derivation of a sequence of tokens
The rest of the compiler needs a structural representation of the program
Abstract syntax trees (ASTs) are like parse trees, but ignore some details
Consider the grammar \[E \rightarrow int | (E) | E + E\]
and the string \[5 + (2 + 3)\]
After lexical analysis (a list of tokens) \[int(5), plus, lparen, int(2), plus, int(3), rparen\]
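The lexical-analysis step above can be sketched in a few lines of Python. This is only a minimal illustration, not the lexer of any particular compiler: the token names are taken from the list above, while `TOKEN_SPEC`, `TOKEN_RE`, and `tokenize` are hypothetical names introduced here.

```python
import re

# Hypothetical token specification; the token names (int, plus, lparen, rparen)
# follow the token list above, and "skip" is an assumed name for whitespace.
TOKEN_SPEC = [
    ("int", r"\d+"),
    ("plus", r"\+"),
    ("lparen", r"\("),
    ("rparen", r"\)"),
    ("skip", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (kind, lexeme) pairs; raise ValueError on an unknown character."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "skip":  # whitespace produces no token
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


tokens = tokenize("5 + (2 + 3)")
print(tokens)
```

Each pair keeps the lexeme next to the token kind, so that an attribute such as the integer value can later be computed from it.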
During parsing, we build a parse tree \(\ldots\)
Traces the operation of the parser
Captures the nesting structure
But has too much information, for example the parentheses
An AST also captures the nesting structure
But abstracts from the concrete syntax making it more compact and easier to use
An important data structure in a compiler
Each grammar symbol may have attributes
An attribute is a property of a programming language construct
For terminal symbols attributes can be calculated by the lexer
Each production may have an action
Written as: \(X \rightarrow Y_1 \ldots Y_n \{action\}\)
That can refer to or compute symbol attributes
This is what we will use to construct ASTs
Consider the grammar \[E \rightarrow int | (E) | E + E\]
For each symbol \(X\) define an attribute \(X.val\)
For terminals, \(val\) is the associated lexeme
For non-terminals, \(val\) is the expression’s value
We annotate the grammar with actions: \[\begin{aligned} E & \rightarrow int &\{&E.val = int.val\}\\ & \quad \vert \; (E_1) &\{&E.val = E_1.val\}\\ & \quad \vert \; E_1 + E_2 &\{&E.val = E_1.val + E_2.val\}\\ \end{aligned}\]
String: \(5 + (2 + 3)\)
Tokens: int(5), plus, lparen, int(2), plus, int(3), rparen
| Productions | Equations |
| --- | --- |
| \(E \rightarrow E_1 + E_2\) | \(E.val = E_1.val + E_2.val\) |
| \(E_1 \rightarrow int(5)\) | \(E_1.val = int(5).val = 5\) |
| \(E_2 \rightarrow (E_3)\) | \(E_2.val = E_3.val\) |
| \(E_3 \rightarrow E_4 + E_5\) | \(E_3.val = E_4.val + E_5.val\) |
| \(E_4 \rightarrow int(2)\) | \(E_4.val = int(2).val = 2\) |
| \(E_5 \rightarrow int(3)\) | \(E_5.val = int(3).val = 3\) |
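The equations above can be checked with a small sketch that builds the parse tree by hand and evaluates \(val\) by a post-order traversal. The classes `Token` and `Node` and the production labels `"int"`, `"paren"`, and `"plus"` are assumptions made for this illustration; they are not part of the grammar notation.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Token:
    kind: str    # "int", "plus", "lparen", or "rparen"
    lexeme: str


@dataclass
class Node:
    """A parse-tree node for E, labelled with the production that was used."""
    production: str  # hypothetical labels: "int", "paren", "plus"
    children: List[Union["Node", Token]]


def val(node):
    """Compute E.val by running the semantic action of the node's production."""
    if node.production == "int":    # E -> int      { E.val = int.val }
        return int(node.children[0].lexeme)
    if node.production == "paren":  # E -> (E1)     { E.val = E1.val }
        return val(node.children[1])
    if node.production == "plus":   # E -> E1 + E2  { E.val = E1.val + E2.val }
        return val(node.children[0]) + val(node.children[2])
    raise ValueError(f"unknown production {node.production!r}")


# Parse tree for 5 + (2 + 3), using the same names as the table of equations.
e4 = Node("int", [Token("int", "2")])
e5 = Node("int", [Token("int", "3")])
e3 = Node("plus", [e4, Token("plus", "+"), e5])
e2 = Node("paren", [Token("lparen", "("), e3, Token("rparen", ")")])
e1 = Node("int", [Token("int", "5")])
e = Node("plus", [e1, Token("plus", "+"), e2])

print(val(e))  # 10
```

The recursion visits children before their parent, so every equation is evaluated only after the values it needs are available.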
Semantic actions specify a system of equations, but the order of executing the actions is not specified
Example: \[E_3.val = E_4.val + E_5.val\]
Must compute \(E_4.val\) and \(E_5.val\) before \(E_3.val\)
We say that \(E_3.val\) depends on \(E_4.val\) and \(E_5.val\)
The parser must find the order of evaluation
An attribute must be computed after all the attributes it depends on (its successors in the dependency graph) have been computed
Such an order exists when there are no cycles
In the previous example, attributes can be computed bottom-up
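One way to find such an order is a topological sort of the dependency graph. The sketch below uses the standard-library `graphlib` module (Python 3.9 or later) on the attributes of the previous example; the dictionaries `deps` and `equations` are hypothetical encodings chosen for this illustration.

```python
from graphlib import TopologicalSorter

# Each attribute maps to the set of attributes it depends on.
deps = {
    "E.val": {"E1.val", "E2.val"},
    "E1.val": set(),
    "E2.val": {"E3.val"},
    "E3.val": {"E4.val", "E5.val"},
    "E4.val": set(),
    "E5.val": set(),
}

# One equation per attribute; each reads already-computed values from v.
equations = {
    "E.val": lambda v: v["E1.val"] + v["E2.val"],
    "E1.val": lambda v: 5,
    "E2.val": lambda v: v["E3.val"],
    "E3.val": lambda v: v["E4.val"] + v["E5.val"],
    "E4.val": lambda v: 2,
    "E5.val": lambda v: 3,
}

# static_order() lists every attribute after the attributes it depends on,
# and raises graphlib.CycleError if the graph contains a cycle.
order = list(TopologicalSorter(deps).static_order())

values = {}
for attribute in order:
    values[attribute] = equations[attribute](values)

print(order)
print(values["E.val"])  # 10
```

Several valid orders may exist; any of them yields the same attribute values because each equation only reads attributes that come earlier in the order.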
Synthesized attributes
Calculated from attributes of descendants in the parse tree
\(E.val\) is a synthesized attribute
Can always be calculated in a bottom-up order
Grammars with only synthesized attributes are called S-attributed grammars
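Inherited attributes
Calculated from attributes of the parent and/or siblings in the parse tree
Example: a calculator that processes its input line by line
Each line contains an expression \[E \rightarrow int \; \vert \; E + E\]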
Each line is terminated with the \(=\) sign \[L \rightarrow E = \; \vert \; + E =\]
In the second form, the value of evaluating the previous line is used as a starting value
A program is a sequence of lines \[P \rightarrow \epsilon \; \vert \; P L\]
Each \(E\) has a synthesized attribute \(val\)
Each \(L\) has a synthesized attribute \(val\) \[\begin{aligned} L & \rightarrow E = &\{&L.val = E.val\}\\ & \quad \vert \; + E = &\{&L.val = E.val + L.prev\}\\ \end{aligned}\]
We need the value of the previous line
We use an inherited attribute \(L.prev\)
Each \(P\) has a synthesized attribute \(val\) \[\begin{aligned} P & \rightarrow \epsilon &\{&P.val = 0\}\\ & \quad \vert \; P_1 L &\{&P.val = L.val;\\ & & &L.prev = P_1.val \}\\ \end{aligned}\]
Each \(L\) has an inherited attribute \(prev\)
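The flow of the two kinds of attributes can be sketched as follows: \(val\) is returned upward from each function, while \(prev\) is passed downward as a parameter. This is a simplified illustration that assumes each line has already been parsed into a pair `(continues, terms)`; that encoding and the function names are hypothetical.

```python
def eval_expr(terms):
    """E -> int | E + E : E.val is the sum of the integer lexemes."""
    return sum(int(term) for term in terms)


def eval_line(line, prev):
    """L -> E =    { L.val = E.val }
       L -> + E =  { L.val = E.val + L.prev }
    prev is the inherited attribute, supplied by the caller."""
    continues, terms = line
    e_val = eval_expr(terms)
    return e_val + prev if continues else e_val


def eval_program(lines):
    """P -> epsilon  { P.val = 0 }
       P -> P1 L     { L.prev = P1.val; P.val = L.val }"""
    if not lines:
        return 0
    p1_val = eval_program(lines[:-1])          # synthesized from the prefix P1
    return eval_line(lines[-1], prev=p1_val)   # inherited by the last line L


# Assumed encoding of the program:
#   1 + 2 =
#   + 4 =
#   5 =
#   + 1 =
program = [
    (False, ["1", "2"]),
    (True, ["4"]),
    (False, ["5"]),
    (True, ["1"]),
]

print(eval_program(program[:2]))  # 7
print(eval_program(program))      # 6
```

A line of the first form ignores `prev`, which is why the third line resets the running value to 5.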
Semantic actions can be used to build ASTs
And many other things, such as type checking and code generation
This process is called syntax-directed translation – a substantial generalization over context-free grammars
We first define the AST data type
Consider an abstract tree type with two constructors:
mkleaf(n)
mkplus(left_tree, right_tree)
We define a synthesized attribute \(ast\)
Values of \(ast\) are ASTs
We assume that \(int.lexval\) is the value of the integer lexeme
The \(ast\) attribute is computed using semantic actions
\[\begin{aligned} E & \rightarrow int &\{&E.ast = mkleaf(int.lexval)\}\\ & \quad \vert \; (E_1) &\{&E.ast = E_1.ast\}\\ & \quad \vert \; E_1 + E_2 &\{&E.ast = mkplus(E_1.ast, E_2.ast)\}\\ \end{aligned}\]
Consider the string: \(5 + (2 + 3)\)
A bottom-up evaluation of the \(ast\) attribute: \[\begin{aligned} E.ast = mkplus(&mkleaf(5),\\ &mkplus(mkleaf(2), mkleaf(3)))\\ \end{aligned}\]
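The construction above can be reproduced with a short sketch. Because the grammar \(E \rightarrow int \vert (E) \vert E + E\) is ambiguous and left-recursive, the sketch parses the assumed equivalent form \(E \rightarrow T \; (plus \; T)^*\), \(T \rightarrow int \vert (E)\) by recursive descent and treats \(+\) as left-associative. That rewriting, the classes `Leaf` and `Plus`, and the parser functions are choices made for this illustration; only `mkleaf` and `mkplus` come from the text.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Leaf:
    value: int


@dataclass(frozen=True)
class Plus:
    left: "AST"
    right: "AST"


AST = Union[Leaf, Plus]


def mkleaf(n):
    return Leaf(n)


def mkplus(left_tree, right_tree):
    return Plus(left_tree, right_tree)


def parse_expr(tokens, pos=0):
    """Parse E -> T (plus T)* and return (ast, next_position)."""
    ast, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos][0] == "plus":
        right, pos = parse_term(tokens, pos + 1)
        ast = mkplus(ast, right)             # action of E -> E1 + E2
    return ast, pos


def parse_term(tokens, pos):
    """Parse T -> int | (E) and return (ast, next_position)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    kind, lexeme = tokens[pos]
    if kind == "int":
        return mkleaf(int(lexeme)), pos + 1  # action of E -> int
    if kind == "lparen":
        ast, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos][0] != "rparen":
            raise SyntaxError("expected rparen")
        return ast, pos + 1                  # action of E -> (E1): parentheses are dropped
    raise SyntaxError(f"unexpected token {kind!r}")


# Tokens for 5 + (2 + 3), written as (kind, lexeme) pairs.
tokens = [
    ("int", "5"), ("plus", "+"), ("lparen", "("),
    ("int", "2"), ("plus", "+"), ("int", "3"), ("rparen", ")"),
]
ast, end = parse_expr(tokens)
print(ast)
```

The resulting tree contains no node for the parentheses: their only effect is on the shape of the tree, which is exactly the detail the AST abstracts away.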
We can specify language syntax using a context-free grammar
A parser will answer whether \(s \in L(G)\)
\(\ldots\) and will build a parse tree
\(\ldots\) which we convert to an AST
\(\ldots\) and pass on to the next phase