Scoping and Type Checking

CSC 310 - Programming Languages

Outline

The role of semantic analysis in a compiler
Scope
- static vs. dynamic scoping
- implementation: symbol tables
Types
- static analyses that detect type errors
- statically vs. dynamically typed languages

The Compiler Front-End

Lexical analysis: the program is lexically well-formed
- tokens are legal
- detects inputs with illegal tokens
Parsing
- declarations have correct structure, expressions are syntactically valid, etc.
- detects inputs with ill-formed syntax
Semantic analysis
- last "front end" compilation phase
- catches the remaining errors that can be detected before the program runs

Beyond Syntax Errors

Example C program semantic errors:

foo(int a, char *s){...}

int bar() {
  int f[3];
  int i, j, k;
  char q, *p;
  float k;
  foo(f[6], 10, j);
  break;
  i->val = 42;
  j = m + k;
  printf("%s,%s.\n",p,q);
  goto label42;
}

Beyond Syntax Errors (continued)

Example C program semantic errors:
- Undeclared identifier
- Multiple declarations of identifier
- Index out of bounds
- Incorrect number or types of arguments to function call
- Incompatible types for operation
- A break statement outside of a loop
- A goto with no label

Why Have a Separate Semantic Analysis Phase?

Parsing cannot catch some errors
Some language constructs are not context-free
- Example: All used variables must have been declared (that is, scoping)
- Example: A method must be invoked with arguments of proper type (that is, typing)

What Does Semantic Analysis Do?

Performs checks beyond syntax of many kinds
Examples for Cool:
- All used identifiers are declared
- Static types
- Inheritance relationships
- Classes defined only once
- Methods in a class defined only once
- Reserved identifiers are not misused
The requirements depend on the language

Scope

The scope of an identifier binding (the association of a name with the entity it names) is the textual part of the program in which that binding is active
Scope matches identifier declarations with uses, an important static analysis step in most languages
The scope of an identifier is the portion of a program in which that identifier is accessible
The same identifier may refer to different things in different parts of the program
An identifier may have restricted scope

Static vs. Dynamic Scope

Most languages have static (lexical) scope
- Scope depends only on the physical structure of program text, not its run-time behavior
- The determination of scope is made by the compiler
A few languages are dynamically scoped
- Scope depends on execution of the program

Static Scoping Example

Uses of x refer to the closest enclosing definition of x

let integer x := 0 in
{
  x;
  let integer x := 1 in
    x;
  x;
}

Static vs. Dynamic Scope

Example

program scopes(input, output);
var a: integer;
procedure first;
  begin a := 1; end;
procedure second;
  var a: integer;
  begin first; end;
begin
  a := 2; second; write(a);
end.

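With static scope, the result is 1 (first assigns to the global a, the declaration that textually encloses it)
With dynamic scope, the result is 2 (first assigns to the a local to second, its caller, so the global a is unchanged)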
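
The two outcomes can be traced with a small simulation. This is only a sketch: the names run, assign, call_stack, and global_env are hypothetical, and each environment is modeled as a plain dictionary rather than anything a real Pascal compiler would use.

```python
# Hypothetical simulation of the Pascal program "scopes"; all names are illustrative.

def run(dynamic):
    global_env = {"a": None}      # the program-level variable a
    call_stack = [global_env]     # run-time stack of activation environments

    def assign(name, value, defining_env):
        if dynamic:
            # Dynamic scope: search the call stack, most recent activation first.
            for env in reversed(call_stack):
                if name in env:
                    env[name] = value
                    return
        else:
            # Static scope: use the environment that textually encloses the procedure.
            defining_env[name] = value

    def first():
        call_stack.append({})             # first declares no locals
        assign("a", 1, global_env)        # first is nested directly in the program
        call_stack.pop()

    def second():
        call_stack.append({"a": None})    # second's local a
        first()
        call_stack.pop()

    global_env["a"] = 2
    second()
    return global_env["a"]                # the value write(a) would print


print(run(dynamic=False))  # static scope: prints 1
print(run(dynamic=True))   # dynamic scope: prints 2
```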

Scope in Cool

Cool identifier bindings are introduced by:
- Class declarations (introduce class names)
- Method definitions (introduce method names)
- Let expressions (introduce object identifiers)
- Formal parameters (introduce object identifiers)
- Attribute definitions in a class (introduce object identifiers)
- Case expressions (introduce object identifiers)

Scope of Identifiers

In most programming languages identifier bindings are introduced by
- Function declarations (introduce function names)
- Procedure definitions (introduce procedure names)
- Identifier declarations (introduce identifiers)
- Formal parameters (introduce identifiers)

Implementing the Most Closely Nested Rule

Much of semantic analysis can be expressed as a recursive descent of an AST
- Process an AST node $n$
- Process the children of $n$
- Finish processing node $n$
When performing semantic analysis on a portion of the AST, we need to know which identifiers are defined.
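
The three steps above can be sketched as a recursive function that carries the set of defined identifiers down the tree. The tuple-based node shapes ("num", "var", "add", "let") and the function name undeclared are assumptions made for illustration, not part of any particular compiler.

```python
def undeclared(node, defined=frozenset()):
    """Return the names used in node that are not in the set `defined`."""
    kind = node[0]
    if kind == "num":                       # ("num", 3)
        return []
    if kind == "var":                       # ("var", "x")
        return [] if node[1] in defined else [node[1]]
    if kind == "add":                       # ("add", e1, e2)
        return undeclared(node[1], defined) + undeclared(node[2], defined)
    if kind == "let":                       # ("let", "x", init, body)
        _, name, init, body = node
        # Begin processing the node: the initializer does not see the new name.
        errors = undeclared(init, defined)
        # Process the child subtree with the new name in scope.
        errors += undeclared(body, defined | {name})
        # Finish processing the node: the caller's `defined` set is unchanged.
        return errors
    raise ValueError(f"unknown node kind: {kind}")


# let x <- 0 in x + y   (y is never declared)
tree = ("let", "x", ("num", 0), ("add", ("var", "x"), ("var", "y")))
print(undeclared(tree))
```

Passing an immutable set downward means nothing has to be undone on the way back up; a mutable symbol table trades that simplicity for cheaper updates.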

Implementing the Most Closely Nested Rule

Example: the scope of variable declarations is one subtree
```
let x : Int <- 0 in E
```
x can be used in subtree E

Symbol Tables

Purpose: to hold information about identifiers that is computed at some point and looked up at later times during compilation
Example information:
- type of a variable
- entry point for a function
Operations: insert, lookup, delete
Common implementations: linked lists, hash tables

Symbol Tables

Assuming static scope, consider again
```
let x : Int <- 1 in E
```
Idea:
- before processing E, add a definition of x to the current definitions, overriding any other definition of x
- after processing E, remove the definition of x and, if needed, restore old definition of x
A symbol table is a data structure that tracks the current bindings of identifiers
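
One common way to get the override-and-restore behavior is a stack of scopes. The sketch below assumes that design; the class name SymbolTable and its method names are illustrative, and deletion is modeled by popping a whole scope rather than removing a single entry.

```python
class SymbolTable:
    """A stack of scopes; the innermost scope is searched first."""

    def __init__(self):
        self.scopes = [{}]

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        # Dropping the innermost scope removes its bindings and
        # makes any shadowed outer bindings visible again.
        self.scopes.pop()

    def insert(self, name, info):
        self.scopes[-1][name] = info

    def lookup(self, name):
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None                # name is not defined in any enclosing scope


table = SymbolTable()
table.insert("x", "Int (outer)")
table.enter_scope()                # before processing E
table.insert("x", "Int (inner)")   # overrides the outer definition of x
inner = table.lookup("x")
table.exit_scope()                 # after processing E
outer = table.lookup("x")          # the old definition is restored
```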

Scope in Cool

Not all kinds of identifiers follow the most-closely nested rule
For example, class definitions in Cool
- Cannot be nested
- Are globally visible throughout the program
In other words, a class name can be used before it is defined

Scope in Cool (Continued)

Attribute names are global within the class in which they are defined
```
Class Foo {
    f(): Int { tm };
    tm : Int <- 0;
};
```

Scope in Cool (Continued)

Method and attribute names have complex rules
A method need not be defined in the class in which it is used, but in some parent class (this is standard inheritance)
Methods may also be redefined (overridden)

Class Definitions

Class names can be used before being defined
We cannot check this property
- using a symbol table
- or even in one pass
Solution
- Pass 1: Collect all class names
- Pass 2: Do the checking
Semantic analysis requires multiple passes (probably more than two)
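
A minimal sketch of the two-pass idea, assuming each class is represented only by a hypothetical (name, parent) pair and that Object is built in. The function name check_classes and the error strings are illustrative.

```python
def check_classes(classes):
    """classes is a list of (name, parent_name) pairs in source order."""
    errors = []

    # Pass 1: collect all class names, reporting duplicates.
    known = {"Object"}             # assumed to be predefined
    for name, _ in classes:
        if name in known:
            errors.append(f"class {name} defined more than once")
        known.add(name)

    # Pass 2: every referenced class must be in the collected set,
    # regardless of where in the program it was defined.
    for name, parent in classes:
        if parent not in known:
            errors.append(f"class {name} inherits from undefined class {parent}")

    return errors


# A is declared before its parent B, which is fine after pass 1.
print(check_classes([("A", "B"), ("B", "Object")]))
```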

Types

What is a type?
- This is the subject of some debate
- The notion varies from language to language
Consensus
- A type is a set of values and
- A set of operations on those values
Classes are one instantiation of the modern notion of type

Types and Operations

Consider the assembly language fragment
```
add $r1, $r2, $r3
```
What are the types of $r1, $r2, and $r3?
Certain operations are legal for values of each type
- It does not make sense to add a function pointer and an integer in C
- It does make sense to add two integers
- But, both have the same assembly language implementation

Type Systems

A language’s type system specifies which operations are valid for which types
The goal of type checking is to ensure that operations are used with the correct types
- Enforces intended interpretation of values, because nothing else will
Type systems provide a concise formalization of the semantic checking rules

What Can Types do For Us?

Allow for a more efficient compilation of programs
- Allocate the correct amount of space for variables
- Select the correct machine instructions
Statically detect certain kinds of errors
- Memory errors (reading from an invalid pointer, etc.)
- Violation of abstraction boundaries
- Security and access rights violations

Type Checking Overview

Three kinds of languages
- Statically typed: all or almost all checking of types is done as part of compilation
- Dynamically typed: almost all checking of types is done as part of program execution
- Untyped: no checking (machine code)
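
Python is a convenient illustration of the dynamically typed case. In the sketch below, the function add and the variable outcome are made up for the example; the point is only that the ill-typed call is accepted when the function is defined and rejected when the addition actually executes.

```python
def add(a, b):
    return a + b      # operand types are checked only when this line runs


try:
    add(1, "two")     # an int and a str cannot be added
    outcome = "no error"
except TypeError:
    outcome = "TypeError at run time"

print(outcome)
```

A statically typed language would reject the equivalent call during compilation, before any code runs.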

The Type Wars

Competing views on static vs. dynamic typing
Static typing proponents say:
- Static checking catches many programming errors at compile time
- Avoids overhead of runtime type checks
Dynamic typing proponents say:
- Static type systems are restrictive
- Rapid prototyping is easier in a dynamic type system

Cool Types

The types are:
- Class names
- SELF_TYPE
There are no unboxed base types
The user declares types for all identifiers
The compiler infers types for expressions

Type Checking and Type Inference

Type checking is the process of verifying fully typed programs
Type inference is the process of filling in missing type information
The two are different, but the terms are often used interchangeably

Rules of Inference

We have seen two examples of formal notation for specifying parts of a compiler
- Regular expressions (for the lexer)
- Context-free grammars (for the parser)
The appropriate formalism for type checking is logical rules of inference

Why Rules of Inference?

Inference rules have the form: If Hypothesis is true, then Conclusion is true
Type checking computes via reasoning: If $E_1$ and $E_2$ have certain types, then $E_3$ has a certain type
Rules of inference are a compact notation for “If-Then” statements

From English to an Inference Rule

The notation is easy to read (with practice)
Start with a simplified system and gradually add features
Building blocks:
- Symbol $\land$ is “and”
- Symbol $\Rightarrow$ is “if-then”
- $x:T$ is “$x$ has type $T$”
Example:
- If $e_1$ has type $int$ and $e_2$ has type $int$, then $e_1 + e_2$ has type $int$
- $(e_1$ has type $int \land e_2$ has type $int) \Rightarrow e_1 + e_2$ has type $int$
- $(e_1:int \land e_2:int) \Rightarrow e_1 + e_2 : int$
The statement $(e_1:int \land e_2:int) \Rightarrow e_1 + e_2 : int$ is a special case of $H_1 \land \ldots \land H_n \Rightarrow C$; this is an inference rule

Notation for Inference Rules

By tradition, inference rules are written \[\frac{\vdash Hypothesis_1 \ldots \vdash Hypothesis_n}{\vdash Conclusion}\]
Type rules have hypotheses and conclusions of the form: \[\vdash e : T\]
$\vdash$ means “it is provable that …”

Example Rules

Example \[\frac{i \text{ is an integer}}{\vdash i : Int}\text{[Int]}\]

\[\frac{ \begin{array}{l} \vdash e_1 : Int\\ \vdash e_2 : Int \end{array}} {\vdash e_1 + e_2 : Int}\text{[Add]}\]
These rules give templates describing how to type integers and $+$ expressions
By filling in the templates, we can produce complete typings for expressions

Example: 1 + 2

\[\frac{ \begin{array}{l} \vdash 1 : Int\\ \vdash 2 : Int \end{array}} {\vdash 1 + 2 : Int}\text{[Add]}\]
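
The [Int] and [Add] rules can also be read as a recursive procedure: to type an expression, find the rule whose conclusion matches it and recursively establish that rule's hypotheses. The sketch below assumes a hypothetical encoding in which integer literals are Python ints and a sum is the tuple ("+", e1, e2); type_of is an illustrative name.

```python
def type_of(expr):
    """Return "Int" if expr is typable by [Int] and [Add]; raise TypeError otherwise."""
    # Rule [Int]: an integer literal i has type Int.
    if isinstance(expr, int) and not isinstance(expr, bool):
        return "Int"
    # Rule [Add]: if e1 : Int and e2 : Int, then e1 + e2 : Int.
    if isinstance(expr, tuple) and len(expr) == 3 and expr[0] == "+":
        _, e1, e2 = expr
        if type_of(e1) == "Int" and type_of(e2) == "Int":
            return "Int"
    # No rule's conclusion matches, so no typing can be derived.
    raise TypeError(f"no rule applies to {expr!r}")


print(type_of(("+", 1, 2)))   # the derivation for 1 + 2: prints Int
```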

Summary

Scoping rules match identifier uses with identifier definitions
A type is a set of values coupled with a set of operations on those values
A type system specifies which operations are valid for which types
Type checking can be done statically (at compile time) or dynamically (at run time)