The Lexer
- Due:
- 11:00 pm, Monday February 22, 2021
Max grace days: 2
Overview
Programming assignments 2 through 5 will direct you to design and build an interpreter for Cool. Each assignment will cover one component of the interpreter: lexical analysis, semantic analysis, and operational semantics. Each assignment will ultimately result in a working interpreter phase which can interface with the other phases.
You must complete this assignment using OCaml.
For this assignment you will write a lexical analyzer, also called a scanner, using a lexical analyzer generator. You will describe the set of tokens for Cool in an appropriate input format and the analyzer generator will generate actual code. You will then write additional code to serialize the tokens for use by later interpreter stages.
Specification
You must create three artifacts:
- A program that rakes a single command line argument (for example
file.cl
). That argument will be an ASCII text Cool source file. Your program must either indicate that there is an error in the input (for example, a malformed string) or emitfile.cl-lex
, a serialized list of Cool tokens. Your program’s main lexer component must be constructed by a lexical analyzer generator. The “glue code” for processing command line arguments and serializing tokens should be written by hand. If your program is calledlexer
, then invokinglexer file.cl
should yield the same output ascool --lex file.cl
. Your program will consist of a number of OCaml files or a number of Python files. - A plain ASCII text file called
README
describing your design decisions and choice of test cases. See the grading criteria. A few paragraphs should suffice. - Testcases
good.cl
andbad.cl
. The first should lex correctly and yield a sequence of tokens. The second should contain an error.
Note: you must use ocamllex. Do not write your entire lexer by hand. Parts of it must be tool-generated from regular expressions you provide.
Considerations:
- Line numbers: The first line in a file is 1. Each successive
\n
newline character increments the line count. Your lexer is responsible for keeping track of the current line number. Error reporting: To report an error, write the string
ERROR: line_number: Lexer: message
to standard output and terminate the program. You may write whatever you want in the message, but it should be fairly indicative. Example erroneous input:
Backslash not allowed\
Example error report output:
ERROR: 1: Lexer: invalid character: \
The .cl-lex
File Format
If there are not errors in file.cl
, then your program should create file.cl-lex
and serialize the tokens to it. Each token is represented by a pair (or triplet) of lines. The first line holds the line number. The second line gives the name of the token. The optional third line holds additional information (that is, the lexeme) for identifiers, integers, strings, and types. For example, for an integer token, the third line should contain the decimal integer value.
Example input:
Backslash not
allowed
Corresponding .cl-lex
output:
1
type
Backslash
1
not
2
identifier
allowed
The official list of token names is:
at case class colon comma divide dot else equals esac false fi identifier if in inherits integer isvoid larrow lbrace le let loop lparen lt minus new not of plus pool rarrow rbrace rparen semi string then tilde times true type while
In general the intended token is evident. For the more exotic names:
at = @, larrow = <-, lbrace = {, le = <=, lparen = (, lt = <, rarrow = =>, rbrace = }, semi = ;, tilde = ~
The .cl-lex
file format is exactly the same as the one generated by the reference compiler when you specify --lex
. In addition, the reference compiler (and your upcoming assignment 3 parser) will read .cl-lex
files instead of .cl
files.
Lexical Analyzer Generators
You must use a lexical analyzer generator or similar library for this assignment.
- The OCaml lexical analyzer generator is called
ocamllex
and it comes with any OCaml distribution.
This lexical analyzer generator is derived from lex
(or flex
), the original lexical analyzer generator for C. Thus you may find it useful to refer to the Lex paper or the Flex manual.
Commentary
You can do basic testing with something like the following:
linux> ./cool.exe --out reference --lex file.cl
linux> main file.cl
linux> diff -b -B -E -w file.cl-lex reference.cl-lex
For example,
linux> ./cool.exe --out reference --lex file.cl
linux> ocamllex my-lexer.mll
linux> ocaml my-lexer file.cl
linux> diff file.cl-lex reference.cl-lex
You may find the reference compiler’s --unlex
option useful for debugging your .cl-lex
files.
To run the test cases, execute the command:
~schwesin/csc310/bin/test_lexer <zip file name>
on the Linux server where <zip file name>
is the name of your zip file containing a file named main.ml
.
Turning in the Assignment
You must turn in a zip file containing these files:
README
– your README filegood.cl
– a novel positive test casebad.cl
– a novel negative test case- source files – including -
main.ml
andlexer.mll
To submit the assignment, execute the command:
~schwesin/bin/submit-lexer.bash <zip file name>
on the Linux server where <zip file name>
is the name of your zip file.
Grading Criteria
Grading (out of 50 points):
- 38 points – autograder tests
- Each failed test removes one point, to a minimum of 0/38, even if there are more tests than total points.
- 4 points – for a clear description in your README
- 4 – thorough discussion of design decisions (including handling of strings and comments) and choice of test cases; a few paragraphs of coherent English sentences should be fine
- 2 – vague or hard to understand; omits important details
- 0 – little or no effort, or submitted an RTF/DOC/PDF file instead of plain TXT
- 4 points – for valid and novel
good.cl
andbad.cl
files- 4 – wide range of test cases added, stressing most Cool features and an error condition, novel files
- 2 – added some tests, but the scope not sufficiently broad
- 0 – little or no effort, or submitted an RTF/DOC/PDF file instead of plain TXT, or submitted part of course files as test cases
- 4 points - for code cleanliness
- 4 – code is mostly clean and well-commented
- 2 – code sloppy and/or poorly commented in places
- 0 – little to no effort to organize and document code