ARITHMETIC INTERPRETER/COMPILER PRELIMINARY DESIGN File: interpreter.pd Author: course (walton@das.harvard.edu) Version: 1 1. Goal The arithmetic interpreter implements an interpreter and a compiler for a simple arithmetic language. 2. Components 2.1 Basic Support Basic support provides basic I/O facilities and interfaces to error routines that are not specific to the interpreter/compiler. The ein input stream is defined that echos its input when reading from a file. Files: basic.h estream.h basic.cc estream.cc 2.2 Error Handling Error routines are provided that print an error message and then do a longjmp to recover from the error by going back to the interpreter top level so the user can continue. Files: error.cc 2.3 Lexical Analyzer The lexical analyzer reads the character stream input (ein) and builds lexemes in memory. Files: lexeme.h lexeme.cc 2.4 Parser The parser reads a lexeme stream and parses it using the recursive descent method. Rather than build a parse tree in memory, the parser uses its own call tree to represent the parse tree, and calls semantic algorithm functions to compute a value for appropriate nodes of the parse tree. Files: parser.h parser.cc 2.5 Semantic Algorithms The semantic algorithms provide functions for computing values. An abstract value type, avalue, is defined for values of parse nodes. Operators for adding and multiplying avalues are provided, as well as functions for getting an avalue from a number or symbol lexeme and storing an avalue in a symbol lexeme. Two different implementations of the avalue abstract type are provided. The interpreter implementation defines an avalue to be a `long int'. Files: iavalue.h iavalue.cc The compiler implementation defines an avalue to be a numeric code that names a register where the value will be when the compiled code is run. The compiler implementation also emits the assembly language code necessary to compute the value in the required place given the registers where input values are. Files: cavalue.h cavalue.cc 2.6 Interpreter Top Level The interpreter top level prompts the user for input, and calls the parser to read and parse the input and compute a value. The parser calls the lexical analyzer to read lexemes from the input. The parser calls the semantic algorithm routines to assign values to the parse tree nodes. The top level prints the value computed for the entire program. The top level also catches any error longjmps and reprompts the user for more input. Files: interpreter.cc 2.7 Note on Makefile: Several .cc files generate different .o files for the interpreter and compiler as follows: Source File Interpreter Compiler lexeme.cc ilexeme.o clexeme.o parser.cc iparser.o cparser.o interpreter.cc interpreter.o compiler.o The .o files are generated by copying either iavalue.h for the interpreter or cavalue.h for the compiler into avalue.h, and compiling the .cc files. 3. Lexical Analyzer The lexical analyzer reads input from ein (like cin but echos: see estream.h), and outputs lexemes. 3.1 Lexeme Specification The following is a specification for the lexemes of the arithmetic language. (1) Whitespace characters are not part of any lexeme. They may be used to separate lexemes. There a no comments in this language. (2) The characters +, -, *, / are each single character operator lexemes. (3) The characters =, ., (, ), and ; are each single character lexemes. (4) A number lexeme is any character string having the syntax of a C++ integer that begins with a digit (on encountering a digit, use `ein >> x' to read an integer into an integer variable x). (5) A symbol lexeme consists of any sequence of letters and digits beginning with a letter. (6) The process of converting a character stream (ein) to a lexeme stream consists of (a) Remove whitespace from the beginning of the character string (use ein >> ws) (b) Remove the longest possible lexeme from the beginning of the character string. Repeat (a) and (b) until the end of program lexeme (a period, `.') has been read. 3.2 Lexeme Grammar The lexemes of our language can also be specified by the grammar: lexeme ::= symbol | number | operator | assignment | end-of-file | left-parenthesis | right-parenthesis | semi-colon symbol ::= letter letter-or-digit* letter-or-digit ::= letter | digit letter ::= a | b | ... | z | A | B | ... | Z digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 number ::= operator ::= + | - | * | / assignment ::= = end-of-file ::= . left-parenthesis ::= ( right-parenthesis ::= ) semi-colon ::= ; file ::= { whitespace lexeme }* whitespace whitespace ::= whitespace-character* whitespace-character ::= > ws;> 3.3 Lexeme DFA The lexemes of this language can also be specified by a deterministic finite automaton: c is current character BEGIN_LEXEME: isspace(c) ---> BEGIN_LEXEME {discard} isalpha(c) ---> SYMBOL {accept} isdigit(c) ---> NUMBER {putback} + ---> BEGIN_LEXEME {accept; output operator} - ---> BEGIN_LEXEME {accept; output operator} * ---> BEGIN_LEXEME {accept; output operator} / ---> BEGIN_LEXEME {accept; output operator} = ---> BEGIN_LEXEME {discard} {output assignment} . ---> BEGIN_LEXEME {discard} {output end-of-file} ( ---> BEGIN_LEXEME {discard} {output left-parenthesis} ) ---> BEGIN_LEXEME {discard} {output right-parenthesis} ; ---> BEGIN_LEXEME {discard} {output semi-colon} other ---> BEGIN_LEXEME {announce c is erroneous} {discard} SYMBOL: isalpha(c) ---> SYMBOL {accept} isdigit(c) ---> SYMBOL {accept} other ---> BEGIN_LEXEME {putback} {output symbol} NUMBER: {call cin >> i for int i} {output i as number} ---> BEGIN_LEXEME Here each transition is associated with actions. There is a lexeme string buffer that holds the lexeme being accumulated. The possible actions regarding the character c being input and the string buffer are: accept Move c to the end of the lexeme string buffer. discard Remove c from the input, but leave the lexeme string buffer untouched. putback Leave both the input and the lexeme string buffer untouched. This is like c = get() followed by putback(c). The following additional actions are output ... Output the contents of the lexeme string buffer as a ... and empty the lexeme string buffer. announce ... Print an error message. In addition we have been lazy in writing the part of the DFA for numbers, leaving it to the C++ >> operator for integers to define that part of the DFA. The DFA form of the lexical analyzer represents the actions, but not the form, of the actual lexical analyzer code. 3.4 Lexeme Data Grammar Lexemes are represented in memory by structures that have members giving the type of the lexeme, the name of a symbol, the value of a number, the op code of an operation, etc. A data grammar for these structures is: lexeme ::= ( type, pname, op, value, inited, next ) type ::= LEX_SYMBOL | LEX_NUMBER | LEX_OP | LEX_ASSIGN | LEX_EOF | LEX_LEFT_PAREN | LEX_RIGHT_PAREN | LEX_SEMICOLON pname ::= op ::= OP_PLUS | OP_MINUS | OP_TIMES | OP_DIVIDE value ::= inited ::= true | false next ::= pointer-to-lexeme | NULL symtab ::= pointer-to-first-lexeme Symbols have their name stored in pname. Operations have their op code stored in op. Numbers have their value stored in `value'. The user defined `avalue' type, denoting an `arithmetic value', has operators +, -, *, / like a number type. It can be read by ein >> x for an avalue x. Symbols may also have a `value', and have the inited member set to true if the symbol has been given a value (initialized). Symbols are linked together into a symbol table using the `next' member and the `symtab' static variable. The symbol table is used to avoid creating two symbols with the same name. 3.5 Lexeme Data Code Synopsis const int MAX_SYMBOL_LENGTH = 100; enum optype { OP_PLUS, OP_MINUS, OP_TIMES, OP_DIVIDE }; enum lextype { LEX_SYMBOL, LEX_NUMBER, LEX_OP, LEX_ASSIGN, LEX_EOF, LEX_LEFT_PAREN, LEX_RIGHT_PAREN, LEX_SEMICOLON }; typedef lexeme_class * lexeme; class lexeme_class { lextype type; char * pname; optype op; avalue value; bool inited; lexeme next; } static lexeme symtab; [In lexeme.cc] 3.6 Lexical Analyzer Function Synopsis Let: lexeme l; Input/Output of Lexemes: cout << l ein >> l Initialization of Lexeme Memory: void init_lexemes ( void ); Test for end of file in ein (no next lexeme): bool end_lexemes ( void ); Get the next lexeme (and permit lexeme backup): lexeme get_lexeme ( void ); Backup over the last lexeme gotten: void unget_lexeme ( void ); Flush one line of lexemes (use after error): void flush_lexemes ( void ); 4. Parser 4.1 Grammar The grammar for our arithmetic language is: program ::= asst* . asst ::= symbol = expr ; expr ::= term { { + | - } term }* term ::= factor { { * | / } factor }* factor ::= symbol | number | ( expr ) | - factor In the subroutines of the parser, the `get_' is dropped from the names of syntactic units, so the names are program asst (assignment statement) expr (expression) term factor 4.2 Parser Construction The parser is a standard one-token-lookahead recursive descent parser, which computes a value for the parse tree nodes as it parses, and returns the value from each parse function. 5. Semantic Algorithm The semantics computes a value. The values are of type `avalue' which has operators +, -, *, / defined on it. The interpreter uses a definition in which avalue is typedef'ed to be the same as `long int'. The avalue can be input/output with >> and <<. 5.1 Other Functions Initialize the value computing system: void init_avalues ( void ); Return the value of a number or symbol lexeme: avalue evaluate ( lexeme l ); Assign a value to a symbol lexeme: void assign ( lexeme l, avalue v ); The zero avalue (value of an empty program): extern avalue zero_avalue; 5.2 Compiler Semantics The compiler semantics algorithm places values in registers and emits code that computes with the registers. Values of avalue type store register names. Temporary registers $t0, $t1, ... $t7 are allocated to hold expression values within an assignment statement. Each such register is freed when it is used. The evaluate function simply returns the value component of a symbol or number. When a number lexeme is read, ein >> is used to read the value component of the number lexeme, which is an avalue. Reading an avalue allocates a temporary register to hold the value, emits an instruction to load the number into the register, and stores the register name as the number lexeme value. When the assign function is first applied to a symbol, it allocates a save register, one of $s0, $s1, ..., $s7, to hold the value of the variable named by the symbol, and stores the name of this register as the symbol lexeme value. The assign function always emits code to move its value register to the symbol register. The value register is freed if it is a temporary register. Operators such as + allocate a temporary register, emit code to compute their result in that register given inputs in the input registers, and free the input registers if these are temporary registers. 6. Top Level Interpreter The top level interpreter, or main program, repeatedly initializes lexemes and avalues, prompts the user, calls the `program' parser routine to parse the next program in the input, and prints the value. If there is no next lexeme when `program' is about to be called, the program terminates. If there is an error during parsing (which includes evaluation or compilation), the appropriate error function prints an error message and does a longjmp to the top level, which flushes the input, prints a line feed, and prompts the user for another program.