ARITHMETIC INTERPRETER/COMPILER PRELIMINARY DESIGN

File:		interpreter.pd
Author:		course (walton@das.harvard.edu)
Version:	1


1. Goal

The arithmetic interpreter implements an interpreter
and a compiler for a simple arithmetic language.

2. Components

   2.1 Basic Support

       Basic support provides basic I/O facilities and
       interfaces to error routines that are not
       specific to the interpreter/compiler.  The ein
       input stream is defined that echos its input
       when reading from a file.

       Files: basic.h estream.h basic.cc estream.cc

   2.2 Error Handling

       Error routines are provided that print an error
       message and then do a longjmp to recover from
       the error by going back to the interpreter top
       level so the user can continue.

       Files: error.cc

   2.3 Lexical Analyzer

       The lexical analyzer reads the character stream
       input (ein) and builds lexemes in memory.

       Files: lexeme.h lexeme.cc

   2.4 Parser

       The parser reads a lexeme stream and parses it
       using the recursive descent method.  Rather than
       build a parse tree in memory, the parser uses its
       own call tree to represent the parse tree, and
       calls semantic algorithm functions to compute
       a value for appropriate nodes of the parse tree. 

       Files: parser.h parser.cc

   2.5 Semantic Algorithms

       The semantic algorithms provide functions for
       computing values.  An abstract value type,
       avalue, is defined for values of parse nodes.
       Operators for adding and multiplying avalues are
       provided, as well as functions for getting an
       avalue from a number or symbol lexeme and storing
       an avalue in a symbol lexeme.

       Two different implementations of the avalue
       abstract type are provided.

       The interpreter implementation defines an avalue
       to be a `long int'.

       Files: iavalue.h iavalue.cc

       The compiler implementation defines an avalue to
       be a numeric code that names a register where the
       value will be when the compiled code is run.  The
       compiler implementation also emits the assembly
       language code necessary to compute the value in
       the required place given the registers where
       input values are.

       Files: cavalue.h cavalue.cc

   2.6 Interpreter Top Level

       The interpreter top level prompts the user for
       input, and calls the parser to read and parse
       the input and compute a value.  The parser calls
       the lexical analyzer to read lexemes from the
       input.  The parser calls the semantic algorithm
       routines to assign values to the parse tree
       nodes.  The top level prints the value computed
       for the entire program.  The top level also
       catches any error longjmps and reprompts the
       user for more input.

       Files: interpreter.cc

   2.7 Note on Makefile:  Several .cc files generate
       different .o files for the interpreter and
       compiler as follows:

       Source File	    Interpreter	    Compiler

       lexeme.cc	    ilexeme.o	    clexeme.o
       parser.cc	    iparser.o	    cparser.o
       interpreter.cc	    interpreter.o   compiler.o

       The .o files are generated by copying either
       iavalue.h for the interpreter or cavalue.h for
       the compiler into avalue.h, and compiling the
       .cc files.


3. Lexical Analyzer

    The lexical analyzer reads input from ein (like cin
    but echos: see estream.h), and outputs lexemes.

   3.1 Lexeme Specification
   
       The following is a specification for the lexemes
       of the arithmetic language.

       (1) Whitespace characters are not part of any
	   lexeme.  They may be used to separate
	   lexemes.

	   There a no comments in this language.

       (2) The characters +, -, *, / are each single
	   character operator lexemes.

       (3) The characters =, ., (, ), and ; are each
	   single character lexemes.

       (4) A number lexeme is any character string
	   having the syntax of a C++ integer that
	   begins with a digit (on encountering a digit,
	   use `ein >> x' to read an integer into an
	   integer variable x).

       (5) A symbol lexeme consists of any sequence of
	   letters and digits beginning with a letter.

       (6) The process of converting a character stream
	   (ein) to a lexeme stream consists of

	   (a) Remove whitespace from the beginning of
	       the character string (use ein >> ws)

	   (b) Remove the longest possible lexeme from
	       the beginning of the character string.

	   Repeat (a) and (b) until the end of program
	   lexeme (a period, `.') has been read.

   3.2  Lexeme Grammar
   
	The lexemes of our language can also be
	specified by the grammar:

	lexeme ::= symbol | number | operator
		 | assignment | end-of-file
		 | left-parenthesis | right-parenthesis
		 | semi-colon

	symbol ::= letter letter-or-digit*

	letter-or-digit ::= letter | digit

	letter ::= a | b | ... | z | A | B | ... | Z

	digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

	number ::= <C++ integer beginning with a digit>

	operator ::= + | - | * | /

	assignment ::= =

	end-of-file ::= .

	left-parenthesis ::= (

	right-parenthesis ::= )

	semi-colon ::= ;

	file ::= { whitespace lexeme }* whitespace

	whitespace ::= whitespace-character*

	whitespace-character ::=
		<character skipped by cin >> ws;>

   3.3 Lexeme DFA
   
       The lexemes of this language can also be
       specified by a deterministic finite automaton:

       c is current character

       BEGIN_LEXEME:   isspace(c) ---> BEGIN_LEXEME
			    {discard}
		       isalpha(c) ---> SYMBOL
			    {accept}
		       isdigit(c) ---> NUMBER
			    {putback}
		       + ---> BEGIN_LEXEME
			    {accept; output operator}
		       - ---> BEGIN_LEXEME
			    {accept; output operator}
		       * ---> BEGIN_LEXEME
			    {accept; output operator}
		       / ---> BEGIN_LEXEME
			    {accept; output operator}
		       = ---> BEGIN_LEXEME
			    {discard}
			    {output assignment}
		       . ---> BEGIN_LEXEME
			    {discard}
			    {output end-of-file}
		       ( ---> BEGIN_LEXEME
			    {discard}
			    {output left-parenthesis}
		       ) ---> BEGIN_LEXEME
			    {discard}
			    {output right-parenthesis}
		       ; ---> BEGIN_LEXEME
			    {discard}
			    {output semi-colon}
		       other ---> BEGIN_LEXEME
			    {announce c is erroneous}
			    {discard}

       SYMBOL:	       isalpha(c) ---> SYMBOL
			    {accept}
		       isdigit(c) ---> SYMBOL
			    {accept}
		       other ---> BEGIN_LEXEME
			    {putback}
			    {output symbol}

       NUMBER:	       {call cin >> i for int i}
			   {output i as number}
			   ---> BEGIN_LEXEME

       Here each transition is associated with actions.
       There is a lexeme string buffer that holds the
       lexeme being accumulated.  The possible actions
       regarding the character c being input and the
       string buffer are:

	accept		Move c to the end of the
			lexeme string buffer.

	discard		Remove c from the input, but
			leave the lexeme string buffer
			untouched.

	putback		Leave both the input and the
			lexeme string buffer untouched.
			This is like c = get() followed
			by putback(c).

       The following additional actions are

	output ...	Output the contents of the
			lexeme string buffer as a ...
			and empty the lexeme string
			buffer.

	announce ...	Print an error message.

       In addition we have been lazy in writing the part
       of the DFA for numbers, leaving it to the C++ >>
       operator for integers to define that part of the
       DFA.

       The DFA form of the lexical analyzer represents
       the actions, but not the form, of the actual
       lexical analyzer code.

   3.4 Lexeme Data Grammar

       Lexemes are represented in memory by structures
       that have members giving the type of the lexeme,
       the name of a symbol, the value of a number,
       the op code of an operation, etc.

       A data grammar for these structures is:

	   lexeme ::= ( type, pname, op, value, inited,
			next )

	   type ::= LEX_SYMBOL | LEX_NUMBER | LEX_OP
		  | LEX_ASSIGN | LEX_EOF
		  | LEX_LEFT_PAREN
		  | LEX_RIGHT_PAREN | LEX_SEMICOLON

	    pname ::= <C character string>

	    op ::= OP_PLUS | OP_MINUS | OP_TIMES
		 | OP_DIVIDE

	    value ::= <an avalue>

	    inited ::= true | false

	    next ::= pointer-to-lexeme | NULL

	    symtab ::= pointer-to-first-lexeme

       Symbols have their name stored in pname.

       Operations have their op code stored in op.

       Numbers have their value stored in `value'.
       
       The user defined `avalue' type, denoting an
       `arithmetic value', has operators +, -, *, /
       like a number type.  It can be read by ein >> x
       for an avalue x.

       Symbols may also have a `value', and have the
       inited member set to true if the symbol has been
       given a value (initialized).
       
       Symbols are linked together into a symbol table
       using the `next' member and the `symtab' static
       variable.  The symbol table is used to avoid
       creating two symbols with the same name.

   3.5 Lexeme Data Code Synopsis

       const int MAX_SYMBOL_LENGTH = 100;

       enum optype { OP_PLUS, OP_MINUS,
		     OP_TIMES, OP_DIVIDE };

       enum lextype { LEX_SYMBOL, LEX_NUMBER, LEX_OP,
      		      LEX_ASSIGN, LEX_EOF,
		      LEX_LEFT_PAREN, LEX_RIGHT_PAREN,
		      LEX_SEMICOLON };

       typedef lexeme_class * lexeme;

       class lexeme_class {
	   lextype	type;
	   char *	pname;
	   optype	op;
	   avalue	value;
	   bool		inited;
	   lexeme	next; }

       static lexeme symtab;	[In lexeme.cc]

    3.6 Lexical Analyzer Function Synopsis

	Let:	lexeme l;

	Input/Output of Lexemes:

	    cout << l

	    ein >> l

	Initialization of Lexeme Memory:

	    void init_lexemes ( void );

	Test for end of file in ein (no next lexeme):

	    bool end_lexemes ( void );

	Get the next lexeme (and permit lexeme backup):

	    lexeme get_lexeme ( void );

	Backup over the last lexeme gotten:

	    void unget_lexeme ( void );

	Flush one line of lexemes (use after error):

	    void flush_lexemes ( void );

4. Parser

   4.1 Grammar

       The grammar for our arithmetic language is:	

	   program ::= asst* .

	   asst    ::= symbol = expr ;

	   expr    ::= term { { + | - } term }*

	   term    ::= factor { { * | / } factor }*

	   factor  ::= symbol | number | ( expr )
		     | - factor

       In the subroutines of the parser, the `get_' is
       dropped from the names of syntactic units, so
       the names are

	   program
	   asst		(assignment statement)
	   expr		(expression)
	   term
	   factor

   4.2 Parser Construction

       The parser is a standard one-token-lookahead
       recursive descent parser, which computes a value
       for the parse tree nodes as it parses, and
       returns the value from each parse function.

5. Semantic Algorithm

   The semantics computes a value.  The values are of
   type `avalue' which has operators +, -, *, / defined
   on it.  The interpreter uses a definition in which
   avalue is typedef'ed to be the same as `long int'.

   The avalue can be input/output with >> and <<.

   5.1 Other Functions

       Initialize the value computing system:

	   void init_avalues ( void );

       Return the value of a number or symbol lexeme:

	   avalue evaluate ( lexeme l );

       Assign a value to a symbol lexeme:

	   void assign ( lexeme l, avalue v );

       The zero avalue (value of an empty program):

	   extern avalue zero_avalue;

   5.2 Compiler Semantics

       The compiler semantics algorithm places values
       in registers and emits code that computes
       with the registers.

       Values of avalue type store register names.

       Temporary registers $t0, $t1, ... $t7 are
       allocated to hold expression values within
       an assignment statement.  Each such register
       is freed when it is used.

       The evaluate function simply returns the value
       component of a symbol or number.

       When a number lexeme is read, ein >> is used
       to read the value component of the number lexeme,
       which is an avalue.  Reading an avalue allocates
       a temporary register to hold the value, emits an
       instruction to load the number into the register,
       and stores the register name as the number lexeme
       value.

       When the assign function is first applied to a
       symbol, it allocates a save register, one of
       $s0, $s1, ..., $s7, to hold the value of the
       variable named by the symbol, and stores the
       name of this register as the symbol lexeme value.

       The assign function always emits code to move its
       value register to the symbol register.  The value
       register is freed if it is a temporary register.

       Operators such as + allocate a temporary
       register, emit code to compute their result in
       that register given inputs in the input
       registers, and free the input registers if these
       are temporary registers.

6. Top Level Interpreter

   The top level interpreter, or main program,
   repeatedly initializes lexemes and avalues, prompts
   the user, calls the `program' parser routine to parse
   the next program in the input, and prints the value.

   If there is no next lexeme when `program' is about
   to be called, the program terminates.

   If there is an error during parsing (which includes
   evaluation or compilation), the appropriate error
   function prints an error message and does a longjmp
   to the top level, which flushes the input, prints
   a line feed, and prompts the user for another
   program.