//// Specifications for get_token and backup_token
//// in scanner.cc
////
//// File: scanner.sp
//// Author: course
//// Version: 3
////

///////////////////////////////////////////////////////
//
// Function: get_token
// Argument type: istream &
// Return value type: token
// Assumes: information in scanner.h
//

Get_token() will read characters from its argument input
stream and return an instance of the token class. By "the
token string" we mean the string of characters that
represents the token in the input stream, excluding any
preceding white space or comments, and by "the token
instance" we mean the instance of the token class returned
by get_token.

A space character is defined as a character for which the C
function isspace() returns true (see <ctype.h>).

A symbolic character is defined as any character that can
occur in a Common LISP symbol name without being escaped
(see below). For convenience, this is the same as any
character for which the C++ function symbolic() returns true
(see scanner.cc), except that symbolic() also returns true
for `|'.

Characters are read either in normal mode or in escaped
mode. The vertical bar character `|' toggles the mode; e.g.
in the symbol name x|;|yabc|d|ef the characters `;' and `d'
are read in escaped mode. A character read in escaped mode
is said to be an escaped character: e.g. an escaped `;' or
escaped `d'. Note that all lowercase letters are converted
to uppercase by the scanner unless they are escaped.

An atom representative is defined as a sequence of
consecutive atom characters, where an atom character is a
symbolic character or an escaped character or the character
`|'. Thus the atom representative x|;|d has five atom
characters, the middle one being the escaped character `;'.
When such an atom representative is converted to a symbol,
the `|' characters are deleted and unescaped lowercase
letters are converted to uppercase, so x|;|d yields the
symbol name X;D.
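
The conversion from an atom representative to a symbol name
described above can be sketched as follows. This is only an
illustration, not the code in scanner.cc: the function name
atom_to_symbol_name is hypothetical, and the sketch assumes
its input is already a complete atom representative (it does
not check for symbolic characters or for an unterminated
`|').

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of scanner.cc): turns an atom
// representative into the symbol name described in the text.
std::string atom_to_symbol_name(const std::string &rep)
{
    std::string name;
    bool escaped = false;                 // scanning starts in normal mode
    for (char ch : rep) {
        if (ch == '|') {
            escaped = !escaped;           // `|' toggles the mode and is deleted
        } else if (escaped) {
            name += ch;                   // escaped characters are kept as read
        } else {
            // Unescaped characters are folded to uppercase; toupper()
            // leaves non-letters unchanged.
            name += static_cast<char>(
                std::toupper(static_cast<unsigned char>(ch)));
        }
    }
    return name;
}
```

Under these assumptions, atom_to_symbol_name("x|;|d") yields
X;D, the name x|;|yabc|d|ef yields X;YABCdEF, and the input
|| yields the empty name.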

(1) Characters will be read from the input stream s by
    c = s.get() or c = s.peek(), and at most one gotten
    character may be put back by s.putback(c). End of file
    will be detected by testing c == EOF (where EOF is a
    standard predefined constant, normally equal to -1).
    Note that EOF is not an ASCII character and cannot be
    stored in a character string. It also cannot be put back
    using s.putback(c), but it is desirable to simulate
    putting back an EOF by other means. Thus an input stream
    is a string of ASCII characters followed by an infinite
    sequence of EOF's.

(2) A token string will not contain unescaped spaces. Any
    unescaped spaces encountered in the input stream will be
    ignored. Note, however, that unescaped spaces delimit a
    token string.

(3) An unescaped semicolon `;' and all characters following
    it through the next line feed `\n' or end of file EOF
    form a comment, which will be treated as a single
    unescaped space character. Note that since comments are
    treated as a single unescaped space, they may also
    delimit token strings.

(4) The following single unescaped characters will be
    treated as complete one character token strings and the
    token instance will have the indicated token_type:

        (       LPAREN_TOKEN
        )       RPAREN_TOKEN
        ]       RBRACKET_TOKEN
        '       QUOTE_TOKEN

    For example, upon reading the (unescaped) character `(',
    get_token should return a token instance with the
    token_type LPAREN_TOKEN.

(5) An EOF will be treated like a complete one character
    token string and the token instance will have the
    EOF_TOKEN token_type.

(6) The following pair of consecutive unescaped characters
    will be treated as a complete two character token string
    and the token instance will have the indicated
    token_type:

        #'      FNQUOTE_TOKEN

(7) A token string representing an atom begins with a
    symbolic character other than `#' or with the vertical
    bar character `|'. The next non-escaped, non-symbolic,
    non-`|' character will end the token string and not be
    part of the token string. The token instance may be
    empty, i.e. have no characters; e.g.
    on the input `||' or `||||'.

(8) If the token string is the atom `.', the token instance
    will have the token_type DOT_TOKEN. Note `.' is a dot
    token, but `||.' and `.||' are symbols.

(9) If the token string is an atom consisting of an optional
    `+' or `-' sign followed by digits, with nothing else,
    and with at least one digit, then the token instance
    will have the token_type NUMBER_TOKEN. Note that `||9'
    and `+998||' are symbols and not numbers.

(10) If the token string is an atom and the token instance
    does not have a token_type defined by rule (8) or (9),
    then the token instance will have the token_type
    SYMBOL_TOKEN.

(11) The # character cannot begin any token except #'. Thus
    #|| is NOT a legal symbol token. However, ||# is a legal
    symbol token, and # can appear in an atom token string
    anywhere EXCEPT at the beginning.

(12) If any character is encountered in the input stream
    that cannot be processed according to rules (1)-(11),
    the character will be treated as a one character token
    string and the token instance will have the token_type
    ERROR_TOKEN. Thus in #x the # will be a one character
    token of type ERROR_TOKEN. The x will be part of the
    next token. Similarly any unescaped " or ` will be a one
    character token of type ERROR_TOKEN. (Our system does
    not handle these.)

(13) If an ESCAPED end of file is encountered while scanning
    an atom (that is, EOF is read while in escaped mode),
    the entire atom will be treated as a token instance of
    type ERROR_TOKEN.

(14) If the token instance has the token_type NUMBER_TOKEN,
    then the token instance will have a value component
    computed by first applying the C function atol() to the
    token string (see <stdlib.h>), then calling
    make_fixnum() on the result.

(15) If the token instance has the token_type SYMBOL_TOKEN,
    then the token instance will have a value component
    computed by applying make_symbol() to the token string
    after the `|' characters have been removed and after
    unescaped lowercase letters have been converted to
    uppercase (use the C function toupper() in <ctype.h>).

(16) If the value component of the token instance cannot be
    determined according to rule (14) or (15), it will be
    undefined.

(17) If get_token() finds a token string (which it will
    always do, possibly finding an EOF), then it may read
    past the end of the token string by at most one
    character, and put back that one character into the
    input stream. Putting back EOF's needs to be simulated
    somehow: since every EOF is followed by an EOF, all that
    is necessary is to avoid calling s.putback(c) if
    c == EOF.

(18) Get_token() will always save the token instance it is
    about to return in a static storage location for use by
    backup_token below.

///////////////////////////////////////////////////////
//
// Function: backup_token
// Argument type: none
// Return value type: none
// Assumes: information in scanner.h
//

When backup_token() is called, the next call to get_token()
will not read any characters from the input stream. Instead,
get_token() will return the token instance it returned the
last time it was called (see rule (18) above). Only the
immediately next call to get_token() will be affected by a
call to backup_token(). The effect of calling backup_token()
more than once between calls to get_token() is undefined.
Neither backup_token() nor the subsequent call to
get_token() will operate on the input stream.