Compiler Principles Experiment 1 Lexical Analyzer Design

Source code repository: CompilePrincipleLearning/experiment_1 · yusixian/CompilePrincipleLearning (github.com)

1. Experiment Objectives#

Deeply understand finite automata and their applications
Master the method of constructing a finite automaton to recognize words based on the lexical rules of the language
Basic mastery of the development method of lexical analysis programs
Be able to design a lexical scanner program to perform lexical analysis on source programs and output a sequence of words

2. Experiment Content and Requirements#

Write a lexical analysis program to recognize words.

The DFA for various types of words in a certain language is shown in the figure below. Write a program to implement:

Input: txt file (stores the source program to be analyzed)
Output: Identify various independent words from the input source program, namely the five categories of basic reserved words, identifiers, constants, operators, and delimiters. Output the category code and the value of each word in sequence. (When encountering an error, display "Error", then skip the erroneous part and continue displaying).

Output format: Representation of each word: (category code, word symbol value)

Requirement: Each recognized word should be output on a single line.

Each type of word in the source program must be present.

3. Experiment Process#

1. Designed DFA Transition Diagram#

Letters and underscores: letter -> A|B|…|Z|a|b|c|d…|y|z|_

Digits: digit1 -> 1~9 digit-> 0~9

Identifier definition: id -> letter(letter|digit)*

Operator definition: op -> +-*/%=!&|<>

Keyword definition: keyword -> int float const bool void char double struct return if else while do static break for switch case default continue true false

Delimiter definition: delimiter -> ; , ' " * */ ? : ( ) [ ] } { .

Integer definition: int -> (+|-)(0 | digit1 digit*)

Character constant: char -> letter|digit|……

String constant: string -> char*

Floating-point definition: double-> (0|(+|-)digit1 digit*)(.digit*)

The DFA I drew is shown in the figure.

2. Data Structures Used#

The output Token stream consists of type name + category code + value (this keyword/variable name/number/operator/delimiter), overloaded output stream.

struct Token {
    int type;   // category code
    string value;       // value keyword/variable name/number/operator/delimiter
    string category;    // type name corresponding to the category code
    Token(int type, string value, string category) : type(type), value(value), category(category) {}
    friend ostream& operator<<(ostream& os, const Token& t) {
        os << t.category << ", type: " << t.type << ", value: " << t.value;
        return os;
    }
};

3. Header File Declarations and Global Variable Definitions#

As follows, it should be very clear.

#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <map>
using namespace std;
const string CategoryFileName = "./categoryCode.txt";
const string CodeFileName = "./code.txt";
string keywords[22];  // keyword table category code 1-22
string operate[28];  // operator table category code 23-50
string delimiter[15];  // delimiter table category code 51-65
map<string, int> categoryCode;  // category code table
const string op = "+-*/%=!&|<>";
const int _EOF_ = -2;
const int _ERROR_ = -1;
enum { 
    _ID_, _INT_, _DOUBLE_, _OPERATOR_, _DELIMITER_, _KEYWORD_, _CHAR_, _STRING_, _COMMENT_, _SPACE_
};  // types
string cat[10] = { "id", "int", "double", "operator", "delimiter", "keyword", "char", "string", "comment", "space" };
struct Token {
    int type;   // category code
    string value;       // value keyword/variable name/number/operator/delimiter
    string category;    // type name corresponding to the category code
    Token(int type, string value, string category) : type(type), value(value), category(category) {}
    friend ostream& operator<<(ostream& os, const Token& t) {
        os << t.category << ", type: " << t.type << ", value: " << t.value;
        return os;
    }
};
int pos, len;  // current character position and length
string code, tempToken;  // currently recognized string
vector<Token> tokenList;  // stores recognized tokens

4. Function Summary#

(1) Function Summary Table#

Function Name	Brief Functionality
`readFile`	Reads the file function, returns a dynamic array of strings
`init`	Initialization function, reads category code files and keyword files, and performs corresponding assignments and initializations
`peek`	Peeks at the next character, returns that character if it exists, otherwise returns \0, which is the string end character
`isDigit`	Determines whether character ch is a digit 0-9
`isLetter`	Determines whether character ch is a letter or underscore (i.e., A-Z a-z _)
`isKeyword`	Determines whether string s is a keyword (in the keyword table)
`isOP`	Determines whether character ch is a single operator (in op)
`isOperator`	Determines whether string s is an operator (in the operator table)
`isDelimiter`	Determines whether character ch is a delimiter (in operate)
`judge`	Core function, determines and returns the enumeration type of the current character (`code[pos]`), and directly puts some special tokens into `tokenList` (such as comments, characters, and string constants)
`read_next`	Core function, reads the next character, and based on the returned enumeration type, puts the corresponding token into `tokenList`
`main`	Main program entry, from here, calls the `init` function to initialize

(2) Function Call Relationships#

5. Experiment Results#

Input#

code.txt

int main() {
    char ch = 'ss';
    string str = "Hello, World!"
    char ch2 = 's';
    init();
    double x = 10.31;/* some comment */
    int m = 0;
    int y = 310, m = 0.31;
    while(pos < len) {
        int flag = read_next();
        if(flag == _EOF_) break;
        if(flag != _ERROR_) {
            Token t(flag, tempToken);
            tokenList.push_back(t);
            cout << t << endl;
        } else cout << "Error!" << endl;
    }
    return 0;
}

Output#

$0I3RK{8 N%JNPNA</code></code></code></code></code></code></code></code></code></code></code></code></code></code></code></code>(WB_R.png$

4. Experiment Summary#

This experiment was quite interesting, and it was very rewarding when it finally ran smoothly. Personally, I feel that it is not necessary to be constrained by what algorithm to use; it is enough to clarify one's own thoughts on how to design the program to ensure correct recognition. There is mainly a priority approach, where spaces and newline characters are skipped, and then it is determined whether the character is a digit or a letter for corresponding processing, followed by some special delimiter checks, such as strings and comments. I believe the code is sufficient to clearly explain this process. This program currently only uses common symbols (.) to support decimals; if more are needed, modifications can be made after isdigit() in judge, which is not difficult. Obviously, the functions in the judge function can also be broken down into more detailed functions, but that can be completed later.

5. Answers to Questions#

Which aspects of program design affect the efficiency of lexical analysis? How to improve efficiency?#

Answer: There are still many parts that need optimization. For example, when determining whether it is a keyword, the current method is to read the string that may be an identifier or keyword into a character array and then match it one by one with the keyword table. This can be improved by checking while reading, which will increase efficiency. The same applies to delimiter matching.

Complete Code#

/*
 * @Author: cos
 * @Date: 2022-04-05 00:10:59
 * @LastEditTime: 2022-04-08 02:37:49
 * @LastEditors: cos
 * @Description: Lexical analyzer design implementation
 * @FilePath: \CS\experiment_1\demo\main.cpp
 */
#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <map>
using namespace std;
const string CategoryFileName = "./categoryCode.txt";
const string CodeFileName = "./code.txt";
string keywords[22];  // keyword table category code 1-22
string operate[28];  // operator table category code 23-50
string delimiter[15];  // delimiter table category code 51-65
map<string, int> categoryCode;  // category code table
const string op = "+-*/%=!&|<>";
const int _EOF_ = -2;
const int _ERROR_ = -1;
enum { 
    _ID_, _INT_, _DOUBLE_, _OPERATOR_, _DELIMITER_, _KEYWORD_, _CHAR_, _STRING_, _COMMENT_, _SPACE_
};  // types
string cat[10] = { "id", "int", "double", "operator", "delimiter", "keyword", "char", "string", "comment", "space" };
struct Token {
    int type;   // category code
    string value;       // value keyword/variable name/number/operator/delimiter
    string category;    // type name corresponding to the category code
    Token(int type, string value, string category) : type(type), value(value), category(category) {}
    friend ostream& operator<<(ostream& os, const Token& t) {
        os << t.category << ", type: " << t.type << ", value: " << t.value;
        return os;
    }
};
int pos, len;  // current character position and length
string code, tempToken;  // currently recognized string
vector<Token> tokenList;  // stores recognized tokens
// Read file
vector<string> readFile(string fileName) {
    vector<string> res;
    try {
        ifstream fin;
        fin.open(fileName);
        string temp;
        while (getline(fin, temp))
            res.push_back(temp);
        return res;
    } catch(const exception& e) {
        cerr << e.what() << '\n';
        return res;
    }
}
void init() {
    vector<string> res = readFile(CategoryFileName); 
    // cout << "len:" << len << endl;
    for(int i = 0; i < 22; ++i) {
        keywords[i] = res[i];
        categoryCode[res[i]] = i+1;
        // cout << "keyword:" << res[i] << endl;
    }
    for(int i = 0; i < 28; ++i) {
        operate[i] = res[i + 22];
        categoryCode[res[i+22]] = i+23;
        // cout << "operate:" << res[i + 22] << endl;
    }
    for(int i = 0; i < 15; ++i) {
        delimiter[i] = res[i + 50];
        categoryCode[res[i+50]] = i+51;
        // cout << "delimiter:" << res[i + 50] << endl;
    }
    res = readFile(CodeFileName);
    for(int i = 0; i < res.size(); ++i)
        code += res[i]+'\n';
    len = code.size();
}
char peek() {
    if (pos+1 < len) return code[pos+1];
    else return '\0';
}
inline bool isDigit(char c) {
    return c >= '0' && c <= '9';
}
// Whether it is a letter or underscore
inline bool isLetter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}
bool isKeyword(string s) {
    for(int i = 0; i < 22; ++i)
        if (s == keywords[i]) 
            return true;
    return false;
}
bool isOP(char ch) {
    return op.find(ch) != string::npos;
}
bool isOperator(string s) {
    for(int i = 0; i < 28; ++i)
        if (s == operate[i]) return true;
    return false;
}
bool isDelimiter(char ch) {
    for(int i = 0; i < 15; ++i)
        if (ch == delimiter[i][0]) return true;
    return false;
}
int judge(char ch) {
    if(ch == '\n' || ch == ' ') return _SPACE_;
    if(isDigit(ch)) {
        char nextChar = peek();
        if(ch == '0' && nextChar == '.') { // 0. something
            ++pos;
            if(!isDigit(peek()))   // . is not followed by a digit
                return _ERROR_;
            tempToken = "0.";
            while(isDigit(peek())) {
                tempToken += peek();
                ++pos;
            }
            return _DOUBLE_;    // 8
        } else if(ch == '0' && !isDigit(nextChar)) { // not a digit and not ., indicating a pure 0
            tempToken = "0";
            return _INT_;   // 5
        } else if(ch != '0') {  // digit1
            tempToken = ch;
            while(isDigit(peek())) {
                tempToken += peek();
                ++pos;
            }
            char nextChar = peek();
            if(nextChar == '.') {
                tempToken += nextChar;
                ++pos;
                nextChar = peek();
                if(isDigit(nextChar)) {
                    tempToken += peek();
                    ++pos;
                    while(isDigit(peek())) {
                        tempToken += peek();
                        ++pos;
                    }
                    return _DOUBLE_;    // 8
                } else return _ERROR_;
            } else return _INT_;    // 6
        } else {    // 0 + digit
            ++pos;
            return _ERROR_;         // ERROR
        }
    }
    if(isLetter(ch)) {
        tempToken = ch;
        char nextChar = peek();
        while( isLetter(nextChar) || isDigit(nextChar) ) { // identifier~
            tempToken += nextChar;
            ++pos;
            nextChar = peek();
        }
        return isKeyword(tempToken) ? _KEYWORD_ : _ID_;
    } 
    if(ch == '\"') {
        tokenList.push_back(Token(54, "\"", cat[_DELIMITER_]));
        tempToken = "";
        char nextChar = peek();
        while(nextChar != '\"') {
            tempToken += nextChar;
            ++pos;
            nextChar = peek();
        }
        tokenList.push_back(Token(69, tempToken, cat[_STRING_]));
        tokenList.push_back(Token(54, "\"", cat[_DELIMITER_]));
        pos += 2;
        return _STRING_;
    }
    if(ch == '\'') {
        tempToken = "";
        ++pos;
        char nextChar = peek();
        if(nextChar == '\'') {
            tokenList.push_back(Token(53, "\'", cat[_DELIMITER_]));
            tempToken += code[pos];
            tokenList.push_back(Token(68, tempToken, cat[_CHAR_]));
            tokenList.push_back(Token(53, "\'", cat[_DELIMITER_]));
            ++pos;
            return _CHAR_;
        } else if(code[pos] == '\'') {
            tokenList.push_back(Token(53, "\'", cat[_DELIMITER_]));
            tokenList.push_back(Token(68, tempToken, cat[_CHAR_]));  // empty string
            tokenList.push_back(Token(53, "\'", cat[_DELIMITER_]));
            return _CHAR_;
        } else {
            while(pos < len && nextChar != '\'') {
                ++pos;
                nextChar = peek();
            }
            ++pos;
            return _ERROR_;
        }
    }
    if(ch == '/') {
        if(peek() == '*') {
            ++pos;
            char nextChar = peek();
            ++pos;
            tempToken = "";
            while(pos < len) {
                if(nextChar == '*' && peek() == '/') {
                    tokenList.push_back(Token(55, "/*", cat[_DELIMITER_]));
                    tokenList.push_back(Token(71, tempToken, cat[_COMMENT_]));
                    tokenList.push_back(Token(56, "*/", cat[_DELIMITER_]));
                    ++pos;
                    ++pos;
                    return _COMMENT_;
                } else {
                    tempToken += nextChar;
                    nextChar = peek();
                    ++pos;
                }
            }
            return _ERROR_;
        }
    }

    if(isOP(ch)) {   // op operator
        tempToken = "";
        tempToken += ch;
        char nextChar = peek();
        if(isOP(nextChar)) {
            if(isOperator(tempToken + nextChar)) {
                tempToken += nextChar;
                ++pos;
                return _OPERATOR_;      // 15
            } else return _OPERATOR_;   // 14
        } else return _OPERATOR_;       // 14
    }
    if(isDelimiter(ch)) {
        tempToken = "";
        tempToken += ch;
        return _DELIMITER_;
    }
    return _ERROR_;  
}
int read_next() {
    int type = judge(code[pos]);
    while(pos < len && type == _SPACE_) {
        ++pos;
        type = judge(code[pos]);
    }
    if(pos >= len) return _EOF_; 
    ++pos;
    if(type == _ERROR_) return _ERROR_;
    if(type == _DOUBLE_) {
        // cout << "double: " << tempToken << endl;
        tokenList.push_back(Token(67, tempToken, cat[_DOUBLE_]));
        return _DOUBLE_;
    }
    if(type == _INT_) {
        // cout << "int: " << tempToken << endl;
        tokenList.push_back(Token(66, tempToken, cat[_INT_]));
        return _INT_;
    }
    if(type == _ID_) {  // identifier
        // cout << "id: " << tempToken << endl;
        tokenList.push_back(Token(70, tempToken, cat[_ID_]));
        return _ID_;
    }
    if(type == _OPERATOR_ || type == _KEYWORD_ || type == _DELIMITER_) {
        tokenList.push_back(Token(categoryCode[tempToken], tempToken, cat[type]));
        return type;
    }
    return _ERROR_;
}
int main() {
    init();
    while(pos < len) {
        int flag = read_next();
        if(flag == _EOF_) break;
        else if(flag == _ERROR_) tokenList.push_back(Token(_ERROR_, "ERROR!", "ERROR"));
    }
    for(auto t : tokenList)
        cout << t << endl;
    return 0;
}