コンパイラ原理実験1 字句解析器設計

ソースコードリポジトリ：CompilePrincipleLearning/experiment_1 · yusixian/CompilePrincipleLearning (github.com)

一。実験目的#

1．有限オートマトンとその応用を深く理解する
2．言語の字句規則に基づいて、その単語を認識する有限オートマトンを構築する方法を習得する
3．字句解析プログラムの開発方法を基本的に習得する
4．字句スキャナプログラムを設計し、ソースプログラムを字句解析し、単語の列を出力できるようにする

二。実験内容及び要求#

単語を認識する字句解析プログラムを作成する。

ある言語における各種単語の DFA は以下の通りであり、プログラムを作成して実現する：

1、入力：txt ファイル（解析するソースプログラムを格納）

2、出力：入力されたソースプログラムから、独立した意味を持つ各単語、すなわち基本的な予約語、識別子、定数、演算子、区切り記号の 5 つのカテゴリを認識し、それぞれの単語の種別コードと単語シンボル自身の値を順次出力する。（エラーが発生した場合は「Error」と表示し、エラー部分をスキップして続行する）。

出力形式：各単語の表現：（種別コード、単語シンボル自身の値）

要求：認識された各単語を 1 行ずつ出力する。

ソースプログラム中の各種単語は必ず存在すること。

三。実験過程#

1、設計した DFA 変換図#

アルファベットとアンダースコア：letter -> A|B|…|Z|a|b|c|d…|y|z|_

数字：digit1 -> 1~9 digit-> 0~9

識別子定義：id -> letter(letter|digit)*

演算子定義：op -> +-*/%=!&|<>

キーワード定義：keyword -> int float const bool void char double struct return if else while do static break for switch case default continue true false

区切り記号定義：delimiter -> ; , ' " * */ ? : ( ) [ ] } { .

整数型定義：int -> (+|-)(0 | digit1 digit*)

文字定数：char -> letter|digit|……

文字列定数：string -> char*

実数型定義：double-> (0|(+|-)digit1 digit*)(.digit*)

私が描いた DFA は以下の通りです。

2、使用するデータ構造#

出力 Token 流はタイプ名 + 種別コード + 値（そのキーワード / 変数名 / 数字 / 演算子 / 区切り記号）、出力ストリームをオーバーロードします。

struct Token {
    int type;   // 種別コード
    string value;       // 値 キーワード/変数名/数字/演算子/区切り記号
    string category;    // 種別コードに対応するタイプ名
    Token(int type, string value, string category) : type(type), value(value), category(category) {}
    friend ostream& operator<<(ostream& os, const Token& t) {
        os << t.category << ", type: " << t.type << ", value: " << t.value;
        return os;
    }
};

3、ヘッダーファイルの宣言とグローバル変数の定義#

以下のように、非常にわかりやすいでしょう。

#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <map>
using namespace std;
const string CategoryFileName = "./categoryCode.txt";
const string CodeFileName = "./code.txt";
string keywords[22];  // キーワード表 種別コード1-22
string operate[28];  // 演算子表 種別コード23-50
string delimiter[15];  // 区切り記号表 種別コード51-65
map<string, int> categoryCode;  // 種別コード表
const string op = "+-*/%=!&|<>";
const int _EOF_ = -2;
const int _ERROR_ = -1;
enum { 
    _ID_, _INT_, _DOUBLE_, _OPERATOR_, _DELIMITER_, _KEYWORD_, _CHAR_, _STRING_, _COMMENT_, _SPACE_
};  // タイプ
string cat[10] = { "id", "int", "double", "operator", "delimiter", "keyword", "char", "string", "comment", "space" };
struct Token {
    int type;   // 種別コード
    string value;       // 値 キーワード/変数名/数字/演算子/区切り記号
    string category;    // 種別コードに対応するタイプ名
    Token(int type, string value, string category) : type(type), value(value), category(category) {}
    friend ostream& operator<<(ostream& os, const Token& t) {
        os << t.category << ", type: " << t.type << ", value: " << t.value;
        return os;
    }
};
int pos, len;  // 現在の文字位置と長さ
string code, tempToken;  // 現在認識している文字列
vector<Token> tokenList;  // 認識されたトークンを格納

4、関数の概要#

（1）関数概要表#

関数名	機能概要
`readFile`	ファイルを読み込む関数、string の動的配列を返す
`init`	初期化関数、この関数で種別コードファイル、キーワードファイルを読み込み、相応の値を設定し初期化を行う
`peek`	次の文字を探る、存在する場合はその文字を返し、存在しない場合は \0 すなわち文字列の終端を返す
`isDigit`	文字 ch が数字 0-9 であるかを判断する
`isLetter`	文字 ch がアルファベットまたはアンダースコア（すなわち A-Z a-z _ ）であるかを判断する
`isKeyword`	文字列 s がキーワードであるかを判断する（キーワード表に存在するか）
`isOP`	文字 ch が単一の演算子であるかを判断する（op に存在するか）
`isOperator`	文字列 s が演算子であるかを判断する（演算子表に存在するか）
`isDelimiter`	文字 ch が区切り記号であるかを判断する（operate に存在するか）
`judge`	中心関数、現在の文字（`code[pos]`）の列挙型を判断し、特定のトークンを処理した後、直接`tokenList`に入れる（コメント、文字、文字列定数など）
`read_next`	中心関数、次の文字を読み込み、返された列挙型に基づいて対応するトークンを`tokenList`に入れる
`main`	主プログラムのエントリーポイント、ここから入る、`init`関数を呼び出して初期化

（2）関数の呼び出し関係#

5、実験結果#

入力#

code.txt

int main() {
    char ch = 'ss';
    string str = "Hello, World!"
    char ch2 = 's';
    init();
    double x = 10.31;/* some comment */
    int m = 0;
    int y = 310, m = 0.31;
    while(pos < len) {
        int flag = read_next();
        if(flag == _EOF_) break;
        if(flag != _ERROR_) {
            Token t(flag, tempToken);
            tokenList.push_back(t);
            cout << t << endl;
        } else cout << "Error!" << endl;
    }
    return 0;
}

出力#

$0I3RK{8 N%JNPNA</code></code></code></code></code></code></code></code></code></code></code></code></code></code></code></code>(WB_R.png$

四、実験まとめ#

今回の実験は非常に面白く、最終的に動作したときは非常に達成感がありました。個人的には、どのアルゴリズムを使用するかにこだわる必要はなく、自分の考えを整理し、どのように設計すればこのプログラムが正しく認識できるかを考えることが重要だと思います。主に優先順位の考え方があり、空白と改行はスキップされ、最初に数字または文字であるかを判断し、次に相応の処理を行い、特定の区切り記号の判断を行います。文字列やコメントなどです。このプロセスを説明するには、コードが十分だと思います。このプログラムは、現在は一般的な記号（.）のみを使用して小数をサポートしていますが、もっと必要な場合は judge の isdigit () の後で修正できます。修正はそれほど難しくありません。明らかに、judge 関数の中の関数はさらに詳細な関数に分割することができますが、それは後で補完します。

五、考察問題の回答#

プログラム設計のどの段階が字句解析の効率に影響を与えるか？効率をどう向上させるか？#

答：最適化の余地はまだ多くあります。例えば、キーワードであるかを判断する際、現在の方法は、識別子またはキーワードである可能性のある文字列を読み終えた後、文字配列に格納してからキーワード表と逐一照合していますが、読み込みの際に判断するように変更すれば、効率が向上します。また、区切り記号の照合も同様です。

完全なコード#

/*
 * @Author: cos
 * @Date: 2022-04-05 00:10:59
 * @LastEditTime: 2022-04-08 02:37:49
 * @LastEditors: cos
 * @Description: 字句解析器設計実装
 * @FilePath: \CS\experiment_1\demo\main.cpp
 */
#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <map>
using namespace std;
const string CategoryFileName = "./categoryCode.txt";
const string CodeFileName = "./code.txt";
string keywords[22];  // キーワード表 種別コード1-22
string operate[28];  // 演算子表 種別コード23-50
string delimiter[15];  // 区切り記号表 種別コード51-65
map<string, int> categoryCode;  // 種別コード表
const string op = "+-*/%=!&|<>";
const int _EOF_ = -2;
const int _ERROR_ = -1;
enum { 
    _ID_, _INT_, _DOUBLE_, _OPERATOR_, _DELIMITER_, _KEYWORD_, _CHAR_, _STRING_, _COMMENT_, _SPACE_
};  // タイプ
string cat[10] = { "id", "int", "double", "operator", "delimiter", "keyword", "char", "string", "comment", "space" };
struct Token {
    int type;   // 種別コード
    string value;       // 値 キーワード/変数名/数字/演算子/区切り記号
    string category;    // 種別コードに対応するタイプ名
    Token(int type, string value, string category) : type(type), value(value), category(category) {}
    friend ostream& operator<<(ostream& os, const Token& t) {
        os << t.category << ", type: " << t.type << ", value: " << t.value;
        return os;
    }
};
int pos, len;  // 現在の文字位置と長さ
string code, tempToken;  // 現在認識している文字列
vector<Token> tokenList;  // 認識されたトークンを格納
// ファイルを読む
vector<string> readFile(string fileName) {
    vector<string> res;
    try {
        ifstream fin;
        fin.open(fileName);
        string temp;
        while (getline(fin, temp))
            res.push_back(temp);
        return res;
    } catch(const exception& e) {
        cerr << e.what() << '\n';
        return res;
    }
}
void init() {
    vector<string> res = readFile(CategoryFileName); 
    // cout << "len:" << len << endl;
    for(int i = 0; i < 22; ++i) {
        keywords[i] = res[i];
        categoryCode[res[i]] = i+1;
        // cout << "keyword:" << res[i] << endl;
    }
    for(int i = 0; i < 28; ++i) {
        operate[i] = res[i + 22];
        categoryCode[res[i+22]] = i+23;
        // cout << "operate:" << res[i + 22] << endl;
    }
    for(int i = 0; i < 15; ++i) {
        delimiter[i] = res[i + 50];
        categoryCode[res[i+50]] = i+51;
        // cout << "delimiter:" << res[i + 50] << endl;
    }
    res = readFile(CodeFileName);
    for(int i = 0; i < res.size(); ++i)
        code += res[i]+'\n';
    len = code.size();
}
char peek() {
    if (pos+1 < len) return code[pos+1];
    else return '\0';
}
inline bool isDigit(char c) {
    return c >= '0' && c <= '9';
}
// アルファベットまたはアンダースコアかどうか
inline bool isLetter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}
bool isKeyword(string s) {
    for(int i = 0; i < 22; ++i)
        if (s == keywords[i]) 
            return true;
    return false;
}
bool isOP(char ch) {
    return op.find(ch) != string::npos;
}
bool isOperator(string s) {
    for(int i = 0; i < 28; ++i)
        if (s == operate[i]) return true;
    return false;
}
bool isDelimiter(char ch) {
    for(int i = 0; i < 15; ++i)
        if (ch == delimiter[i][0]) return true;
    return false;
}
int judge(char ch) {
    if(ch == '\n' || ch == ' ') return _SPACE_;
    if(isDigit(ch)) {
        char nextChar = peek();
        if(ch == '0' && nextChar == '.') { // 0.いくつか
            ++pos;
            if(!isDigit(peek()))   // .の後に数字がない
                return _ERROR_;
            tempToken = "0.";
            while(isDigit(peek())) {
                tempToken += peek();
                ++pos;
            }
            return _DOUBLE_;    // 8
        } else if(ch == '0' && !isDigit(nextChar)) { // 単なる0
            tempToken = "0";
            return _INT_;   // 5
        } else if(ch != '0') {  // digit1
            tempToken = ch;
            while(isDigit(peek())) {
                tempToken += peek();
                ++pos;
            }
            char nextChar = peek();
            if(nextChar == '.') {
                tempToken += nextChar;
                ++pos;
                nextChar = peek();
                if(isDigit(nextChar)) {
                    tempToken += peek();
                    ++pos;
                    while(isDigit(peek())) {
                        tempToken += peek();
                        ++pos;
                    }
                    return _DOUBLE_;    // 8
                } else return _ERROR_;
            } else return _INT_;    // 6
        } else {    // 0+数字
            ++pos;
            return _ERROR_;         // エラー
        }
    }
    if(isLetter(ch)) {
        tempToken = ch;
        char nextChar = peek();
        while( isLetter(nextChar) || isDigit(nextChar) ) { // 識別子~
            tempToken += nextChar;
            ++pos;
            nextChar = peek();
        }
        return isKeyword(tempToken) ? _KEYWORD_ : _ID_;
    } 
    if(ch == '\"') {
        tokenList.push_back(Token(54, "\"", cat[_DELIMITER_]));
        tempToken = "";
        char nextChar = peek();
        while(nextChar != '\"') {
            tempToken += nextChar;
            ++pos;
            nextChar = peek();
        }
        tokenList.push_back(Token(69, tempToken, cat[_STRING_]));
        tokenList.push_back(Token(54, "\"", cat[_DELIMITER_]));
        pos += 2;
        return _STRING_;
    }
    if(ch == '\'') {
        tempToken = "";
        ++pos;
        char nextChar = peek();
        if(nextChar == '\'') {
            tokenList.push_back(Token(53, "\'", cat[_DELIMITER_]));
            tempToken += code[pos];
            tokenList.push_back(Token(68, tempToken, cat[_CHAR_]));
            tokenList.push_back(Token(53, "\'", cat[_DELIMITER_]));
            ++pos;
            return _CHAR_;
        } else if(code[pos] == '\'') {
            tokenList.push_back(Token(53, "\'", cat[_DELIMITER_]));
            tokenList.push_back(Token(68, tempToken, cat[_CHAR_]));  // 空文字列
            tokenList.push_back(Token(53, "\'", cat[_DELIMITER_]));
            return _CHAR_;
        } else {
            while(pos < len && nextChar != '\'') {
                ++pos;
                nextChar = peek();
            }
            ++pos;
            return _ERROR_;
        }
    }
    if(ch == '/') {
        if(peek() == '*') {
            ++pos;
            char nextChar = peek();
            ++pos;
            tempToken = "";
            while(pos < len) {
                if(nextChar == '*' && peek() == '/') {
                    tokenList.push_back(Token(55, "/*", cat[_DELIMITER_]));
                    tokenList.push_back(Token(71, tempToken, cat[_COMMENT_]));
                    tokenList.push_back(Token(56, "*/", cat[_DELIMITER_]));
                    ++pos;
                    ++pos;
                    return _COMMENT_;
                } else {
                    tempToken += nextChar;
                    nextChar = peek();
                    ++pos;
                }
            }
            return _ERROR_;
        }
    }

    if(isOP(ch)) {   // op演算子
        tempToken = "";
        tempToken += ch;
        char nextChar = peek();
        if(isOP(nextChar)) {
            if(isOperator(tempToken + nextChar)) {
                tempToken += nextChar;
                ++pos;
                return _OPERATOR_;      // 15
            } else return _OPERATOR_;   // 14
        } else return _OPERATOR_;       // 14
    }
    if(isDelimiter(ch)) {
        tempToken = "";
        tempToken += ch;
        return _DELIMITER_;
    }
    return _ERROR_;  
}
int read_next() {
    int type = judge(code[pos]);
    while(pos < len && type == _SPACE_) {
        ++pos;
        type = judge(code[pos]);
    }
    if(pos >= len) return _EOF_; 
    ++pos;
    if(type == _ERROR_) return _ERROR_;
    if(type == _DOUBLE_) {
        // cout << "double: " << tempToken << endl;
        tokenList.push_back(Token(67, tempToken, cat[_DOUBLE_]));
        return _DOUBLE_;
    }
    if(type == _INT_) {
        // cout << "int: " << tempToken << endl;
        tokenList.push_back(Token(66, tempToken, cat[_INT_]));
        return _INT_;
    }
    if(type == _ID_) {  // 識別子
        // cout << "id: " << tempToken << endl;
        tokenList.push_back(Token(70, tempToken, cat[_ID_]));
        return _ID_;
    }
    if(type == _OPERATOR_ || type == _KEYWORD_ || type == _DELIMITER_) {
        tokenList.push_back(Token(categoryCode[tempToken], tempToken, cat[type]));
        return type;
    }
    return _ERROR_;
}
int main() {
    init();
    while(pos < len) {
        int flag = read_next();
        if(flag == _EOF_) break;
        else if(flag == _ERROR_) tokenList.push_back(Token(_ERROR_, "ERROR!", "ERROR"));
    }
    for(auto t : tokenList)
        cout << t << endl;
    return 0;
}