Compiling a Functional Language Using C++, Part 13

Table of Contents

In part 12, we added let/in and lambda expressions to our compiler. At the end of that post, I mentioned that before we move on to bigger and better things, I wanted to take a step back and clean up the compiler. Now is the time to do that.

In particular, I identified four things that could be improved or cleaned up:

Error handling. We need to stop using throw 0 and start using assert. We can also make our errors much more descriptive by including source locations in the output.
Name mangling. I don’t think I got it quite right last time. Now is the time to clean it up.
Code organization. I think we can benefit from a top-level class, and a more clear “dependency order” between the various classes and structures we’ve defined.
Code style. In particular, I’ve been lazily using struct in a lot of places. That’s not a good idea; it’s better to use class, and only expose some fields and methods to the rest of the code.

Error Reporting and Handling

The previous post was rather long, which led me to omit a rather important aspect of the compiler: proper error reporting. Once again our compiler has instances of throw 0, which is a cheap way of avoiding properly handling a runtime error. Before we move on, it’s best to get rid of such blatantly lazy code.

Our existing exceptions (mostly type errors) can use some work, too. Even the most descriptive issues our compiler reports – unification errors – don’t include the crucial information of where the error is. For large programs, this means having to painstakingly read through the entire file to try figure out which subexpression could possibly have an incorrect type. This is far from the ideal debugging experience.

Addressing all this is a multi-step change in itself. We want to:

Replace all throw 0 code with actual exceptions.
Replace some exceptions that shouldn’t be possible for a user to trigger with assertions.
Keep track of source locations of each subexpression, so that we may be able to print it if it causes an error.
Be able to print out said source locations at will. This isn’t a necessity, but virtually all “big” compilers do this. Instead of reporting that an error occurs on a particular line, we will actually print the line.

Let’s start with gathering the actual location data.

Bison’s Locations

Bison actually has some rather nice support for location tracking. It can automatically assemble the “from” and “to” locations of a nonterminal from the locations of children, which would be very tedious to write by hand. We enable this feature using the following option:

From parser.y, line 46

46

%locations

There’s just one hitch, though. Sure, Bison can compute bigger locations from smaller ones, but it must get the smaller ones from somewhere. Since Bison operates on tokens, rather than characters, it effectively doesn’t interact with the source text at all, and can’t determine from which line or column a token originated. The task of determining the locations of input tokens is delegated to the tokenizer – Flex, in our case. Flex, on the other hand, doesn’t have a built-in mechanism for tracking locations. Fortunately, Bison provides a yy::location class that includes most of the needed functionality.

A yy::location consists of two source positions, begin and end, which themselves are represented using lines and columns. It also has the following methods:

yy::location::columns(int) advances the end position by the given number of columns, while begin stays the same. If begin and end both point to the beginning of a token, then columns(token_length) will move end to the token’s end, and thus make the whole location contain the token.
yy::location::lines(int) behaves similarly to columns, except that it advances end by the given number of lines, rather than columns. It also resets the columns counter to 1.
yy::location::step() moves begin to where end is. This is useful for when we’ve finished processing a token, and want to move on to the next one.

For Flex specifically, yyleng has the length of the token currently being processed. Rather than adding the calls to columns and step to every rule, we can define the YY_USER_ACTION macro, which is run before each token is processed.

From scanner.l, lines 12 through 14

12
13
14

#define YY_USER_ACTION \
    drv.get_file_manager().write(yytext, yyleng); \
    LOC.step(); LOC.columns(yyleng);

We’ll see why we are using LOC instead of something like location soon; for now, you can treat LOC as if it were a global variable declared in the tokenizer. Before processing each token, we ensure that the yy::location has its begin and end at the same position, and then advance end by yyleng columns. This is sufficient [note: This doesn't hold for all languages. It may be possible for a language to have tokens that contain \n, in which case, rather than just using yyleng, we'd need to add special logic to iterate over the token and detect the line breaks.

Also, this requires that the end of the previous token was correctly computed. ] to make LOC represent our token’s source position. For the moment, don’t worry too much about drv; this is the parsing driver, and we will talk about it shortly.

So now we have a “global” variable LOC that gives us the source position of the current token. To get it to Bison, we have to pass it as an argument to each of the make_TOKEN calls. Here are a few sample lines that should give you the general idea:

From scanner.l, lines 40 through 43



40
41
42
43



= { return yy::parser::make_EQUAL(LOC); }
[a-z][a-zA-Z]* { return yy::parser::make_LID(std::string(yytext), LOC); }
[A-Z][a-zA-Z]* { return yy::parser::make_UID(std::string(yytext), LOC); }
<<EOF>> { return yy::parser::make_YYEOF(LOC); }

That last line is actually new. Previously, we somehow got away without explicitly sending the end-of-file token to Bison. I suspect that this was due to some kind of implicit conversion of the Flex macro YY_NULL into a token; now that we have to pass a position to every token constructor, such an implicit conversion is probably impossible.

Now we have Bison computing source locations for each nonterminal. However, at the moment, we still aren’t using them. To change that, we need to add a yy::location argument to each of our ast nodes, as well as to the pattern subclasses, definition_defn and definition_data. To avoid breaking all the code that creates AST nodes and definitions outside of the parser, we’ll make this argument optional. Inside of ast.hpp, we define a new field as follows:

From ast.hpp, line 16

16

    yy::location loc;

Then, we add a constructor to ast as follows:

From ast.hpp, line 18

18

    ast(yy::location l) : env(nullptr), loc(std::move(l)) {}

Note that it’s not optional here, since ast itself is an abstract class, and thus will never be constructed directly. It is in the subclasses of ast that we provide a default value. The change is rather mechanical, but here’s an example from ast_binop:

From ast.hpp, lines 98 through 99

98
99

    ast_binop(binop o, ast_ptr l, ast_ptr r, yy::location lc = yy::location())
        : ast(std::move(lc)), op(o), left(std::move(l)), right(std::move(r)) {}

Finally, we tell Bison to pass the computed location data as an argument when constructing our data structures. This too is a mechanical change, and I think the following few lines demonstrate the general idea in sufficient detail:

From parser.y, lines 92 through 96



92
93
94
95
96



aMul
    : aMul TIMES app { $$ = ast_ptr(new ast_binop(TIMES, std::move($1), std::move($3), @$)); }
    | aMul DIVIDE app { $$ = ast_ptr(new ast_binop(DIVIDE, std::move($1), std::move($3), @$)); }
    | app { $$ = std::move($1); }
    ;

Here, the @$ character is used to reference the current nonterminal’s location data.

Line Offsets, File Input, and the Parsing Driver

There are three more challenges with printing out the line of code where an error occurred. First of all, to print out a line of code, we need to have that line of code available to us. We do not currently meet this requirement: our compiler reads code form stdin (as is default for Flex), and stdin doesn’t always support rewinding. This, in turn, means that once Flex has read a character from the input, it may not be possible to go back and retrieve that character again.

Second, even if we do have have the entire stream or buffer available to us, to retrieve an offset and length within that buffer from just a line and column number would be a lot of work. A naive approach would be to iterate through the input again, once more keeping track of lines and columns, and print the desired line once we reach it. However, this would lead us to redo a lot of work that our tokenizer is already doing.

Third, Flex’s input mechanism, even if it it’s configured not to read from stdin, uses a global file descriptor called yyin. However, we’re better off minimizing global state (especially if we want to read, parse, and compile multiple files in the future). While we’re configuring Flex’s input mechanism, we may as well fix this, too.

There are several approaches to fixing the first issue. One possible way is to store the content of stdin into a temporary file. Then, it’s possible to read from the file multiple times by using the C functions fseek and rewind. However, since we’re working with files, why not just work directly with the files created by the user? Instead of reading from stdin, we may as well take in a path to a file via argv, and read from there. Also, instead of fseek and rewind, we can just read the file into memory, and access it like a normal character buffer. This does mean that we can stick with stdin, but it’s more conventional to read source code from files, anyway.

To address the second issue, we can keep a mapping of line numbers to their locations in the source buffer. This is rather easy to maintain using an array: the first element of the array is 0, which is the beginning of the first line in any source file. From there, every time we encounter the character \n, we can push the current source location to the top, marking it as the beginning of another line. Where exactly we store this array is as yet unclear, since we’re trying to avoid global variables.

Finally, to begin addressing the third issue, we can use Flex’s reentrant option, which makes it so that all of the tokenizer’s state is stored in an opaque yyscan_t structure, rather than in global variables. This way, we can configure yyin without setting a global variable, which is a step in the right direction. We’ll work on this momentarily.

Our tokenizing and parsing stack has more global variables than just those specific to Flex. Among these variables is global_defs, which receives all the top-level function and data type definitions. We will also need some way of accessing the yy::location instance, and a way of storing our file input in memory. Fortunately, we’re not the only ones to have ever come across the issue of creating non-global state: the Bison documentation has a section in its C++ guide that describes a technique for manipulating state – “parsing context”, in their words. This technique involves the creation of a parsing driver.

The parsing driver is a class (or struct) that holds all the parse-related state. We can arrange for this class to be available to our tokenizing and parsing functions, which will allow us to use it pretty much like we’d use a global variable. This is the drv that we saw in YY_USER_ACTION. We can define it as follows:

From parse_driver.hpp, lines 36 through 54



36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54



class parse_driver {
    private:
        std::string file_name;
        yy::location location;
        definition_group* global_defs;
        file_mgr* file_m;

    public:
        parse_driver(
                file_mgr& mgr,
                definition_group& defs,
                const std::string& file)
            : file_name(file), file_m(&mgr), global_defs(&defs) {}

        bool operator()();
        yy::location& get_current_location();
        file_mgr& get_file_manager() const;
        definition_group& get_global_defs() const;
};

There aren’t many fields here. The file_name string represents the file that we’ll be reading code from. The location field will be accessed by Flex via get_current_location. Bison will store the function and data type definitions it reads into global_defs via get_global_defs. Finally, file_m will be used to keep track of the content of the file we’re reading, as well as the line offsets within that file. Notice that a couple of these fields are pointers that we take by reference in the constructor. The parse_driver doesn’t own the global definitions, nor the file manager. They exist outside of it, and will continue to be used in other ways the parse_driver does not need to know about. Also, the LOC variable in Flex is actually a call to get_current_location:

From scanner.l, line 15

15

#define LOC drv.get_current_location()

The methods of parse_driver are rather simple. The majority of them deals with giving access to the parser’s members: the yy::location, the definition_group, and the file_mgr. The only exception to this is operator(), which we use to actually trigger the parsing process. We’ll make this method return true if parsing succeeded, and false otherwise (if, say, the file we tried to read doesn’t exist). Here’s its implementation:

From parse_driver.cpp, lines 48 through 60



48
49
50
51
52
53
54
55
56
57
58
59
60



bool parse_driver::operator()() {
    FILE* stream = fopen(file_name.c_str(), "r");
    if(!stream) return false;
    yyscan_t scanner;
    yylex_init(&scanner);
    yyset_in(stream, scanner);
    yy::parser parser(scanner, *this);
    parser();
    yylex_destroy(scanner);
    fclose(stream);
    file_m->finalize();
    return true;
}

We try open the user-specified file, and return false if we can’t. After this, we start doing the setup specific to a reentrant Flex scanner. We declare a yyscan_t variable, which will contain all of Flex’s state. Then, we initialize it using yylex_init. Finally, since we can no longer touch the yyin global variable (it doesn’t exist), we have to resort to using a setter function provided by Flex to configure the tokenizer’s input stream.

Next, we construct our Bison-generated parser. Note that unlike before, we have to pass in two arguments: scanner and *this, the latter being of type parse_driver&. We’ll come back to how this works in a moment. With the scanner and parser initialized, we invoke parser::operator(), which actually runs the Flex- and Bison-generated code. To clean up, we run yylex_destroy and fclose. Finally, we call file_mgr::finalize, and return. But what is file_mgr?

The file_mgr class does two things: it stores the part of the file that has already been read by Flex in memory, and it keeps track of where each line in our source file begins within the text. Here is its definition:

From parse_driver.hpp, lines 14 through 34



14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34



class file_mgr {
    private:
        std::ostringstream string_stream;
        std::string file_contents;

        size_t file_offset;
        std::vector<size_t> line_offsets;
    public:
        file_mgr();

        void write(const char* buffer, size_t len);
        void mark_line();
        void finalize();

        size_t get_index(int line, int column) const;
        size_t get_line_end(int line) const;
        void print_location(
                std::ostream& stream,
                const yy::location& loc,
                bool highlight = true) const;
};

In this class, the string_stream member is used to construct an std::string from the bits of text that Flex reads, processes, and feeds to the file_mgr using the write method. It’s more efficient to use a string stream than to concatenate strings repeatedly. Once Flex is finished processing the file, the final contents of the string_stream are transferred into the file_contents string using the finalize method. The offset and line_offsets fields will be used as we described earlier: each time Flex encounters the \n character, the offset variable will pushed in top of the line_offsets vector, marking the beginning of the corresponding line. The methods of the class are as follows:

write will be called from Flex, and will allow us to record the content of the file we’re processing to the string_stream. We’ve already seen it used in the YY_USER_ACTION macro.
mark_line will also be called from Flex, and will mark the current file_offset as the beginning of a line by pushing it into line_offsets.
finalize will be called by the parse_driver when the parsing finishes. At this time, the string_stream should contain all of the input file, and this data is transferred to file_contents, as we mentioned above.
get_index and get_line_end will be used for converting yy::location instances to offsets within the source code buffer.
print_location will be used for printing errors. It will print the lines spanned by the given location, with the location itself colored and underlined if the last argument is true. This will make our errors easier on the eyes.

Let’s take a look at their implementations. First, write. For the most part, this method is a proxy for the write method of our string_stream:

From parse_driver.cpp, lines 9 through 12



 9
10
11
12



void file_mgr::write(const char* buf, size_t len) {
    string_stream.write(buf, len);
    file_offset += len;
}

We do, however, also keep track of the file_offset variable here, which ensures we have up-to-date information regarding our position in the source file. The implementation of mark_line uses this information:

From parse_driver.cpp, lines 14 through 16

14
15
16

void file_mgr::mark_line() {
    line_offsets.push_back(file_offset);
}

The finalize method is trivial, and requires little additional discussion:

From parse_driver.cpp, lines 18 through 20

18
19
20

void file_mgr::finalize() {
    file_contents = string_stream.str();
}

Once we have the line offsets, get_index becomes very simple:

From parse_driver.cpp, lines 22 through 25



22
23
24
25



size_t file_mgr::get_index(int line, int column) const {
    assert(line > 0 && line <= line_offsets.size());
    return line_offsets.at(line-1) + column - 1;
}

Here, we use an assertion for the first time. Calling get_index with a negative or zero line doesn’t make any sense, since Bison starts tracking line numbers at 1. Similarly, asking for a line for which we don’t have a recorded offset is invalid. Both of these nonsensical calls to get_index cannot be caused by the user under normal circumstances, and indicate the method’s misuse by the author of the compiler (us!). Thus, we terminate the program.

Finally, the implementation of line_end just finds the beginning of the next line. We stick to the C convention of marking ’end’ indices exclusive (pointing just past the end of the array):

From parse_driver.cpp, lines 27 through 30



27
28
29
30



size_t file_mgr::get_line_end(int line) const {
    if(line == line_offsets.size()) return file_contents.size();
    return get_index(line+1, 1);
}

Since line_offsets has as many elements as there are lines, the last line number would be equal to the vector’s size. When looking up the end of the last line, we can’t look for the beginning of the next line, so instead we return the end of the file.

Next, the print_location method prints three sections of the source file. These are the text “before” the error, the error itself, and, finally, the text “after” the error. For example, if an error began on the fifth column of the third line, and ended on the eighth column of the fourth line, the “before” section would include the first four columns of the third line, and the “after” section would be the ninth column onward on the fourth line. Before and after the error itself, if the highlight argument is true, we sprinkle the ANSI escape codes to enable and disable special formatting, respectively. For now, the special formatting involves underlining the text and making it red.

From parse_driver.cpp, lines 32 through 46



32
33
34
35
36
37
38
39
40
41
42
43
44
45
46



void file_mgr::print_location(
        std::ostream& stream,
        const yy::location& loc,
        bool highlight) const {
    size_t print_start = get_index(loc.begin.line, 1);
    size_t highlight_start = get_index(loc.begin.line, loc.begin.column);
    size_t highlight_end = get_index(loc.end.line, loc.end.column);
    size_t print_end = get_line_end(loc.end.line);
    const char* content = file_contents.c_str();
    stream.write(content + print_start, highlight_start - print_start);
    if(highlight) stream << "\033[4;31m";
    stream.write(content + highlight_start, highlight_end - highlight_start);
    if(highlight) stream << "\033[0m";
    stream.write(content + highlight_end, print_end - highlight_end);
}

Finally, to get the forward declarations for the yy* functions and types, we set the header-file option in Flex:

From scanner.l, line 3

3

%option header-file="scanner.hpp"

We also include this scanner.hpp file in our parse_driver.cpp:

From parse_driver.cpp, line 2

2

#include "scanner.hpp"

Adding the Driver to Flex and Bison

Bison’s C++ language template generates a class called yy::parser. We don’t really want to modify this class in any way: not only is it generated code, but it’s also rather complex. Instead, Bison provides us with a mechanism to pass more data in to the parser. This data is made available to all the actions that the parser runs. Better yet, Bison also attempts to pass this data on to the tokenizer, which in our case would mean that whatever data we provide Bison will also be available to Flex. This is how we’ll allow the two components to access our new parse_driver class. This is also how we’ll pass in the yyscan_t that Flex now needs to run its tokenizing code. To do all this, we use Bison’s %param option. I’m going to include a few more lines from parser.y, since they contain the necessary #include directives and a required type definition:

From parser.y, lines 1 through 18



 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18



%code requires {
#include <string>
#include <vector>
#include "ast.hpp"
#include "definition.hpp"
#include "parser.hpp"
#include "parsed_type.hpp"

class parse_driver;
using yyscan_t = void*;
}

%param { yyscan_t scanner }
%param { parse_driver& drv }

%code {
#include "parse_driver.hpp"
}

The %param option effectively adds the parameter listed between the curly braces to the constructor of the generated yy::parser. We’ve already seen this in the implementation of our driver, where we passed scanner and *this as arguments when creating the parser. The parameters we declare are also passed to the yylex function, which is expected to accept them in the same order.

Since we’re adding parse_driver as an argument we have to declare it. However, we can’t include the parse_driver header right away because parse_driver itself includes the parser header: we’d end up with a circular dependency. Instead, we resort to forward-declaring the driver class, as well as the yyscan_t structure containing Flex’s state.

Adding a parameter to Bison doesn’t automatically affect Flex. To let Flex know that its yylex function must now accept the state and the parsing driver, we have to define the YY_DECL macro. We do this in parse_driver.hpp, since this forward declaration will be used by both Flex and Bison:

From parse_driver.hpp, lines 56 through 58

56
57
58

#define YY_DECL yy::parser::symbol_type yylex(yyscan_t yyscanner, parse_driver& drv)

YY_DECL;

Improving Exceptions

Now, it’s time to add location data (and a little bit more) to our exceptions. We want to make it possible for exceptions to include data about where the error occurred, and to print this data to the user. However, it’s also possible for us to have exceptions that simply do not have that location data. Furthermore, we want to know whether or not an exception has an associated location; we’d rather not print an invalid or “default” location when an error occurs.

In the old days of programming, we could represent the absence of location data with a nullptr, or NULL. But not only does this approach expose us to all kind of NULl-safety bugs, but it also requires heap allocation! This doesn’t make it sound all that appealing; instead, I think we should opt for using std::optional.

Though std::optional is standard (as may be obvious from its namespace), it’s a rather recent addition to the C++ STL. In order to gain access to it, we need to ensure that our project is compiled using C++17. To this end, we add the following two lines to our CMakeLists.txt:

From CMakeLists.txt, lines 5 through 6

5
6

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

Now, let’s add a new base class for all of our compiler errors, unsurprisingly called compiler_error:

From error.hpp, lines 10 through 26



10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25



class compiler_error : std::exception {
    private:
        std::string description;
        maybe_location loc;

    public:
        compiler_error(std::string d, maybe_location l = std::nullopt)
            : description(std::move(d)), loc(std::move(l)) {}

        const char* what() const noexcept override;

        void print_about(std::ostream& to);
        void print_location(std::ostream& to, file_mgr& fm, bool highlight = false);

        void pretty_print(std::ostream& to, file_mgr& fm);
};

We’ll put some ‘common’ exception functionality into the print_location and print_about methods. If the error has an associated location, the former method will print that location to the screen. We don’t always want to highlight the part of the code that caused the error: for instance, an invalid data type definition may span several lines, and coloring that whole section of text red would be too much. To address this, we add the highlight boolean argument, which can be used to switch the colors on and off. The print_about method will simply print the what() message of the exception, in addition to the “specific” error that occurred (stored in description). Here are the implementations of the functions:

From error.cpp, lines 3 through 16



 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16



const char* compiler_error::what() const noexcept {
    return "an error occured while compiling the program";
}

void compiler_error::print_about(std::ostream& to) {
    to << what() << ": ";
    to << description << std::endl;
}

void compiler_error::print_location(std::ostream& to, file_mgr& fm, bool highlight) {
    if(!loc) return;
    to << "occuring on line " << loc->begin.line << ":" << std::endl;
    fm.print_location(to, *loc, highlight);
}

We will also add a pretty_print method to all of our exceptions. This method will handle all the exception-specific printing logic. For the generic compiler error, this means simply printing out the error text and the location:

From error.cpp, lines 18 through 21



18
19
20
21



void compiler_error::pretty_print(std::ostream& to, file_mgr& fm) {
    print_about(to);
    print_location(to, fm);
}

For type_error, this logic slightly changes, enabling colors when printing the location:

From error.cpp, lines 27 through 30



27
28
29
30



void type_error::pretty_print(std::ostream& to, file_mgr& fm) {
    print_about(to);
    print_location(to, fm, true);
}

Finally, for unification_error, we also include the code to print out the two types that our compiler could not unify:

From error.cpp, lines 32 through 41



32
33
34
35
36
37
38
39
40
41



void unification_error::pretty_print(std::ostream& to, file_mgr& fm, type_mgr& mgr) {
    type_error::pretty_print(to, fm);
    to << "the expected type was:" << std::endl;
    to << "  \033[34m";
    left->print(mgr, to);
    to << std::endl << "\033[0mwhile the actual type was:" << std::endl;
    to << "  \033[32m";
    right->print(mgr, to);
    to << "\033[0m" << std::endl;
}

There’s a subtle change here. Compared to the previous type-printing code (which we had in main), what we wrote here deals with “expected” and “actual” types. The left type passed to the exception is printed first, and is treat like the “correct” type. The right type, on the other hand, is treated like the “wrong” type that should have been unifiable with left. This will affect the calling conventions of our unification code.

Now, we can go through and find all the places where we throw 0. One such place was in the data type definition code, where declaring the same type parameter twice is invalid. We replace the 0 with a compiler_error:

From definition.cpp, lines 66 through 69



66
67
68
69



        if(var_set.find(var) != var_set.end())
            throw compiler_error(
                    "type variable " + var +
                    " used twice in data type definition.", loc);

Not all throw 0 statements should become exceptions. For example, here’s code from the previous version of the compiler:

From definition.cpp, lines 123 through 127



123
124
125
126
127



        for(auto& dependency : def_defn.second->nearby_variables) {
            if(defs_defn.find(dependency) == defs_defn.end())
                throw 0;
            dependency_graph.add_edge(def_defn.second->name, dependency);
        }

If a definition def_defn has a dependency on a “nearby” (declared in the same group) definition called dependency, and if dependency does not exist within the same definition group, we throw an exception. But this error is impossible for a user to trigger: the only reason for a variable to appear in the nearby_variables vector is that it was previously found in the definition group. Here’s the code that proves this (from the current version of the compiler):

From definition.cpp, lines 102 through 106



102
103
104
105
106



            if(defs_defn.find(free_var) == defs_defn.end()) {
                into.insert(free_var);
            } else {
                def_pair.second->nearby_variables.insert(free_var);
            }

Not being able to find the variable in the definition group is a compiler bug, and should never occur. So, instead of throwing an exception, we’ll use an assertion:

From definition.cpp, line 128

128

            assert(defs_defn.find(dependency) != defs_defn.end());

For more complicated error messages, we can use a stringstream. Here’s an example from parsed_type:

From parsed_type.cpp, lines 16 through 23



16
17
18
19
20
21
22
23



    if(base_type->arity != arguments.size()) {
        std::ostringstream error_stream;
        error_stream << "invalid application of type ";
        error_stream << name;
        error_stream << " (" << base_type->arity << " argument(s) expected, ";
        error_stream << "but " << arguments.size() << " provided)";
        throw type_error(error_stream.str());
    }

In general, this change is also rather mechanical. Before we move on, to maintain a balance between exceptions and assertions, here are a couple more assertions from type_env:

From type_env.cpp, lines 81 through 82

81
82

    assert(names_it != names.end());
    assert(names_it->second.type->forall.size() == 0);

Once again, it should not be possible for the compiler to try generalize the type of a variable that doesn’t exist, and nor should generalization occur twice.

While we’re on the topic of types, let’s talk about type_mgr::unify. In practice, I suspect that a lot of errors in our compiler will originate from this method. However, at present, this method does not in any way track the locations of where a unification error occurred. To fix this, we add a new loc parameter to unify, which we make optional to allow for unification without a known location. Here’s the declaration:

From type.hpp, line 92

92

        void unify(type_ptr l, type_ptr r, const std::optional<yy::location>& loc = std::nullopt);

The change to the implementation is mechanical and repetitive, so instead of showing you the whole method, I’ll settle for a couple of lines:

From type.cpp, lines 121 through 122

121
122

        unify(larr->left, rarr->left, loc);
        unify(larr->right, rarr->right, loc);

We want to make sure that a location provided to the top-level call to unify is also forwarded to the recursive calls, so we have to explicitly add it to the call.

We’ll also have to update the ‘main’ code to call the pretty_print methods, but there’s another big change that we’re going to make before then. However, once that change is made, our errors will look a lot better. Here is what’s printed out to the user when a type error occurs:

an error occured while checking the types of the program: failed to unify types
occuring on line 2:
    3 + False
the expected type was:
  Int
while the actual type was:
  Bool

Here’s an error that was previously a throw 0 statement in our code:

an error occured while compiling the program: type variable a used twice in data type definition.
occuring on line 1:
data Pair a a = { MkPair a a }

Now, not only have we eliminated the lazy uses of throw 0 in our code, but we’ve also improved the presentation of the errors to the user!

Rethinking Name Mangling

In the previous post, I said the following:

One more thing. Let’s adopt the convention of storing mangled names into the compilation environment. This way, rather than looking up mangled names only for global functions, which would be a ‘gotcha’ for anyone working on the compiler, we will always use the mangled names during compilation.

Now that I’ve had some more time to think about it (and now that I’ve returned to the compiler after a brief hiatus), I think that this was not the right call. Mangled names make sense when translating to LLVM; we certainly don’t want to declare two LLVM functions with the same name. [note: By the way, LLVM has its own name mangling functionality. If you declare two functions with the same name, they'll appear as function and function.0. Since LLVM uses the Function* C++ values to refer to functions, as long as we keep them seaprate on our end, things will work.

However, in our compiler, name mangling occurs before LLVM is introduced, at translation time. We could create LLVM functions at that time, too, and associate them with variables. But then, our G-machine instructions will be coupled to LLVM, which would not be as clean. ] But things are different for local variables. Our local variables are graphs on a stack, and are not actually compiled to LLVM definitions. It doesn’t make sense to mangle their names, since their names aren’t present anywhere in the final executable. It’s not even “consistent” to mangle them, since global definitions are compiled directly to PushGlobal instructions, while local variables are only referenced through the current env. So, I opted to reverse my decision. We will go back to placing variable names directly into env_var. Here’s an example of this from global_scope.cpp:

From global_scope.cpp, lines 6 through 8

6
7
8

    for(auto it = params.rbegin(); it != params.rend(); it++) {
        new_env = env_ptr(new env_var(*it, new_env));
    }

Now that we’ve started using assertions, I also think it’s worth to put our new invariant – “only global definitions have mangled names” – into code:

From type_env.cpp, lines 36 through 45



36
37
38
39
40
41
42
43
44
45



void type_env::set_mangled_name(const std::string& name, const std::string& mangled) {
    auto it = names.find(name);

    // Can't set mangled name for non-existent variable.
    assert(it != names.end());
    // Local names shouldn't need mangling.
    assert(it->second.vis == visibility::global);

    it->second.mangled_name = mangled;
}

Furthermore, we’ll require that a global definition has a mangled name. This way, we can be more confident that a variable from a PushGlobal instruction is referencing the right function. To achieve this, we change get_mangled_name to stop returning the input string if a mangled name was not found; doing so makes it impossible to check if a mangled name was explicitly defined. Instead, we add two assertions. First, if an environment scope doesn’t contain a variable, then it must have a parent. If it does contain variable, that variable must have a mangled name. We end up with the following:

From type_env.cpp, lines 47 through 55



47
48
49
50
51
52
53
54
55



const std::string& type_env::get_mangled_name(const std::string& name) const {
    auto it = names.find(name);
    if(it != names.end()) {
        assert(it->second.mangled_name);
        return *it->second.mangled_name;
    }
    assert(parent != nullptr);
    return parent->get_mangled_name(name);
}

For this to work, we make one more change. Now that we’ve enabled C++17, we have access to std::optional. We can thus represent the presence or absence of mangled names using an optional field, rather than with the empty string "". I hear that C++ compilers have pretty good empty string optimizations, but nonetheless, I think it makes more sense semantically to use “absent” (nullopt) instead of “empty” (""). Here’s the definition of type_env::variable_data now:

From type_env.hpp, lines 16 through 25



16
17
18
19
20
21
22
23
24
25



        struct variable_data {
            type_scheme_ptr type;
            visibility vis;
            std::optional<std::string> mangled_name;

            variable_data()
                : variable_data(nullptr, visibility::local, std::nullopt) {}
            variable_data(type_scheme_ptr t, visibility v, std::optional<std::string> n)
                : type(std::move(t)), vis(v), mangled_name(std::move(n)) {}
        };

Since looking up a mangled name for non-global variable will now result in an assertion failure, [note: A very wise human at the very dawn of our species once said, "make illegal states unrepresentable". Their friends and family were a little busy making a fire, and didn't really understand what the heck they meant. Now, we kind of do.

It's possible for our type_env to include a variable_data entry that is both global and has no mangled name. But it doesn't have to be this way. We could define two subclasses of variable_data, one global and one local, where only the global one has a mangled_name field. It would be impossible to reach this assertion failure then. ] we have to change ast_lid::compile to only call get_mangled_name once it ensures that the variable being compiled is, in fact, global:

From ast.cpp, lines 58 through 63



58
59
60
61
62
63



void ast_lid::compile(const env_ptr& env, std::vector<instruction_ptr>& into) const {
    into.push_back(instruction_ptr(
        (this->env->is_global(id)) ?
            (instruction*) new instruction_pushglobal(this->env->get_mangled_name(id)) :
            (instruction*) new instruction_push(env->get_offset(id))));
}

Since all global functions now need to have mangled names, we run into a bit of a problem. What are the mangled names of (+), (-), and so on? We could continue to hardcode them as plus, minus, etc., but this can (and currently does!) lead to errors. Consider the following piece of code:

defn plus x y = { x + y }
defn main = { plus 320 6 }

We’ve hardcoded the mangled name of (+) to be plus. However, global_scope doesn’t know about this, so when the actual plus function gets translated, it also gets assigned the mangled name plus. The name is also overwritten in the llvm_context, which effectively means that (+) is now compiled to a call of the user-defined plus function. If we didn’t overwrite the name, we would’ve run into an assertion failure in this scenario anyway. In short, this example illustrates an important point: mangling information needs to be available outside of a global_scope. We don’t want to do this by having every function take in a global_scope to access the mangling information; instead, we’ll store the mangling information in a new mangler class, which global_scope will take as an argument. The new class is very simple:

From mangler.hpp, lines 5 through 11



 5
 6
 7
 8
 9
10
11



class mangler {
    private:
        std::map<std::string, int> occurence_count;

    public:
        std::string new_mangled_name(const std::string& str);
};

As with parse_driver, global_scope takes mangler by reference and stores a pointer:

From global_scope.hpp, line 50

50

        global_scope(mangler& m) : mng(&m) {}

The implementation of new_mangled_name doesn’t change, so I’m not going to show it here. With this new mangling information in hand, we can now correctly set the mangled names of binary operators:

From compiler.cpp, lines 22 through 27



22
23
24
25
26



void compiler::add_binop_type(binop op, type_ptr type) {
    auto name = mng.new_mangled_name(op_action(op));
    global_env->bind(op_name(op), std::move(type), visibility::global);
    global_env->set_mangled_name(op_name(op), name);
}

Wait a moment, what’s a compiler? Let’s talk about that next.

A Top-Level Class

Now that we’ve moved name mangling out of global_scope, we have to put it somewhere. The same goes for global definition group and the file manager that are given to parse_driver. The two classes make use of the other data, but they don’t own it. That’s why they take it by reference, and store it as a pointer. They’re just temporarily allowed access.

So, what should be the owner of all of these disparate components? Thus far, that has been the main function, or the utility functions that it calls out to. However, this is sloppy: we have related data and operations on it, but we don’t group them into an object. We can group all of the components of our compiler into a compiler object, and leave main.cpp with exception printing code.

The definition of the compiler class begins with all of the data structures that we use in the process of compilation:

From compiler.hpp, lines 12 through 20



12
13
14
15
16
17
18
19
20



    private:
        file_mgr file_m;
        definition_group global_defs;
        parse_driver driver;
        type_env_ptr global_env;
        type_mgr type_m;
        mangler mng;
        global_scope global_scp;
        llvm_context ctx;

There’s a loose ordering to these fields. In C++, class members are initialized in the order they are declared; we therefore want to make sure that fields that are depended on by other fields are initialized first. Otherwise, I tried to keep the order consistent with the conceptual path of the code through the compiler.

Parsing happens first, so we begin with parse_driver, which needs a file_manager (to populate with line information) and a definition_group (to receive the global definitions from the parser).
We then proceed to typechecking, for which we use a global type_env_ptr (to define the built-in functions and constructors) and a type_mgr (to manage the assignments of type variables).
Once a program is typechecked, we transform it, eliminating local function definitions and lambda functions. This is done by storing newly-emitted global functions into the global_scope, which requires a mangler to generate new names for the target functions.
Finally, to generate LLVM IR, we need our llvm_context class.

The methods of the compiler are arranged similarly:

From compiler.hpp, lines 22 through 31



22
23
24
25
26
27
28
29
30
31



        void add_default_types();
        void add_binop_type(binop op, type_ptr type);
        void add_default_function_types();
        void parse();
        void typecheck();
        void translate();
        void compile();
        void create_llvm_binop(binop op);
        void generate_llvm();
        void output_llvm(const std::string& into);

The methods go as follows:

add_default_types adds the built-in types to the global_env. At this point, these types only include Int.
add_binop_type adds a single binary operator to the global type environment. We saw its implementation earlier: it deals with both binding a type, and setting a mangled name.
add_default_types adds the types for each binary operator.
parse, typecheck, translate and compile all do exactly what they say. In this case, compilation refers to creating G-machine instructions.
create_llvm_binop creates an internal function that forces the evaluation of its two arguments, and actually applies the given binary operator. Recall that the (+) in user code constructs a call to this function, but leaves it unevaluated until it’s needed.
generate_llvm converts all the definitions in global_scope, which are at this point compiled into G-machine instructions, into LLVM IR.
output_llvm contains all the code to actually generate an object file from the LLVM IR.

These functions are mostly taken from part 12’s main.cpp, and adjusted to use the compiler’s members rather than local definitions or arguments. You should compare part 12’s main.cpp file with the compiler.cpp file that we end up with at the end of this post.

Next, we have the compiler’s constructor, and its operator(). The latter, analogously to our parsing driver, will trigger the compilation process. Their implementations are straightforward:

From compiler.cpp, lines 131 through 145



131
132
133
134
135
136
137
138
139
140
141
142
143
144
145



compiler::compiler(const std::string& filename)
    : file_m(), global_defs(), driver(file_m, global_defs, filename),
      global_env(new type_env), type_m(), mng(), global_scp(mng), ctx() {
    add_default_types();
    add_default_function_types();
}

void compiler::operator()(const std::string& into) {
    parse();
    typecheck();
    translate();
    compile();
    generate_llvm();
    output_llvm(into);
}

We also add a couple of methods to give external code access to some of the compiler’s data structures. I omit their (trivial) implementations, but they have the following signatures:

From compiler.hpp, lines 35 through 36

35
36

        file_mgr& get_file_manager();
        type_mgr& get_type_manager();

With all the compilation code tucked into our new compiler class, main becomes very simple. We also finally get to use our exception pretty printing code:

From main.cpp, lines 11 through 27



11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27



int main(int argc, char** argv) {
    if(argc != 2) {
        std::cerr << "please enter a file to compile." << std::endl;
        exit(1);
    }
    compiler cmp(argv[1]);

    try {
        cmp("program.o");
    } catch(unification_error& err) {
        err.pretty_print(std::cerr, cmp.get_file_manager(), cmp.get_type_manager());
    } catch(type_error& err) {
        err.pretty_print(std::cerr, cmp.get_file_manager());
    } catch (compiler_error& err) {
        err.pretty_print(std::cerr, cmp.get_file_manager());
    }
}

With this, we complete our transition to a compiler object. All that’s left is to clean up the code style.

Keeping Things Private

Hand-writing or generating hundreds of trivial getters and setters for the fields of a data class (which is standard in the world of Java) seems absurd to me. So, for most of this project, I stuck with structs, rather than classes. But this is not a good policy to apply everywhere. I still think it makes sense to make data structures like ast and type public-by-default; however, I don’t think that way about classes like type_mgr, llvm_context, type_env, and env. All of these have information that we should never be accessing directly. Some guard this information with assertions. In short, it should be protected.

For most classes, the changes are mechanical. For instance, we can make type_env a class simply by changing its declaration, and marking all of its functions public. This requires a slight refactoring of a line that used its parent field. Here’s what it used to be (in context):

From main.cpp, lines 57 through 60



57
58
59
60



    for(auto& defn : group.defs_defn) {
        auto& function = defn.second->into_global(scope);
        function.body->env->parent->set_mangled_name(defn.first, function.name);
    }

And here’s what it is now:

From compiler.cpp, lines 55 through 58



55
56
57
58



    for(auto& defn : global_defs.defs_defn) {
        auto& function = defn.second->into_global(global_scp);
        defn.second->env->set_mangled_name(defn.first, function.name);
    }

Rather than traversing the chain of environments from the body of the definition, we just use the definition’s own env_ptr. This is cleaner and more explicit, and it helps us not use the private members of type_env!

The deal with env is about as simple. We just make it and its two descendants classes, and mark their methods and constructors public. The same goes for global_scope. To make type_mgr a class, we have to add a new method: lookup. Here’s its implementation:

From type.cpp, lines 81 through 85



81
82
83
84
85



type_ptr type_mgr::lookup(const std::string& var) const {
    auto types_it = types.find(var);
    if(types_it != types.end()) return types_it->second;
    return nullptr;
}

It’s used in type_var::print as follows:

From type.cpp, lines 28 through 35



28
29
30
31
32
33
34
35



void type_var::print(const type_mgr& mgr, std::ostream& to) const {
    auto type = mgr.lookup(name);
    if(type) {
        type->print(mgr, to);
    } else {
        to << name;
    }
}

We can’t use resolve here because it takes (and returns) a type_ptr. If we make it take a type*, it won’t be able to return its argument if it’s already resolved. If we allow it to return type*, we won’t have an owning reference. We also don’t want to duplicate the method just for this one call. Notice, though, how similar type_var::print/lookup and resolve are in terms of execution.

The change for llvm_context requires a little more work. Right now, ctx.builder is used a lot in instruction.cpp. Since we don’t want to forward each of the LLVM builder methods, and since it feels weird to make llvm_context extend llvm::IRBuilder, we’ll just provide a getter for the builder field. The same goes for module:

From llvm_context.hpp, lines 46 through 47

46
47

        llvm::IRBuilder<>& get_builder();
        llvm::Module& get_module();

Here’s what some of the code from instruction.cpp looks like now:

From instruction.cpp, lines 144 through 145

144
145

        case PLUS: result = ctx.get_builder().CreateAdd(left_int, right_int); break;
        case MINUS: result = ctx.get_builder().CreateSub(left_int, right_int); break;

Right now, the ctx field of the llvm_context (which contains the llvm::LLVMContext) is only externally used to create instances of llvm::BasicBlock. We’ll add a proxy method for this functionality:

From llvm_context.cpp, lines 174 through 176

174
175
176

BasicBlock* llvm_context::create_basic_block(const std::string& name, llvm::Function* f) {
    return BasicBlock::Create(ctx, name, f);
}

Finally, instruction_pushglobal needs to access the llvm::Function instances that we create in the process of compilation. We add a new get_custom_function method to support this, which automatically prefixes the function name with f_, much like create_custom_function:

From llvm_context.cpp, lines 292 through 294

292
293
294

llvm_context::custom_function& llvm_context::get_custom_function(const std::string& name) {
    return *custom_functions.at("f_" + name);
}

I think that’s enough. If we chose to turn more compiler data structures into classes, I think we would’ve quickly drowned in one-line getter and setter methods.

That’s all for the cleanup! We’ve added locations and more errors to the compiler, stopped throwing 0 in favor of proper exceptions or assertions, made name mangling more reasonable, fixed a bug with accidentally shadowing default functions, organized our compilation process into a compiler class, and made more things into classes. In the next post, I hope to tackle strings and Input/Output. I also think that implementing modules would be a good idea, though at the moment I don’t know too much on the subject. I hope you’ll join me in my future writing!

Appendix: Optimization

When I started working on the compiler after the previous post, I went a little overboard. I started working on optimizing the generated programs, but eventually decided I wasn’t doing a good enough [note: I think authors should feel a certain degree of responsibility for the content they create. If I do something badly, somebody else trusts me and learns from it, who knows how much damage I've done. I try not to do damage.

If anyone reads what I write, anyway! ] job to present it to others, and scrapped that part of the compiler altogether. I’m not sure if I will try again in the near future. But, if you’re curious about optimization, here are a few avenues I’ve explored or thought about:

Unboxing numbers. Right now, numbers are allocated and garbage collected just like the rest of the graph nodes. This is far from ideal. We could use pointers to represent numbers, by tagging their most significant bits on 64-bit CPUs. Rather than allocating a node, the runtime will just cast a number to a pointer, tag it, and push it on the stack.
Converting enumeration data types to numbers. If no constructor of a data type takes any arguments, then the tag uniquely identifies each constructor. Combined with unboxed numbers, this can save unnecessary allocations and memory accesses.
Special treatment for global constants. It makes sense for global functions to be converted into LLVM functions, but the same is not the case for constants. [note: Yeah, yeah, a constant is just a nullary function. Get out of here with your pedantry! ] We can find a way to initialize global constants once, which would save some work. To make more constants suitable for this, we could employ monomorphism restriction.
Optimizing stack operations. If you read through the LLVM IR we produce, you can see a lot of code that peeks at something twice, or pops-then-pushes the same value, or does other absurd things. LLVM isn’t aware of the semantics of our stacks, but perhaps we could write an optimization pass to deal with some of the more blatant instances of this issue.

If you attempt any of these, let me know how it goes, please!

Daniel's Blog