Compiling a Functional Language Using C++, Part 4

Table of Contents

We’ve done quite a big push in the previous post. We defined type rules for our language, implemented unification, and then implemented unification to enforce these rules for our program. The post was pretty long, and even then we weren’t able to fit quite everything into it.

For instance, we threw 0 whenever an error occured. This gives us no indication of what actually went wrong. We should probably define an exception class, one that can contain information about the error, and report it to the user.

Also, when there’s no error, our compiler doesn’t really tell us anything at all about the code besides the number of definitions. We probably want to see the types of these definitions, or at least some intermediate information. At the very least, we want to have the ability to see this information.

Finally, we have no build system. We are creating more and more source files, and so far (unless you’ve taken initiative), we’ve been compiling them by hand. We want to only compile source files that have changed, and we want to have a standard definition of how to build our program.

Printing Syntax Trees

Let’s start by printing the trees we get from our parser. This is long overdue - we had no way to verify the structure of what our parser returned to us since Part 2. We’ll print the trees top-down, with the children of a node indent one block further than the node itself. For this, we’ll make a new virtual function with the signature:

virtual void print(int indent, std::ostream& to) const;

We’ll include a similar printing function into our pattern struct, too:

virtual void print(std::ostream& to) const;

Let’s take a look at the implementation. For ast_int, ast_lid, and ast_uid:

From ast.cpp, lines 19 through 22



19
20
21
22



void ast_int::print(int indent, std::ostream& to) const {
    print_indent(indent, to);
    to << "INT: " << value << std::endl;
}

From ast.cpp, lines 28 through 31



28
29
30
31



void ast_lid::print(int indent, std::ostream& to) const {
    print_indent(indent, to);
    to << "LID: " << id << std::endl;
}

From ast.cpp, lines 37 through 40



37
38
39
40



void ast_uid::print(int indent, std::ostream& to) const {
    print_indent(indent, to);
    to << "UID: " << id << std::endl;
}

With ast_binop things get a bit more interesting. We call print recursively on the children of the binop node:

From ast.cpp, lines 46 through 51



46
47
48
49
50
51



void ast_binop::print(int indent, std::ostream& to) const {
    print_indent(indent, to);
    to << "BINOP: " << op_name(op) << std::endl;
    left->print(indent + 1, to);
    right->print(indent + 1, to);
}

The same idea for ast_app:

From ast.cpp, lines 67 through 72



67
68
69
70
71
72



void ast_app::print(int indent, std::ostream& to) const {
    print_indent(indent, to);
    to << "APP:" << std::endl;
    left->print(indent + 1, to);
    right->print(indent + 1, to);
}

Finally, just like ast_case::typecheck called pattern::match, ast_case::print calls pattern::print:

From ast.cpp, lines 84 through 93



84
85
86
87
88
89
90
91
92
93



void ast_case::print(int indent, std::ostream& to) const {
    print_indent(indent, to);
    to << "CASE: " << std::endl;
    for(auto& branch : branches) {
        print_indent(indent + 1, to);
        branch->pat->print(to);
        to << std::endl;
        branch->expr->print(indent + 2, to);
    }
}

We follow the same implementation strategy for patterns, but we don’t need indentation, or recursion:

From ast.cpp, lines 115 through 117

115
116
117

void pattern_var::print(std::ostream& to) const {
    to << var;
}

From ast.cpp, lines 123 through 128



123
124
125
126
127
128



void pattern_constr::print(std::ostream& to) const {
    to << constr;
    for(auto& param : params) {
        to << " " << param;
    }
}

In main, let’s print the bodies of each function we receive from the parser:

From main.cpp, lines 47 through 56



47
48
49
50
51
52
53
54
55
56



    for(auto& definition : program) {
        definition_defn* def = dynamic_cast<definition_defn*>(definition.get());
        if(!def) continue;

        std::cout << def->name;
        for(auto& param : def->params) std::cout << " " << param;
        std::cout << ":" << std::endl;

        def->body->print(1, std::cout);
    }

Printing Types

Types are another thing that we want to be able to inspect, so let’s add a similar print method to them:

virtual void print(const type_mgr& mgr, std::ostream& to) const;

We need the type manager so we can follow substitutions. The implementation is simple enough:

From type.cpp, lines 6 through 24



 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24



void type_var::print(const type_mgr& mgr, std::ostream& to) const {
    auto it = mgr.types.find(name);
    if(it != mgr.types.end()) {
        it->second->print(mgr, to);
    } else {
        to << name;
    }
}

void type_base::print(const type_mgr& mgr, std::ostream& to) const {
    to << name;
}

void type_arr::print(const type_mgr& mgr, std::ostream& to) const {
    left->print(mgr, to);
    to << " -> (";
    right->print(mgr, to);
    to << ")";
}

Let’s also print out the types we infer. We’ll make it a separate loop at the bottom of the typecheck_program function, because it’s mostly just for debugging purposes:

From main.cpp, lines 34 through 38



34
35
36
37
38



    for(auto& pair : env.names) {
        std::cout << pair.first << ": ";
        pair.second->print(mgr, std::cout);
        std::cout << std::endl;
    }

Fixing Bugs

We actually discover not one, but two bugs in our implementation thanks to the output we get from printing trees and types. Observe the output for works3.txt:

length l:
  CASE:
    Nil
      INT: 0
*: Int -> (Int -> (Int))
+: Int -> (Int -> (Int))
-: Int -> (Int -> (Int))
/: Int -> (Int -> (Int))
Cons: List -> (Int -> (List))
Nil: List
length: List -> (Int)
2

First, we’re missing the Cons branch. The culprit is parser.y, specifically this line:

    : branches branch { $$ = std::move($1); $1.push_back(std::move($2)); }

Notice that we move our list of branches out of $1. However, when we push_back, we use $1 again. That’s wrong! We need to push_back to $$ instead:

From parser.y, line 110

110

    : branches branch { $$ = std::move($1); $$.push_back(std::move($2)); }

Next, observe that Cons has type List -> Int -> List. That’s not right, since Int comes first in our definition. The culprit is this fragment of code:

        for(auto& type_name : constructor->types) {
            type_ptr type = type_ptr(new type_base(type_name));
            full_type = type_ptr(new type_arr(type, full_type));
        }

Remember how we build the function type backwards in Part 3? We have to do the same here. We replace the fragment with the proper reverse iteration:

From definition.cpp, lines 37 through 40



37
38
39
40



        for(auto it = constructor->types.rbegin(); it != constructor->types.rend(); it++) {
            type_ptr type = type_ptr(new type_base(*it));
            full_type = type_ptr(new type_arr(type, full_type));
        }

Throwing Exceptions

Throwing 0 is never a good idea. Such an exception doesn’t contain any information that we may find useful in debugging, nor any information that would benefit the users of the compiler. Instead, let’s define our own exception classes, and throw them instead. We’ll make two:

From error.hpp, entire file



 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21



#pragma once
#include <exception>
#include "type.hpp"

struct type_error : std::exception {
    std::string description;

    type_error(std::string d)
        : description(std::move(d)) {}

    const char* what() const noexcept override;
};

struct unification_error : public type_error {
    type_ptr left;
    type_ptr right;

    unification_error(type_ptr l, type_ptr r)
        : left(std::move(l)), right(std::move(r)), 
            type_error("failed to unify types") {}
};

Only one function needs to be implemented, and it’s pretty boring:

From error.cpp, entire file

1
2
3
4
5

#include "error.hpp"

const char* type_error::what() const noexcept {
    return "an error occured while checking the types of the program";
}

It’s time to throw these instead of 0. Let’s take a look at the places we do so.

First, we throw 0 in type.cpp, in the type_mgr::unify method. This is where our unification_error comes in. The error will contain the two types that we failed to unify, which we will later report to the user:

From type.cpp, line 91

91

    throw unification_error(l, r);

Next up, we have a few throws in ast.cpp. The first is in op_string, but we will simply replace it with return "??", which will be caught later on (either way, the case expression falling through would be a compiler bug, since the user has no way of providing an invalid binary operator). The first throw we need to address is in ast_binop::typecheck, in the case that we don’t find a type for a binary operator. We report this directly:

From ast.cpp, line 57

57

    if(!ftype) throw type_error(std::string("unknown binary operator ") + op_name(op));

We will introduce a new exception into ast_case::typecheck. Previously, we simply pass the type of the expression to be case analyzed into the pattern matching method. However, since we don’t want case analysis on functions, we ensure that the type of the expression is type_base. If not, we report this:

From ast.cpp, lines 107 through 110



107
108
109
110



    case_type = mgr.resolve(case_type, var);
    if(!dynamic_cast<type_base*>(case_type.get())) {
        throw type_error("attempting case analysis of non-data type");
    }

The next exception is in pattern_constr::match. It occurs when the pattern has a constructor we don’t recognize, and that’s exactly what we report:

From ast.cpp, lines 132 through 134

132
133
134

    if(!constructor_type) {
        throw type_error(std::string("pattern using unknown constructor ") + constr);
    }

The next exception occurs in a loop, when we bind types for each of the constructor pattern’s variables. We throw when we are unable to cast the remaining constructor type to a type_arr. Conceptually, this means that the pattern wants to apply the constructor to more parameters than it actually takes:

From ast.cpp, line 138

138

        if(!arr) throw type_error("too many parameters in constructor pattern");

We remove the last throw at the bottom of pattern_constr::match. This is because once unification succeeds, we know that the return type of the pattern is a base type since we know the type of the case expression is a base type (we know this because we added that check to ast_case::typecheck).

Finally, let’s catch and report these exceptions. We could do it in typecheck_program, but I think doing so in main is neater. Since printing types requires a type_mgr, we’ll move the declarations of both type_mgr and type_env to the top of main, and pass them to typecheck_program as parameters. Then, we can surround the call to typecheck_program with try/catch:

From main.cpp, lines 57 through 69



57
58
59
60
61
62
63
64
65
66
67
68
69



    try {
        typecheck_program(program, mgr, env);
    } catch(unification_error& err) {
        std::cout << "failed to unify types: " << std::endl;
        std::cout << "  (1) \033[34m";
        err.left->print(mgr, std::cout);
        std::cout << "\033[0m" << std::endl;
        std::cout << "  (2) \033[32m";
        err.right->print(mgr, std::cout);
        std::cout << "\033[0m" << std::endl;
    } catch(type_error& err) {
        std::cout << "failed to type check program: " << err.description << std::endl;
    }

We use some ANSI escape codes to color the types in the case of a unification error.

Setting up CMake

We will set up CMake as our build system. This would be extremely easy if not for Flex and Bison, but it’s not hard either way. We start with the usual:

From CMakeLists.txt, lines 1 through 2

1
2

cmake_minimum_required(VERSION 3.1)
project(compiler)

Next, we want to set up Flex and Bison. CMake provides two commands for this:

From CMakeLists.txt, lines 4 through 5

4
5

find_package(BISON)
find_package(FLEX)

We now have access to commands that allow us to tell CMake about our parser and tokenizer (or scanner). We use them as follows:

From CMakeLists.txt, lines 6 through 12



 6
 7
 8
 9
10
11
12



bison_target(parser
    ${CMAKE_CURRENT_SOURCE_DIR}/parser.y
    ${CMAKE_CURRENT_BINARY_DIR}/parser.cpp
    COMPILE_FLAGS "-d")
flex_target(scanner
    ${CMAKE_CURRENT_SOURCE_DIR}/scanner.l
    ${CMAKE_CURRENT_BINARY_DIR}/scanner.cpp)

We also want CMake to know that the scanner needs to parser’s header file in order to compile. We add this dependency:

From CMakeLists.txt, line 13

13

add_flex_bison_dependency(scanner parser)

Finally, we add our source code to a CMake target. We use the BISON_parser_OUTPUTS and FLEX_scanner_OUTPUTS to pass in the source files generated by Flex and Bison.

From CMakeLists.txt, lines 15 through 23



15
16
17
18
19
20
21
22
23



add_executable(compiler
    ast.cpp ast.hpp definition.cpp
    env.cpp env.hpp
    type.cpp type.hpp
    error.cpp error.hpp
    ${BISON_parser_OUTPUTS}
    ${FLEX_scanner_OUTPUTS}
    main.cpp
)

Almost there! parser.cpp will be generated in the build directory during an out-of-source build, and so will parser.hpp. When building, parser.cpp will try to look for ast.hpp, and main.cpp will look for parser.hpp. We want them to be able to find each other, so we add both the source directory and the build (binary) directory to the list of include directories:

From CMakeLists.txt, lines 24 through 25

24
25

target_include_directories(compiler PUBLIC ${CMAKE_CURRENT_SOURCE_DIR})
target_include_directories(compiler PUBLIC ${CMAKE_CURRENT_BINARY_DIR})

That’s it for CMake! Let’s try our build:

cmake -S . -B build
cd build && make -j8

Updated Code

We’ve made a lot of changes to the codebase, and I’ve only shown snippets of the code so far. If you’de like to see the whole codebase, you can go to my site’s git repository and check out the code so far.

Having taken this little break, it’s time for our next push. We will define how our programs will be evaluated in Part 5 - Execution.

Daniel's Blog