I think tokenizing and parsing are boring. The thing is, looking at syntax is a pretty shallow measure of how interesting a language is. It’s like the cover of a book. Every language has one, and it so happens that to make our “book”, we need to start with making the cover. But the content of the book is what matters, and that’s where we’ve arrived now. We must make decisions about our language, and give meaning to programs written in it. But before we can give our programs meaning, we need to make sense of the current domain of programs that we receive from our parser. Let’s consider a few wonderful examples.

defn main = { plus 320 6 }
defn plus x y = { x + y }

This is a valid program, as far as we’re concerned. But are all programs we get from the parser valid? See for yourself:

data Bool = { True, False }
defn main = { 3 + True }

Obviously, that’s not right. The parser accepts it - this matches our grammar. But giving meaning to this program is not easy, since we have no clear way of adding 3 and some data type. Similarly:

defn main = { 1 2 3 4 5 }

What is this? It’s a sequence of applications, starting with 1 2. Numbers are not functions. Their type is Int, not Int -> a. I can’t even think of a type numbers would need to have for this program to be valid (though perhaps one could come up with one).

Before we give meaning to programs in our language, we’ll need to toss away the ones that don’t make sense. To do so, we will use type checking. During the process of type checking, we will collect information about various parts of our abstract syntax trees, classifying them by the types of values they create. Using this information, we’ll be able to throw away blatantly incorrect programs.

Basic Type Checking

You may have noticed in the very first post that I have chosen to avoid polymorphism. This will simplify our type checking algorithm. If a more robust algorithm is desired, take a look at the Hindley-Milner type system. Personally, I enjoyed this section of Write You a Haskell, which I used to sanity check this very post.

Let’s start with the types of constants - those are pretty obvious. The constant 3 is an integer, and we shall mark it as such: 3 :: Int. Variables seem like the natural next step, but they’re fairly different. Without outside knowledge, we can’t tell what type a variable x is. If we know more information, like the fact that x was declared to be an integer, we can instead say that. This tells us that throughout type checking we’ll have to keep some kind of record of names and their associated types.

Next, let’s take a look at functions, which are admittedly more interesting than the previous two examples. I’m not talking about the case of seeing something like a function name f. This is the same as the variable case - we don’t even know it’s a function unless there is context, and if there is context, then that context is probably the most useful information we have. I’m talking about something like the application of a function to another value, in the form f x. In this case, we know that for the program to be valid, f must have the type a -> b, a function from something to something. Furthermore, since f is being applied to x, the type of x (let’s call it c) must be the same as the type a.

Our rules are getting more complicated. In order to check that they hold, we will use unification. This means that we’ll be creating various equations (such as “the type of f is equal to a -> b”), and finding substitutions to solve these equations. If we’re unable to find a solution to our equations, the program is invalid and we throw it away.
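
For instance, the f x example above gives rise to equations like the following (a sketch, using the placeholder types a, b, and c from the paragraph above):

f :: a -> b
x :: c
c = a

If we additionally knew that x is a number, we would also have the equation c = Int, and the substitutions c = Int and a = Int would solve the system, telling us that f :: Int -> b.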

Basic Examples

Let’s try an example. We’ll try to determine the type of the following expression, and extract any other information from this expression that we might use later.

foo 320 6

In our parse tree, this will be represented as (foo 320) 6, meaning that the outermost application will be at the top. Let’s assume we know only that foo is defined.

To figure out the type of the application, we will need to know the types of the thing being applied, and the thing that serves as the argument. The latter is easy: the type of 6 is Int. Before we proceed into the left child of the application, there’s one more piece of information we can deduce: since foo 320 is applied to an argument, it has to be of type a -> b.

Let’s proceed to the left child. It’s another application, this time of foo to 320. Again, the right child is simple: the type of 320 is Int. Again, we know that foo has to have a type c -> d (we’re using different variable names since the types of foo and foo 320 are not the same).

Now, we need to combine the pieces of information that we have. Since foo :: c -> d, we know that its first parameter must be of type c. We also know that its first parameter is of type Int. The only way for both of these to be true is for c = Int. This also tells us that foo :: Int -> d. Finally, since foo has now been applied to its first argument, we know that foo 320 :: d.

We’ve done what we can from this innermost application; it’s time to return to the outermost one. We now know that the left child is of type d, and that it also has to be of type a -> b. The only way for this to be true is for d = a -> b. So, foo 320 is a function from a to b. Again, we can conclude the first parameter has to be of type a. We also know that the first parameter is of type Int. Clearly, this means that a = Int. After the application, we know that the whole expression has some type b.

Let’s revisit what we know about foo. Last time we checked in on it, foo was of type Int -> d. But since we know that d = a -> b, and that a = Int, we can now say that foo :: Int -> Int -> b.
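
To recap, here is the chain of equations and substitutions we just worked through:

foo :: c -> d              (foo is applied to 320)
320 :: Int                 so c = Int, and foo 320 :: d
foo 320 :: a -> b          (foo 320 is applied to 6), so d = a -> b
6 :: Int                   so a = Int
foo :: Int -> Int -> b
foo 320 6 :: b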

We haven’t found any issues with the expression, and we learned some new information about the type of foo. Awesome!

Let’s apply this to a simplified example from the beginning of this post. Let’s check the type of:

1 2

Once again, the application is what we see first. The right child of the application, just like in the previous example, is Int. We also know that since 1 is being applied as a function, its type must be a -> b. However, we also know that the left child, being a number, is also of type Int! There’s no way to combine a -> b with Int, and thus, there is no solution we can find for the type of 1 2. This means our program is invalid. We can toss it away, give an error, and exit.
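
Written out as equations, the contradiction is immediate:

1 :: Int           (1 is a number)
1 :: a -> b        (1 is applied to 2)
Int = a -> b       (no substitution can solve this)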

Some Notation

If you go to the Wikipedia page on the Hindley-Milner type system, you will see quite a lot of symbols and Greek letters. This is a good thing, because the way that I presented to you the rules for figuring out types of expressions is very verbose. You have to read several paragraphs of text, and that’s only for the first three logical rules! If you’re anything like me, you want to be able to read just the important parts, and with some notation, I’ll be able to show you these important parts concisely, while continuing to explain the rules in detail in paragraphs of text.

Let’s start with inference rules. An inference rule is an expression in the form:

\frac{A_1 \ldots A_n}{B_1 \ldots B_m}

This reads, “given that the premises A_1 through A_n are true, it holds that the conclusions B_1 through B_m are true”.

For example, we can have the following inference rule:

\frac{\text{if it's cold, I wear a jacket} \quad \text{it's cold}}{\text{I wear a jacket}}

Since you wear a jacket when it’s cold, and it’s cold, we can conclude that you will wear a jacket.

When talking about type systems, it’s common to represent a type with \tau. The letter, which is the Greek character “tau”, is used as a placeholder for some concrete type. It’s kind of like a template, to be filled in with an actual value. When we plug an actual value into a rule containing \tau, we say we are instantiating it. Similarly, we will use e to serve as a placeholder for an expression (matched by our A_{add} grammar rule from part 2). Next, we have the typing relation, written as e : \tau. This says that “expression e has the type \tau”.

Alright, this is enough to get us started with some typing rules. Let’s start with one for numbers. If we define n to mean “any expression that is just a number, like 3, 2, 6, etc.”, we can write the typing rule as follows:

\frac{}{n : \text{Int}}

There’s nothing above the line - there are no premises that are needed for a number to have the type Int.

Now, let’s move on to the rule for function application:

\frac{e_1 : \tau_1 \rightarrow \tau_2 \quad e_2 : \tau_1}{e_1 \; e_2 : \tau_2}

This rule includes everything we’ve said before: the thing being applied has to have a function type (\tau_1 \rightarrow \tau_2), and the expression the function is applied to has to have the same type \tau_1 as the left type of the function.

It’s the variable rule that forces us to adjust our notation. Our rules don’t take into account the context that we’ve already discussed, and thus, we can’t bring in any outside information. Let’s fix that! It’s convention to use the symbol \Gamma for the context. We then add notation to say, “using the context \Gamma, we can deduce that e has type \tau”. We will write this as \Gamma \vdash e : \tau.

But what is our context? We can think of it as a mapping from variable names to their known types. We can represent such a mapping using a set of pairs in the form x : \tau, where x represents a variable name.

Since \Gamma is just a regular set, we can write x : \tau \in \Gamma, meaning that in the current context, it is known that x has the type \tau.
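
For example, a context in which we know the type of x, as well as the type of the plus function from the beginning of this post, could look like this:

\Gamma = \{ \text{x} : \text{Int}, \; \text{plus} : \text{Int} \rightarrow \text{Int} \rightarrow \text{Int} \}

In this context, both \text{x} : \text{Int} \in \Gamma and \text{plus} : \text{Int} \rightarrow \text{Int} \rightarrow \text{Int} \in \Gamma hold.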

Let’s update our rules with this new addition.

The integer rule just needs a slight adjustment:

\frac{}{\Gamma \vdash n : \text{Int}}

The same is true for the application rule:

\frac{\Gamma \vdash e_1 : \tau_1 \rightarrow \tau_2 \quad \Gamma \vdash e_2 : \tau_1}{\Gamma \vdash e_1 \; e_2 : \tau_2}

And finally, we can represent the variable rule:

\frac{x : \tau \in \Gamma}{\Gamma \vdash x : \tau}

In these three expressions, we’ve captured all of the rules that we’ve seen so far. It’s important to know that these rules leave out the process of unification altogether: we use unification to find types that satisfy the rules.

Checking Case Expressions

So far, we’ve only checked types of numbers, applications, and variables. Our language has more than that, though!

Binary operators are by far the simplest to extend our language with; we can simply say that (+) is a function of type Int -> Int -> Int, and that x+y is the same as (+) x y. This way, we simply translate operators into function application, and the same typing rules apply.
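
If we wanted to spell this out as a rule rather than desugar, it would look something like this (a sketch; with the translation above, the application rule already covers it):

\frac{\Gamma \vdash e_1 : \text{Int} \quad \Gamma \vdash e_2 : \text{Int}}{\Gamma \vdash e_1 + e_2 : \text{Int}}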

Next up, we have case expressions. This is one of the two places where we will introduce new variables into the context, and also a place where we will need several rules.

Let’s first take a look at the whole case expression rule:

\frac{\Gamma \vdash e : \tau \quad \text{matcht}(\tau, p_i) = b_i \quad \Gamma,b_i \vdash e_i : \tau_c}{\Gamma \vdash \text{case} \; e \; \text{of} \; \{ (p_1,e_1) \ldots (p_n, e_n) \} : \tau_c}

This is a lot more complicated than the other rules we’ve seen, and we’ve used some notation that we haven’t seen before. Let’s take this step by step:

  1. e : \tau, in this case, means that the expression between case and of is of type \tau.
  2. \text{matcht}(\tau, p_i) = b_i means that the pattern p_i can match a value of type \tau, producing additional type pairs b_i. We need b_i because a pattern such as Cons x xs will introduce new type information, namely \text{x} : \text{Int} and \text{xs} : \text{List}.
  3. \Gamma,b_i \vdash e_i : \tau_c means that each individual branch can be deduced to have the type \tau_c, using the previously existing context \Gamma, with the addition of b_i, the new type information.
  4. Finally, the conclusion is that the case expression, if all the premises are met, is of type \tau_c.

For completeness, let’s add rules for \text{matcht}(\tau, p_i) = b_i. We’ll need two: one for the “basic” pattern, which always matches the value and binds a variable to it, and one for a constructor pattern, that matches a constructor and its parameters.

Let’s define v to be a variable name in the context of a pattern. For the basic pattern:

\frac{}{\text{matcht}(\tau, v) = \{ v : \tau \}}

For the next rule, let’s define c to be a constructor name. The rule for the constructor pattern, then, is:

\frac{\Gamma \vdash c : \tau_1 \rightarrow \ldots \rightarrow \tau_n \rightarrow \tau}{\text{matcht}(\tau, c \; v_1 \ldots v_n) = \{ v_1 : \tau_1, \ldots, v_n : \tau_n \}}

This rule means that whenever we have a pattern in the form of a constructor applied to n variable names, if the constructor takes n arguments of types \tau_1 through \tau_n, then each variable will have a corresponding type.
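
For instance, given a List data type declared as data List = { Nil, Cons Int List }, the constructor Cons has the type Int -> List -> List, and the rule instantiates to:

\frac{\Gamma \vdash \text{Cons} : \text{Int} \rightarrow \text{List} \rightarrow \text{List}}{\text{matcht}(\text{List}, \text{Cons} \; x \; xs) = \{ x : \text{Int}, xs : \text{List} \}}

This is exactly the new type information we said a pattern like Cons x xs should introduce.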

We didn’t include lambda expressions in our syntax, and thus we won’t need typing rules for them, so it actually seems like we’re done with the first draft of our type rules.

Implementation

Let’s work towards some code. Before we write anything down though, let’s get a definition of what a “type” is, in the context of our type checker. Let’s say a type is one of 3 things:

  1. A “base type”, like Int, Bool, or List.
  2. A type that’s a function from one type to another.
  3. A placeholder / type variable (like the kind we used for type inference).

We represent a placeholder type with a unique string, such as “a” or “b”, and this makes our placeholder type class very similar to the base type class.

From type.hpp, entire file
#pragma once
#include <memory>
#include <map>
#include <string>

struct type {
    virtual ~type() = default;
};

using type_ptr = std::shared_ptr<type>;

struct type_var : public type {
    std::string name;

    type_var(std::string n)
        : name(std::move(n)) {}
};

struct type_base : public type {
    std::string name;

    type_base(std::string n) 
        : name(std::move(n)) {}
};

struct type_arr : public type {
    type_ptr left;
    type_ptr right;

    type_arr(type_ptr l, type_ptr r)
        : left(std::move(l)), right(std::move(r)) {}
};

struct type_mgr {
    // Counter used to generate fresh placeholder type names.
    int last_id = 0;
    // Maps placeholder type names to the types they have been bound to.
    std::map<std::string, type_ptr> types;

    std::string new_type_name();
    type_ptr new_type();
    type_ptr new_arrow_type();

    // Solves the equation "l = r", binding placeholders as needed; throws on failure.
    void unify(type_ptr l, type_ptr r);
    // Follows placeholder bindings; sets var to the last unbound placeholder, if any.
    type_ptr resolve(type_ptr t, type_var*& var);
    // Records that the placeholder named s stands for the type t.
    void bind(const std::string& s, type_ptr t);
};

As you can see, we also declared a type_mgr, or type manager, class. This class will keep the state used for generating more placeholder type names, as well as the information about which placeholder type is mapped to what. We gave it 3 methods:

  1. unify, which unifies two types, solving type equations as described above.
  2. resolve, which follows placeholder types that have already been bound, for as long as possible.
  3. bind, which records that a placeholder type stands for some other type.

To support generating placeholder types, we also give the manager class the method new_type_name, as well as two convenience methods to create placeholder types: new_type (in the form a) and new_arrow_type (in the form a->b).

Let’s take a look at the implementation now:

From type.cpp, entire file
#include "type.hpp"
#include <sstream>
#include <algorithm>

std::string type_mgr::new_type_name() {
    int temp = last_id++;
    std::string str = "";

    while(temp != -1) {
        str += (char) ('a' + (temp % 26));
        temp = temp / 26 - 1;
    }

    std::reverse(str.begin(), str.end());
    return str;
}

type_ptr type_mgr::new_type() {
    return type_ptr(new type_var(new_type_name()));
}

type_ptr type_mgr::new_arrow_type() {
    return type_ptr(new type_arr(new_type(), new_type()));
}

type_ptr type_mgr::resolve(type_ptr t, type_var*& var) {
    type_var* cast;

    var = nullptr;
    while((cast = dynamic_cast<type_var*>(t.get()))) {
        auto it = types.find(cast->name);

        if(it == types.end()) {
            var = cast;
            break;
        }
        t = it->second;
    }

    return t;
}

void type_mgr::unify(type_ptr l, type_ptr r) {
    type_var* lvar;
    type_var* rvar;
    type_arr* larr;
    type_arr* rarr;
    type_base* lid;
    type_base* rid;

    l = resolve(l, lvar);
    r = resolve(r, rvar);

    if(lvar) {
        bind(lvar->name, r);
        return;
    } else if(rvar) {
        bind(rvar->name, l);
        return;
    } else if((larr = dynamic_cast<type_arr*>(l.get())) &&
            (rarr = dynamic_cast<type_arr*>(r.get()))) {
        unify(larr->left, rarr->left);
        unify(larr->right, rarr->right);
        return;
    } else if((lid = dynamic_cast<type_base*>(l.get())) &&
            (rid = dynamic_cast<type_base*>(r.get()))) {
        if(lid->name == rid->name) return;
    }

    throw 0;
}

void type_mgr::bind(const std::string& s, type_ptr t) {
    type_var* other = dynamic_cast<type_var*>(t.get());
    
    if(other && other->name == s) return;
    types[s] = t;
}

Here, new_type_name is actually pretty boring. My goal was to generate type names like a, then b, eventually getting to z, and then move on to aa. This provides us with an endless stream of placeholder types.
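
Here is a tiny standalone test program that prints the first few generated names (a sketch, not part of the compiler; it assumes type.hpp and type.cpp from above are compiled alongside it):

// Standalone sanity check for type_mgr::new_type_name.
#include <iostream>
#include "type.hpp"

int main() {
    type_mgr mgr;
    // Prints the first 30 generated names: a b c ... z aa ab ac ad
    for(int i = 0; i < 30; i++) {
        std::cout << mgr.new_type_name() << " ";
    }
    std::cout << std::endl;
}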

Time for the interesting functions. resolve keeps trying dynamic_cast to a type variable, and if that succeeds, then either:

  1. It’s a type variable that’s already been set to something, in which case we try to resolve the thing it was set to (t = it->second).
  2. It’s a type variable that hasn’t been set to something. We set var to it (the caller will use this info), and stop our resolution loop (break).

In unify, we start by calling resolve - we don’t want to accidentally compare a and b (and try to bind a to b) when a is already bound to something else (like Int).

From resolve, we will have lvar and rvar set to something not NULL if l or r were type variables that haven’t yet been bound (we defined resolve to behave this way). So, if one of the variables is not NULL, we try to bind it.

Next, unify checks if both types are either base types or arrow types. If they’re base types, it compares their names, and if they’re arrow types, it recursively unifies their children. We return in all cases when unification succeeds, and throw an exception (currently 0) if all the cases fell through, meaning unification failed.

Finally, bind places the type we’re binding into the types map, but not before it checks that the type being bound isn’t just the same type variable as the name we’re binding it to (since, again, a=a is not a useful equation).
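
To get a feel for how unify, resolve, and bind interact, here is a small standalone sketch (again assuming type.hpp and type.cpp from above; the names are only for illustration):

// Unify a fresh placeholder with Int -> a, then check what it resolves to.
#include <iostream>
#include "type.hpp"

int main() {
    type_mgr mgr;

    type_ptr f = mgr.new_type();               // placeholder for some unknown type
    type_ptr result = mgr.new_type();          // placeholder for the result type
    type_ptr int_type(new type_base("Int"));
    type_ptr arrow(new type_arr(int_type, result));

    // "f is used as a function from Int to something": f = Int -> result
    mgr.unify(f, arrow);

    // Resolving f now follows the binding we just created and yields the arrow type.
    type_var* var;
    type_ptr resolved = mgr.resolve(f, var);
    std::cout << (dynamic_cast<type_arr*>(resolved.get()) != nullptr) << std::endl; // prints 1

    // Unifying two different base types, on the other hand, fails and throws.
    try {
        mgr.unify(int_type, type_ptr(new type_base("Bool")));
    } catch(int) {
        std::cout << "unification failed, as expected" << std::endl;
    }
}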

We now have a unification algorithm, but we still need to implement our rules. Our rules usually include three things: an environment Γ\Gamma, an expression ee, and a type τ\tau. We will represent this as a method on ast, our struct for an expression tree. This method will take an environment and return a type.

Environment

How should we implement our environment? Conceptually, an environment maps a name string to a type. So naively, we can implement this simply using an std::map. But observe that we only extend the environment in one case so far: a case expression. In a case expression, we have the base environment \Gamma, and for each branch, we extend it with the bindings produced by the pattern match. Each branch receives a modified copy of the original environment, one that doesn’t see the effects of the other branches.

Using our naive approach, we’d create a new std::map for each branch that’s a copy of the original environment, and place into it the new pairs. But this means we’ll need to copy a map for each branch of the pattern!

There’s a better way. We structure our environment like a linked list. Each node in the linked list contains an std::map. When we encounter a new scope (such as in a case branch), we create a new such node, and add all variables introduced in this scope to that node’s map. We make it point to our current environment. Then, we pass around the new node as the environment.

When we look up a variable name, we first look in this node we created. If we don’t find the variable we’re looking for, we move on to the next node. The benefit of this is that we won’t be re-creating a map for each branch, instead creating only a small node with the changes. Let’s implement exactly that. The header:

From env.hpp, entire file
#pragma once
#include <map>
#include "type.hpp"

struct type_env {
    std::map<std::string, type_ptr> names;
    type_env const* parent = nullptr;

    type_env(type_env const* p)
        : parent(p) {}
    type_env() : type_env(nullptr) {}

    type_ptr lookup(const std::string& name) const;
    void bind(const std::string& name, type_ptr t);
    type_env scope() const;
};

And the source file:

From env.cpp, entire file
#include "env.hpp"

type_ptr type_env::lookup(const std::string& name) const {
    auto it = names.find(name);
    if(it != names.end()) return it->second;
    if(parent) return parent->lookup(name);
    return nullptr;
}

void type_env::bind(const std::string& name, type_ptr t) {
    names[name] = t;
}

type_env type_env::scope() const {
    return type_env(this);
}

Nothing should seem too surprising. Of note is the fact that we’re not using smart pointers for the parent pointer in scope, so the child environment we create during the call becomes invalid if the parent goes out of scope or is otherwise released. We’re gearing this towards creating new environments on the stack, and we’ll take care not to let a parent go out of scope before its child.
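
Here is a quick standalone illustration of the scoping behavior (a sketch assuming env.hpp, env.cpp, and type.hpp from this post):

// A child environment sees its own bindings first, then falls back to the parent.
#include <iostream>
#include "env.hpp"

int main() {
    type_env global;
    global.bind("x", type_ptr(new type_base("Int")));

    type_env branch = global.scope();
    branch.bind("xs", type_ptr(new type_base("List")));

    std::cout << (branch.lookup("x") != nullptr) << std::endl;  // 1: found via the parent
    std::cout << (branch.lookup("xs") != nullptr) << std::endl; // 1: found locally
    std::cout << (global.lookup("xs") != nullptr) << std::endl; // 0: the parent cannot see the child's bindings
}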

Typechecking Expressions

At last, it’s time to declare a new type checking method. We start with a signature inside ast:

virtual type_ptr typecheck(type_mgr& mgr, const type_env& env) const;

We also implement the \text{matcht} function as a method match on pattern with the following signature:

virtual void match(type_ptr t, type_mgr& mgr, type_env& env) const;

We declare this in every subclass of ast. Let’s take a look at the implementation now:

From ast.cpp, entire file
#include "ast.hpp"

std::string op_name(binop op) {
    switch(op) {
        case PLUS: return "+";
        case MINUS: return "-";
        case TIMES: return "*";
        case DIVIDE: return "/";
    }
    throw 0;
}

type_ptr ast_int::typecheck(type_mgr& mgr, const type_env& env) const {
    return type_ptr(new type_base("Int"));
}

type_ptr ast_lid::typecheck(type_mgr& mgr, const type_env& env) const {
    return env.lookup(id);
}

type_ptr ast_uid::typecheck(type_mgr& mgr, const type_env& env) const {
    return env.lookup(id);
}

type_ptr ast_binop::typecheck(type_mgr& mgr, const type_env& env) const {
    type_ptr ltype = left->typecheck(mgr, env);
    type_ptr rtype = right->typecheck(mgr, env);
    type_ptr ftype = env.lookup(op_name(op));
    if(!ftype) throw 0;

    type_ptr return_type = mgr.new_type();
    type_ptr arrow_one = type_ptr(new type_arr(rtype, return_type));
    type_ptr arrow_two = type_ptr(new type_arr(ltype, arrow_one));

    mgr.unify(arrow_two, ftype);
    return return_type;
}

type_ptr ast_app::typecheck(type_mgr& mgr, const type_env& env) const {
    type_ptr ltype = left->typecheck(mgr, env);
    type_ptr rtype = right->typecheck(mgr, env);

    type_ptr return_type = mgr.new_type();
    type_ptr arrow = type_ptr(new type_arr(rtype, return_type));
    mgr.unify(arrow, ltype);
    return return_type;
}

type_ptr ast_case::typecheck(type_mgr& mgr, const type_env& env) const {
    type_ptr case_type = of->typecheck(mgr, env);
    type_ptr branch_type = mgr.new_type();

    for(auto& branch : branches) {
        type_env new_env = env.scope();
        branch->pat->match(case_type, mgr, new_env);
        type_ptr curr_branch_type = branch->expr->typecheck(mgr, new_env);
        mgr.unify(branch_type, curr_branch_type);
    }

    return branch_type;
}

void pattern_var::match(type_ptr t, type_mgr& mgr, type_env& env) const {
    env.bind(var, t);
}

void pattern_constr::match(type_ptr t, type_mgr& mgr, type_env& env) const {
    type_ptr constructor_type = env.lookup(constr);
    if(!constructor_type) throw 0;

    for(int i = 0; i < params.size(); i++) {
        type_arr* arr = dynamic_cast<type_arr*>(constructor_type.get());
        if(!arr) throw 0;

        env.bind(params[i], arr->left);
        constructor_type = arr->right;
    }

    mgr.unify(t, constructor_type);
    type_base* result_type = dynamic_cast<type_base*>(constructor_type.get());
    if(!result_type) throw 0;
}

The typecheck implementation for ast_int, ast_lid, and ast_uid is quite intuitive. The type of a number is always Int, and variable names we simply look up in the environment.

For ast_binop and ast_app, we check the types of the children first, and store their resulting types. In the case of binop, we assume that the type of the binary operator (such as +) is in the environment, and throw if it isn’t. Then, we know that the operator has to be a function of at least two arguments, whose parameter types are the types of the left and right children of the operation. We also know its actual type, since it’s in the environment. We unify these two types, and then return.

In the case of app, we know that the left side (the thing being applied), has to be a function from the type of the right child. We also know its computed type. Once again, we unify the two, and then return.

The type checking for case offloads most of the work onto the match method on patterns. We start by computing the type of the expression we’re matching. Then, for each branch, we create a new scope, and populate it with the new bindings created by the pattern match (match does this). Once we have a new environment, we typecheck the body of the branch. Finally, we unify the type of the branch’s body with the previous body types (since branches of the case expression have to have the same type).

Let’s take a look at match now. The match method for a variable pattern is very simple: it always introduces a new variable binding, and stops there. The match method for a constructor pattern is more interesting. First, it looks up the constructor that the pattern is trying to match. This needs to exist, so if we don’t find a type for it, we throw. Next, for each variable name in the pattern, we know that there has to be a corresponding parameter in the constructor. Because of this, we cast the constructor type to a function type (throwing if it isn’t one), and create a new mapping from the variable name to the left side (parameter) of the function type. We then move on to examine the right side of the function type, and the remaining variables of the pattern.

Once we have gone through the parameters, what remains should be the type of the thing the constructor creates. Not only that, but the remaining type should match the type of the value the pattern is being matched against. To ensure this, we unify the two types.

There’s only one thing that can still go wrong. The value and the pattern can both be a partial application. For ease of implementation, we will not allow this case. Thus, we make sure the final type is a base type, and throw otherwise.

Typechecking Definitions

This looks good, but we’re not done yet. We can type check expressions, but our program isn’t made up of expressions. Rather, it’s made up of definitions. Further, we can’t look at the definitions in isolation. Consider these two functions:

defn double x = { x + x }
defn quadruple x = { double (double x) }

Assuming we have an environment containing x when we typecheck the body of double, our algorithm will work out fine. But what about quadruple? It needs to know what double is, or at least that it exists.

We could also envision two mutually recursive functions. Let’s assume we have the functions eq and if in global scope. We can write two functions, even and odd:

defn even x = { if (eq x 0) True (odd (x-1)) }
defn odd x = { if (eq x 0) False (even (x-1)) }

odd needs to know about even, and even needs to know about odd. Thus, before we do any checking, we need to populate a global environment with some type for each function we declare. We will use what we know about the function for our initial declaration: if the function takes two parameters, its type will be a -> b -> c. If it takes one parameter, its type will be a -> b. What’s more, though, is that we need to make sure that the function’s parameters are passed in the environment when checking its body, and that these parameters’ types are the same as the placeholder types in the function’s “declaration”.
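
Concretely, the global environment for this example would initially contain nothing but fresh placeholder types for the two functions (the particular placeholder names don’t matter):

\text{even} : a \rightarrow b \qquad \text{odd} : c \rightarrow d

Typechecking the bodies then unifies these placeholders with more concrete types.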

We’ll typecheck the program in two passes. First, we’ll go through each definition, and add any functions it declares to the global scope. Then, we will go through each definition again, and, if it’s a function, typecheck its body using the previously fleshed out global scope.

We’ll add two functions, typecheck_first and typecheck_second corresponding to these two stages. Their signatures:

virtual void typecheck_first(type_mgr& mgr, type_env& env);
virtual void typecheck_second(type_mgr& mgr, const type_env& env) const;

Furthermore, in the definition_defn, we will keep an std::vector of type_ptr, in which the first element is the type of the last parameter, and so on. We switch around the order of arguments because we build up the a -> b -> ... type signature from the right (-> is right associative), and thus we’ll be creating the types right-to-left, too. We also add a type_ptr field which holds the function’s return type. We keep these two things in the definition_defn so that they persist between the two typechecking stages: we want to use the types from the first stage to aid in checking the body in the second stage.

Here’s the code for the implementation:

From definition.cpp, entire file
#include "ast.hpp"

void definition_defn::typecheck_first(type_mgr& mgr, type_env& env) {
    return_type = mgr.new_type();
    type_ptr full_type = return_type;

    for(auto it = params.rbegin(); it != params.rend(); it++) {
        type_ptr param_type = mgr.new_type();
        full_type = type_ptr(new type_arr(param_type, full_type));
        param_types.push_back(param_type);
    }

    env.bind(name, full_type);
}

void definition_defn::typecheck_second(type_mgr& mgr, const type_env& env) const {
    type_env new_env = env.scope();
    auto param_it = params.begin();
    auto type_it = param_types.rbegin();

    while(param_it != params.end() && type_it != param_types.rend()) {
        new_env.bind(*param_it, *type_it);
        param_it++;
        type_it++;
    }

    type_ptr body_type = body->typecheck(mgr, new_env);
    mgr.unify(return_type, body_type);
}

void definition_data::typecheck_first(type_mgr& mgr, type_env& env) {
    type_ptr return_type = type_ptr(new type_base(name));

    for(auto& constructor : constructors) {
        type_ptr full_type = return_type;

        for(auto& type_name : constructor->types) {
            type_ptr type = type_ptr(new type_base(type_name));
            full_type = type_ptr(new type_arr(type, full_type));
        }

        env.bind(constructor->name, full_type);
    }
}

void definition_data::typecheck_second(type_mgr& mgr, const type_env& env) const {
    // Nothing
}

And finally, our updated main:

From main.cpp, entire file
#include "ast.hpp"
#include "parser.hpp"
#include "type.hpp"
#include <iostream>

void yy::parser::error(const std::string& msg) {
    std::cout << "An error occured: " << msg << std::endl;
}

extern std::vector<definition_ptr> program;

void typecheck_program(const std::vector<definition_ptr>& prog) {
    type_mgr mgr;
    type_env env;

    type_ptr int_type = type_ptr(new type_base("Int")); 
    type_ptr binop_type = type_ptr(new type_arr(
                int_type,
                type_ptr(new type_arr(int_type, int_type))));

    env.bind("+", binop_type);
    env.bind("-", binop_type);
    env.bind("*", binop_type);
    env.bind("/", binop_type);

    for(auto& def : prog) {
        def->typecheck_first(mgr, env);
    }

    for(auto& def : prog) {
        def->typecheck_second(mgr, env);
    }
}

int main() {
    yy::parser parser;
    parser.parse();
    typecheck_program(program);
    std::cout << program.size() << std::endl;
}

Notice that we manually add the types for our binary operators to the environment.

Let’s run our project on a few examples. On our two “bad” examples, we get the very eloquent error:

terminate called after throwing an instance of 'int'
[2]    9776 abort (core dumped)  ./a.out < bad2.txt

That’s what we get for throwing 0.

So far, our program has thrown in 100% of cases. Let’s verify it actually accepts valid programs! We’ll try our very first example from today, as well as these two:

From works2.txt, entire file
defn add x y = { x + y }
defn double x = { add x x }
defn main = { double 163 }
From works3.txt, entire file
data List = { Nil, Cons Int List }
defn length l = {
    case l of {
        Nil -> { 0 }
        Cons x xs -> { 1 + length xs }
    }
}

All of our examples print the number of declarations in the program, which means they don’t throw 0. And so, we have typechecking! Before we look at how we will execute our source code, we will slow down and make quality of life improvements in our codebase in Part 4 - Small Improvements.