In part 8, we wrote some pretty interesting programs in our little language. We successfully expressed arithmetic and recursion. But there’s one thing that we cannot express in our language without further changes: an if statement.

Suppose we didn’t want to add a special if/else expression into our language. Thanks to lazy evaluation, we can express it using a function:

defn if c t e = {
    case c of {
        True -> { t }
        False -> { e }
    }
}

But an issue still remains: so far, our compiler remains monomorphic. That is, a particular function can only have one possible type for each one of its arguments. With our current setup, something like this [note: In a polymorphically typed language, the inner if would just evaluate to False, and the whole expression to 3. ]

if (if True False True) 11 3

This is because, for this to work, both of the following would need to hold (borrowing some of our notation from the typechecking post):

if:BoolIntIntInt \text{if} : \text{Bool} \rightarrow \text{Int} \rightarrow \text{Int} \rightarrow \text{Int} if:BoolBoolBoolBool \text{if} : \text{Bool} \rightarrow \text{Bool} \rightarrow \text{Bool} \rightarrow \text{Bool}

But using our rules so far, such a thing is impossible, since there is no way for Int\text{Int} to be unified with Bool\text{Bool}. We need a more powerful set of rules to describe our program’s types.

Hindley-Milner Type System

One such powerful set of rules is the Hindley-Milner type system, which we have previously alluded to. In fact, the rules we came up with were already very close to Hindley-Milner, with the exception of two: generalization and instantiation. It’s been quite a while since the last time we worked on typechecking, so I’m going to present a table with these new rules, as well as all of the ones that we [note: The rules aren't quite the same as the ones we used earlier; note that σ\sigma is used in place of τ\tau in the first rule, for instance. These changes are slight, and we'll talk about how the rules work together below. ] I will also give a quick summary of each of these rules.

Rule Name and Description
x:σΓΓx:σ \frac {x:\sigma \in \Gamma} {\Gamma \vdash x:\sigma} Var: If the variable xx is known to have some polymorphic type σ\sigma then an expression consisting only of that variable is of that type.
Γe1:τ1τ2Γe2:τ1Γe1  e2:τ2 \frac {\Gamma \vdash e_1 : \tau_1 \rightarrow \tau_2 \quad \Gamma \vdash e_2 : \tau_1} {\Gamma \vdash e_1 \; e_2 : \tau_2} App: If an expression e1e_1, which is a function from monomorphic type τ1\tau_1 to another monomorphic type τ2\tau_2, is applied to an argument e2e_2 of type τ1\tau_1, then the result is of type τ2\tau_2.
Γe:τmatcht(τ,pi)=biΓ,biei:τcΓcase  e  of  {(p1,e1)(pn,en)}:τc \frac {\Gamma \vdash e : \tau \quad \text{matcht}(\tau, p_i) = b_i \quad \Gamma,b_i \vdash e_i : \tau_c} {\Gamma \vdash \text{case} \; e \; \text{of} \; \{ (p_1,e_1) \ldots (p_n, e_n) \} : \tau_c } Case: This rule is not part of Hindley-Milner, and is specific to our language. If the expression being case-analyzed is of type τ\tau and each branch (pi,ei)(p_i, e_i) is of the same type τc\tau_c when the pattern pip_i works with type τ\tau producing extra bindings bib_i, the whole case expression is of type τc\tau_c.
Γe:σσσΓe:σ \frac{\Gamma \vdash e : \sigma' \quad \sigma' \sqsubseteq \sigma} {\Gamma \vdash e : \sigma} Inst (New): If type σ\sigma is an instantiation (or specialization) of type σ\sigma' then an expression of type σ\sigma' is also an expression of type σ\sigma.
Γe:σα∉free(Γ)Γe:α.σ \frac {\Gamma \vdash e : \sigma \quad \alpha \not \in \text{free}(\Gamma)} {\Gamma \vdash e : \forall \alpha . \sigma} Gen (New): If an expression has a type with free variables, this rule allows us generalize it to allow all possible types to be used for these free variables.

Here, there is a distinction between different forms of types. First, there are monomorphic types, or monotypes, τ\tau, which are types such as Int\text{Int}, IntBool\text{Int} \rightarrow \text{Bool}, aba \rightarrow b and so on. These types are what we’ve been working with so far. Each of them represents one (hence, “mono-"), concrete type. This is obvious in the case of Int\text{Int} and IntBool\text{Int} \rightarrow \text{Bool}, but for aba \rightarrow b things are slightly less clear. Does it really represent a single type, if we can put an arbtirary thing for aa? The answer is “yes”! Although aa is not currently known, it stands in place of another monotype, which is yet to be determined.

So, how do we represent polymorphic types, like that of if\text{if}? We borrow some more notation from mathematics, and use the “forall” quantifier:

if:a  .  Boolaaa \text{if} : \forall a \; . \; \text{Bool} \rightarrow a \rightarrow a \rightarrow a

This basically says, “the type of if\text{if} is Boolaaa\text{Bool} \rightarrow a \rightarrow a \rightarrow a for all possible aa”. This new expression using “forall” is what we call a type scheme, or a polytype σ\sigma. For simplicity, we only allow “forall” to be at the front of a polytype. That is, expressions like ab  .  bba \rightarrow \forall b \; . \; b \rightarrow b are not valid polytypes as far as we’re concerned.

It’s key to observe that only some of the typing rules in the above table use polytypes (σ\sigma). Whereas expressions consisting of a single variable can be polymorphically typed, this is not true for function applications and case expressions. In fact, according to our rules, there is no way to introduce a polytype anywhere into our system!

The reason for this is that we only allow polymorphism at certain locations. In the Hindley-Milner type system, this is called Let-Polymorphism, which means that only in let/in expressions can variables or expressions be given a polymorphic type. We, on the other hand, do not (yet) have let/in expressions, so our polymorphism is further limited: only functions (and data type constructors) can be polymorphically typed.

Let’s talk about what Inst does, and what “\sqsubseteq” means. First, why don’t we go ahead and write the formal inference rule for \sqsubseteq:

τ={αiτi}τβi∉free(α1...αn  .  τ)α1...αn  .  τβ1...βm  .  τ \frac {\tau'=\{\alpha_i \mapsto \tau_i \}\tau \quad \beta_i \not \in \text{free}(\forall \alpha_1...\forall \alpha_n \; . \; \tau)} {\forall \alpha_1 ... \forall \alpha_n \; . \; \tau \sqsubseteq \forall \beta_1 ... \forall \beta_m \; . \; \tau'}

In my opinion, this is one of the more confusing inference rules we have to deal with in Hindley-Milner. In principle, though, it’s not too hard to understand. σσ\sigma' \sqsubseteq \sigma says “σ\sigma' is more general than σ\sigma”. Alternatively, we can interpret it as “σ\sigma is an instance of σ\sigma'”.

What does it mean for one polytype to be more general than another? Intuitively, we might say that a  .  aa\forall a \; . \; a \rightarrow a is more general than IntInt\text{Int} \rightarrow \text{Int}, because the former type can represent the latter, and more. Formally, we define this in terms of substitutions. A substitution {ατ}\{\alpha \mapsto \tau \} replaces a variable α\alpha with a monotype τ\tau. If we can use a substitution to convert one type into another, then the first type (the one on which the substitution was performed) is more general than the resulting type. In our previous example, since we can apply the substitution {aInt}\{a \mapsto \text{Int}\} to get IntInt\text{Int} \rightarrow \text{Int} from a  .  aa\forall a \; . \; a \rightarrow a, the latter type is more general; using our notation, a  .  aaIntInt\forall a \; . \; a \rightarrow a \sqsubseteq \text{Int} \rightarrow \text{Int}.

That’s pretty much all that the rule says, really. But what about the whole thing with β\beta and free\text{free}? The reason for that part of the rule is that, in principle, we can substitute polytypes into polytypes. However, we can’t do this using {ασ}\{ \alpha \mapsto \sigma \}. Consider, for example:

{ab  .  bb}  Boolaa=?Bool(b  .  bb)b  .  bb \{ a \mapsto \forall b \; . \; b \rightarrow b \} \; \text{Bool} \rightarrow a \rightarrow a \stackrel{?}{=} \text{Bool} \rightarrow (\forall b \; . \; b \rightarrow b) \rightarrow \forall b \; . \; b \rightarrow b

Hmm, this is not good. Didn’t we agree that we only want quantifiers at the front of our types? Instead, to make that substitution, we only substitute the monotype bbb \rightarrow b, and then add the b\forall b at the front. But to do this, we must make sure that bb doesn’t occur anywhere in the original type a  .  Boolaa\forall a \; . \; \text{Bool} \rightarrow a \rightarrow a (otherwise we can accidentally generalize too much). So then, our concrete inference rule is as follows:

Bool(bb)bb={a(bb)}  Boolaab∉free(a  .  Boolaa)=a  .  Boolaab  .  Bool(bb)bb \frac { \begin{gathered} \text{Bool} \rightarrow (b \rightarrow b) \rightarrow b \rightarrow b =\{a \mapsto (b \rightarrow b) \} \; \text{Bool} \rightarrow a \rightarrow a \\ b \not \in \text{free}(\forall a \; . \; \text{Bool} \rightarrow a \rightarrow a) = \varnothing \end{gathered} } {\forall a \; . \; \text{Bool} \rightarrow a \rightarrow a \sqsubseteq \forall b \; . \; \text{Bool} \rightarrow (b \rightarrow b) \rightarrow b \rightarrow b}

In the above rule we:

  1. Replaced aa with bbb \rightarrow b, getting rid of aa in the quantifier.
  2. Observed that bb is not a free variable in the original type, and thus can be generalized.
  3. Added the generalization for bb to the front of the resulting type.

Now, Inst just allows us to perform specialization / substitution as many times as we need to get to the type we want.

A New Typechecking Algorithm

Alright, now we have all these rules. How does this change our typechecking algorithm? How about the following:

  1. To every declared function, assign the type ayza \rightarrow … \rightarrow y \rightarrow z, where [note: Of course, there can be more or less than 25 arguments to any function. This is just a generalization; we use as many input types as are needed. ] and zz is the function’s return type.
  2. We typecheck each declared function, using the Var, Case, App, and Inst rules.
  3. Whatever type variables we don’t fill in, we assume can be filled in with any type, and use the Gen rule to sprinkle polymorphism where it is needed.

Maybe this is enough. Let’s go through an example. Suppose we have three functions:

defn if c t e = {
    case c of {
        True -> { t }
        False -> { e }
    }
}
defn testOne = { if True False True }
defn testTwo = { if True 0 1 }

If we go through and typecheck them top-to-bottom, the following happens:

  1. We start by assuming if:abcd\text{if} : a \rightarrow b \rightarrow c \rightarrow d, testOne:e\text{testOne} : e and testTwo:f\text{testTwo} : f.
  2. We look at if. We infer the type of c to be Bool\text{Bool}, and thus, a=Boola = \text{Bool}. We also infer that b=cb = c, since they occur in two branches of the same case expression. Finally, we infer that c=dc = d, since whatever the case expression returns becomes the return value of the function. Thus, we come out knowing that if:Boolbbb\text{if} : \text{Bool} \rightarrow b \rightarrow b \rightarrow b.
  3. Now, since we never figured out bb, we use Gen: if:b  .  Boolbbb\text{if} : \forall b \; . \; \text{Bool} \rightarrow b \rightarrow b \rightarrow b. Like we’d want, if works with all types, as long as both its inputs are of the same type.
  4. When we typecheck the body of testOne, we use Inst to turn the above type for if into a single, monomorphic instance. Then, type inference proceeds as before, and all is well.
  5. When we typecheck the body of testTwo, we use Inst again, instantiating a new monotype, and all is well again.

So far, so good. But what if we started from the bottom, and went to the top?

  1. We start by assuming if:abcd\text{if} : a \rightarrow b \rightarrow c \rightarrow d, testOne:e\text{testOne} : e and testTwo:f\text{testTwo} : f.
  2. We look at testTwo. We infer that a=Boola = \text{Bool} (since we pass in True to if). We also infer that b=Intb = \text{Int}, and that c=Intc = \text{Int}. Not yet sure of the return type of if, this is where we stop. We are left with the knowledge that f=df = d (because the return type of if is the return type of testTwo), but that’s about it. Since ff is no longer free, we don’t generalize, and conclude that testTwo:f\text{testTwo} : f.
  3. We look at testOne. We infer that a=Boola = \text{Bool} (already known). We also infer that b=Boolb = \text{Bool}, and that c=Boolc = \text{Bool}. But wait a minute! This is not right. We are back to where we started, with a unification error!

What went wrong? I claim that we typechecked the functions that used if before we typechecked if itself, which led us to infer a less-than-general type for if. This less-than-general type was insufficient to correctly check the whole program.

To address this, we enforce a particular order of type inference on our declaration, guided by dependencies between functions. Haskell, which has to deal with a similar issue, has a section in the 2010 report on this. In short:

  1. We find the [note: A transitive closure of a vertex in a graph is the list of all vertices reachable from that original vertex. Check out the Wikipedia page on this. ] of the function dependencies. We define a function ff to be dependent on another function gg if ff calls gg. The transitive closure will help us find functions that are related indirectly. For instance, if gg also depends on hh, then the transitive closure of ff will include hh, even if ff directly doesn’t use hh.
  2. We isolate groups of mutually dependent functions. If ff depends on gg and gg depends ff, they are placed in one group. We then construct a dependency graph of these groups.
  3. We compute a topological order of the group graph. This helps us typecheck the dependencies of functions before checking the functions themselves. In our specific case, this would ensure we check if first, and only then move on to testOne and testTwo. The order of typechecking within a group does not matter, as long as we generalize only after typechecking all functions in a group.
  4. We typecheck the function groups, and functions within them, following the above topological order.

To find the transitive closure of a graph, we can use Warshall’s Algorithm. This algorithm, with complexity O(V3)O(|V|^3), goes as follows: A,R(i)Bn×nR(0)Afor  k1  to  n  dofor  i1  to  n  dofor  j1  to  n  doR(k)[i,j]R(k1)[i,j]  or  R(k1)[i,k]  and  R(k1)[k,j]return  R(n) \begin{aligned} & A, R^{(i)} \in \mathbb{B}^{n \times n} \\ & \\ & R^{(0)} \leftarrow A \\ & \textbf{for} \; k \leftarrow 1 \; \textbf{to} \; n \; \textbf{do} \\ & \quad \textbf{for} \; i \leftarrow 1 \; \textbf{to} \; n \; \textbf{do} \\ & \quad \quad \textbf{for} \; j \leftarrow 1 \; \textbf{to} \; n \; \textbf{do} \\ & \quad \quad \quad R^{(k)}[i,j] \leftarrow R^{(k-1)}[i,j] \; \textbf{or} \; R^{(k-1)}[i,k] \; \textbf{and} \; R^{(k-1)}[k,j] \\ & \textbf{return} \; R^{(n)} \end{aligned}

In the above notation, R(i)R^{(i)} is the iith matrix RR, and AA is the adjacency matrix of the graph in question. All matrices in the algorithm are from Bn×n\mathbb{B}^{n \times n}, the set of nn by nn boolean matrices. Once this algorithm is complete, we get as output a transitive closure adjacency matrix R(n)R^{(n)}. Mutually dependent functions will be pretty easy to isolate from this matrix. If R(n)[i,j]R^{(n)}[i,j] and R(n)[j,i]R^{(n)}[j,i], then the functions represented by vertices ii and jj depend on each other.

Once we’ve identified the groups, and [note: This might seem like a "draw the rest of the owl" situation, but it really isn't. We'll follow a naive algorithm for findings groups, and for translating function dependencies into group dependencies. This algorithm, in C++, will be presented later on. ] it is time to compute the topological order. For this, we will use Kahn’s Algorithm. The algorithm goes as follows:

Lempty listSset of all nodes with no incoming edgeswhile  S  is non-empty  doremove a node  n  from  Sadd  n  to the end of  Lfor each  node  m  with edge  e  from  n  to  m  doremove edge  e  from the graphif  m  has no other incoming edges  theninsert  m  into  Sif  the graph has edges  thenreturn  error(graph has at least once cycle)elsereturn  L(a topologically sorted order) \begin{aligned} & L \leftarrow \text{empty list} \\ & S \leftarrow \text{set of all nodes with no incoming edges} \\ & \\ & \textbf{while} \; S \; \text{is non-empty} \; \textbf{do} \\ & \quad \text{remove a node} \; n \; \text{from} \; S \\ & \quad \text{add} \; n \; \text{to the end of} \; L \\ & \quad \textbf{for each} \; \text{node} \; m \; \text{with edge} \; e \; \text{from} \; n \; \text{to} \; m \; \textbf{do} \\ & \quad \quad \text{remove edge} \; e \; \text{from the graph} \\ & \quad \quad \textbf{if} \; m \; \text{has no other incoming edges} \; \textbf{then} \\ & \quad \quad \quad \text{insert} \; m \; \text{into} \; S \\ & \\ & \textbf{if} \; \text{the graph has edges} \; \textbf{then} \\ & \quad \textbf{return} \; \text{error} \quad \textit{(graph has at least once cycle)} \\ & \textbf{else} \\ & \quad \textbf{return} \; L \quad \textit{(a topologically sorted order)} \end{aligned}

Note that since we’ve already isolated all mutually dependent functions into groups, our graph will never have cycles, and this algorithm will always succeed. Also note that since we start with nodes with no incoming edges, our list will begin with the groups that should be checked last. This is because a node with no incoming edges might (and probably does) still have outgoing edges, and thus depends on other functions / groups. Like in our successful example, we want to typecheck functions that are depended on first.

Implementation

Let’s start working on a C++ implementation of all of this now. First, I think that we should create a C++ class that will represent our function dependency graph. Let’s call it function_graph. I propose the following definition:

12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
using function = std::string;

struct group {
    std::set<function> members;
};

using group_ptr = std::unique_ptr<group>;

class function_graph {
    using group_id = size_t;

    struct group_data {
        std::set<function> functions;
        std::set<group_id> adjacency_list;
        size_t indegree;
    };

    using data_ptr = std::shared_ptr<group_data>;
    using edge = std::pair<function, function>;
    using group_edge = std::pair<group_id, group_id>;

    std::map<function, std::set<function>> adjacency_lists;
    std::set<edge> edges;

    std::set<edge> compute_transitive_edges();
    void create_groups(
            const std::set<edge>&,
            std::map<function, group_id>&,
            std::map<group_id, data_ptr>&);
    void create_edges(
            std::map<function, group_id>&,
            std::map<group_id, data_ptr>&);
    std::vector<group_ptr> generate_order(
            std::map<function, group_id>&,
            std::map<group_id, data_ptr>&);
    
    public:
    std::set<function>& add_function(const function& f);
    void add_edge(const function& from, const function& to);
    std::vector<group_ptr> compute_order();
};

There’s a lot to unpack here. First of all, we create a type alias function that represents the label of a function in our graph. It is probably most convenient to work with std::string instances, so we settle for that. Next, we define a struct that will represent a single group of mutually dependent functions. Passing this struct by value seems wrong, so we’ll settle for a C++ unique_pt to help carry instances around.

Finally, we arrive at the definition of function_graph. Inside this class, we define a helper struct, group_data, which holds information about an individual group as it is being constructed. This information includes the group’s adjacency list and indegree (both used for Kahn’s topological sorting algorithm), as well as the set of functions in the group (which we will eventually return).

The adjacency_lists and edges fields are the meat of the graph representation. Both of the variables provide a different view of the same graph: adjacency_lists associates with every function a list of functions it depends on, while edges holds a set of tuples describing edges in the graph. Having more than one representation makes it more convenient for us to perform different operations on our graphs.

Next up are some internal methods that perform the various steps we described above:

Following these, we have three public function definitions:

Let’s start by looking at how to implement the internal methods. compute_transitive_edges is a very straightforward implementation of Warshall’s:

 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
std::set<function_graph::edge> function_graph::compute_transitive_edges() {
    std::set<edge> transitive_edges;
    transitive_edges.insert(edges.begin(), edges.end());
    for(auto& connector : adjacency_lists) {
        for(auto& from : adjacency_lists) {
            edge to_connector { from.first, connector.first };
            for(auto& to : adjacency_lists) {
                edge full_jump { from.first, to.first };
                if(transitive_edges.find(full_jump) != transitive_edges.end()) continue;

                edge from_connector { connector.first, to.first };
                if(transitive_edges.find(to_connector) != transitive_edges.end() &&
                        transitive_edges.find(from_connector) != transitive_edges.end())
                    transitive_edges.insert(std::move(full_jump));
            }
        }
    }
    return transitive_edges;
}

Next is create_groups. For each function, we iterate over all other functions. If the other function is mutually dependent with the first function, we add it to the same group. In the outer loop, we skip over functions that have already been added to the group. This is because [note: There is actually a slight difference between "mutual dependence" the way we defined it and "being in the same group", and it lies in the symmetric property of an equivalence relation. We defined a function to depend on another function if it calls that other function. Then, a recursive function depends on itself, but a non-recursive function does not, and therefore does not satisfy the symmetric property. However, as far as we're concerned, a function should be in a group with itself even if it's not recursive. Thus, the real equivalence relation we use is "in the same group as", and consists of "mutual dependence" extended with symmetry. ] is an equivalence relation, which means that if we already added a function to a group, all its group members were also already visited and added.

23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
void function_graph::create_groups(
        const std::set<edge>& transitive_edges,
        std::map<function, group_id>& group_ids,
        std::map<group_id, data_ptr>& group_data_map) {
    group_id id_counter = 0;
    for(auto& vertex : adjacency_lists) {
        if(group_ids.find(vertex.first) != group_ids.end())
            continue;
        data_ptr new_group(new group_data);
        new_group->functions.insert(vertex.first);
        group_data_map[id_counter] = new_group;
        group_ids[vertex.first] = id_counter;
        for(auto& other_vertex : adjacency_lists) {
            if(transitive_edges.find({vertex.first, other_vertex.first}) != transitive_edges.end() &&
                    transitive_edges.find({other_vertex.first, vertex.first}) != transitive_edges.end()) {
                group_ids[other_vertex.first] = id_counter;
                new_group->functions.insert(other_vertex.first);
            }
        }
        id_counter++;
    }
}

Once groups have been created, we use their functions’ edges to create edges for the groups themselves, using create_edges. We avoid creating edges from a group to itself, to prevent unnecessary cycles. While constructing the edges, we also increment the relevant indegree counter.

46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
void function_graph::create_edges(
        std::map<function, group_id>& group_ids,
        std::map<group_id, data_ptr>& group_data_map) {
    std::set<std::pair<group_id, group_id>> group_edges;
    for(auto& vertex : adjacency_lists) {
        auto vertex_id = group_ids[vertex.first];
        auto& vertex_data = group_data_map[vertex_id];
        for(auto& other_vertex : vertex.second) {
            auto other_id = group_ids[other_vertex];
            if(vertex_id == other_id) continue;
            if(group_edges.find({vertex_id, other_id}) != group_edges.end())
                continue;
            group_edges.insert({vertex_id, other_id});
            vertex_data->adjacency_list.insert(other_id);
            group_data_map[other_id]->indegree++;
        }
    }
}

Finally, we apply Kahn’s algorithm to create a topological order in generate_order:

65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
std::vector<group_ptr> function_graph::generate_order(
        std::map<function, group_id>& group_ids,
        std::map<group_id, data_ptr>& group_data_map) {
    std::queue<group_id> id_queue;
    std::vector<group_ptr> output;
    for(auto& group : group_data_map) {
        if(group.second->indegree == 0) id_queue.push(group.first);
    }

    while(!id_queue.empty()) {
        auto new_id = id_queue.front();
        auto& group_data = group_data_map[new_id];
        group_ptr output_group(new group);
        output_group->members = std::move(group_data->functions);
        id_queue.pop();

        for(auto& adjacent_group : group_data->adjacency_list) {
            if(--group_data_map[adjacent_group]->indegree == 0)
                id_queue.push(adjacent_group);
        }

        output.push_back(std::move(output_group));
    }

    return output;
}

These four steps are used in compute_order:

106
107
108
109
110
111
112
113
114
std::vector<group_ptr> function_graph::compute_order() {
    std::set<edge> transitive_edges = compute_transitive_edges();
    std::map<function, group_id> group_ids;
    std::map<group_id, data_ptr> group_data_map;

    create_groups(transitive_edges, group_ids, group_data_map);
    create_edges(group_ids, group_data_map);
    return generate_order(group_ids, group_data_map);
}

Let’s now look at the remaining two public definitions. First comes add_function, which creates an adjacency list for the function to be inserted (if one does not already exist), and returns a reference to the resulting list:

92
93
94
95
96
97
98
99
std::set<function>& function_graph::add_function(const function& f) {
    auto adjacency_list_it = adjacency_lists.find(f);
    if(adjacency_list_it != adjacency_lists.end()) {
        return adjacency_list_it->second;
    } else {
        return adjacency_lists[f] = { };
    }
}

We use this in add_edge, which straightforwardly creates an edge between two functions:

101
102
103
104
void function_graph::add_edge(const function& from, const function& to) {
    add_function(from).insert(to);
    edges.insert({ from, to });
}

With this, we can now properly order our typechecking. However, we are just getting started: there are still numerous changes we need to make to get our compiler to behave as we desire.

The first change is the least relevant, but will help clean up our code base in the presence of polymorphism: we will get rid of resolve, in both definitions and AST nodes. The reasons for this are twofold. First, [note: Recall that ast_case needs this information to properly account for the changes to the stack from when data is unpacked. ] This means that all the rest of the infrastructure we’ve written around preserving types is somewhat pointless. Second, when we call resolve, we’d now have to distinguish between type variables captured by “forall” and actual, undefined variables. That’s a lot of wasted work! To replace the now-removed type field, we make ast_case include a new member, input_type, which stores the type of the thing between case and of. Since ast_case requires its type to be a data type at the time of typechecking, we no longer need to resolve anything.

Next, we need to work in a step geared towards finding function calls (to determine dependencies). As we have noted in part 6, it’s pretty easy to tell apart calls to global functions from “local” ones. If we see that a variable was previously bound (perhaps as a function argument, or by a pattern in a case expression), we know for sure that it is not a global function call. Otherwise, if the variable isn’t bound anywhere in the function definition (it’s a free variable), it must refer to a global function. Then, we can traverse the function body, storing variables that are bound (but only within their scope), and noting references to variables we haven’t yet seen. To implement this, we can use a linked list, where each node refers to a particular scope, points to the scope enclosing it, and contains a list of variables…

Wait a minute, this is identical to type_env! There’s no reason to reimplement all this. But then, another question arises: do we throw away the type_env generated by the dependency-searching step? It seems wasteful, since we will eventually repeat this same work. Rather, we’ll re-use the same type_env instances in both this new step and typecheck. To do this, we will now store a pointer to a type_env in every AST node, and set this pointer during our first traversal of the tree. Indeed, this makes our type_env more like a symbol table. With this change, our new dependency-finding step will be implemented by the find_free function with the following signature:

void ast::find_free(type_mgr& mgr, type_env_ptr& env, std::set<std::string>& into);

Let’s take a look at how this will be implemented. The simplest case (as usual) is ast_int:

16
17
18
void ast_int::find_free(type_mgr& mgr, type_env_ptr& env, std::set<std::string>& into) {
    this->env = env;
}

In this case, we associate the type_env with the node, but don’t do anything else: a number is not a variable. A more interesting case is ast_lid:

33
34
35
36
void ast_lid::find_free(type_mgr& mgr, type_env_ptr& env, std::set<std::string>& into) {
    this->env = env;
    if(env->lookup(id) == nullptr) into.insert(id);
}

If a lowercase variable has not yet been bound to something, it’s free, and we store it. Somewhat counterintuitively, ast_uid behaves differently:

54
55
56
void ast_uid::find_free(type_mgr& mgr, type_env_ptr& env, std::set<std::string>& into) {
    this->env = env;
}

We don’t allow uppercase variables to be bound to anything outside of data type declarations, so we don’t care about uppercase free variables. Next up is ast_binop:

73
74
75
76
77
void ast_binop::find_free(type_mgr& mgr, type_env_ptr& env, std::set<std::string>& into) {
    this->env = env;
    left->find_free(mgr, env, into);
    right->find_free(mgr, env, into);
}

A binary operator can have free variables in the subexpressions on the left and on the right, and the above implementation reflects that. This is identical to the implementation of ast_app:

109
110
111
112
113
void ast_app::find_free(type_mgr& mgr, type_env_ptr& env, std::set<std::string>& into) {
    this->env = env;
    left->find_free(mgr, env, into);
    right->find_free(mgr, env, into);
}

Finally, ast_case requires the most complicated function (as usual):

142
143
144
145
146
147
148
149
150
void ast_case::find_free(type_mgr& mgr, type_env_ptr& env, std::set<std::string>& into) {
    this->env = env;
    of->find_free(mgr, env, into);
    for(auto& branch : branches) {
        type_env_ptr new_env = type_scope(env);
        branch->pat->insert_bindings(mgr, new_env);
        branch->expr->find_free(mgr, new_env, into);
    }
}

The type_scope function replaces the type_env::scope method, which cannot (without significant effort) operate on smart pointers. Importantly, we are using a new pattern method here, insert_bindings. This is because we split “introducing variables” and “typechecking variables” into two steps for patterns, as well. The implementation of insert_bindings for pattern_var is as follows:

230
231
232
void pattern_var::insert_bindings(type_mgr& mgr, type_env_ptr& env) const {
    env->bind(var, mgr.new_type());
}

A variable pattern always introduces the variable it is made up of. On the other hand, the implementation for pattern_constr is as follows:

245
246
247
248
249
void pattern_constr::insert_bindings(type_mgr& mgr, type_env_ptr& env) const {
    for(auto& param : params) {
        env->bind(param, mgr.new_type());
    }
}

All the variables of the pattern are placed into the environment. For now, we don’t worry about arity; this is the job of typechecking.

These changes are reflected in all instances of our typecheck function. First of all, typecheck no longer needs to receive a type_env parameter, since each tree node has a type_env_ptr. Furthermore, typecheck should no longer call bind, since this was already done by find_free. For example, ast_lid::typecheck will now use env::lookup:

38
39
40
type_ptr ast_lid::typecheck(type_mgr& mgr) {
    return env->lookup(id)->instantiate(mgr);
}

Don’t worry about instantiate for now; that’s coming up. Similarly to ast_lid, ast_case::typecheck will no longer introduce new bindings, but unify existing types via the pattern:

152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
type_ptr ast_case::typecheck(type_mgr& mgr) {
    type_var* var;
    type_ptr case_type = mgr.resolve(of->typecheck(mgr), var);
    type_ptr branch_type = mgr.new_type();

    for(auto& branch : branches) {
        branch->pat->typecheck(case_type, mgr, branch->expr->env);
        type_ptr curr_branch_type = branch->expr->typecheck(mgr);
        mgr.unify(branch_type, curr_branch_type);
    }

    input_type = mgr.resolve(case_type, var);
    if(!dynamic_cast<type_data*>(input_type.get())) {
        throw type_error("attempting case analysis of non-data type");
    }

    return branch_type;
}

The above implementation uses another new pattern method, typecheck. This method inherits the type checking functionality previously contained in pattern::match. Here’s the implementation for pattern_var:

234
235
236
void pattern_var::typecheck(type_ptr t, type_mgr& mgr, type_env_ptr& env) const {
    mgr.unify(env->lookup(var)->instantiate(mgr), t);
}

And here’s the implementation for pattern_constr:

251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
void pattern_constr::typecheck(type_ptr t, type_mgr& mgr, type_env_ptr& env) const {
    type_ptr constructor_type = env->lookup(constr)->instantiate(mgr);
    if(!constructor_type) {
        throw type_error(std::string("pattern using unknown constructor ") + constr);
    }

    for(auto& param : params) {
        type_arr* arr = dynamic_cast<type_arr*>(constructor_type.get());
        if(!arr) throw type_error("too many parameters in constructor pattern");

        mgr.unify(env->lookup(param)->instantiate(mgr), arr->left);
        constructor_type = arr->right;
    }

    mgr.unify(t, constructor_type);
}

So far, so good. However, for all of this to reach the main typechecking code, not only ast subclasses need to be updated, but also the definitions. Here things get more complicated, because definition_data and definition_defn are growing more and more apart. Previously, we had two typechecking steps: typecheck_first (which registered function names into the environment) and typecheck_second (which performed the actual typechecking). However, not only are these names not informative, but the algorithms for typechecking the two types of definition will soon have different numbers of “major” steps.

Let’s take a look at how we would typecheck data types. I propose the following steps:

  1. Iterate all declared data types, storing them into some kind of “known” list.
  2. Iterate again, and for each constructor of a type, verify that it refers to “known” types. Add valid constructors to the global environment as functions.

We don’t currently verify that types are “known”; A user could declare a list of Floobs, and never say what a Floob is. [note: Curiously, this flaw did lead to some valid programs being rejected. Since we had no notion of a "known" type, whenever data type constructors were created, every argument type was marked a "base" type; see this line if you're curious. This would cause pattern matching to fail on the tail of a list, with the "attempt to pattern match on non-data argument" error. ] (good luck constructing a value of a non-existent type), but a mature compiler should prevent this from happening.

On the other hand, here are the steps for function definitions:

  1. Find the free variables of each function to create the ordered list of groups as described above.
  2. Within each group, insert a general function type (like abca \rightarrow b \rightarrow c) into the environment for each function.
  3. Within each group (in the same pass) run typechecking (including polymorphism, using the rules as described above).

The two types of definitions further diverge when generating LLVM and compiling to G-machine instructions: data types immediately construct and insert their functions, and do not emit G-machine instructions, while functions generate G-machine instructions, declare prototypes, and emit LLVM in three distinct phases. Overall, there are virtually no similarities between the two data type declarations, and any inheritance of common functions starts to appear somewhat forced. To address this, we remove the definition class altogether, and sever the relationship between definition_data and definition_defn. The two now look as follows:

23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
struct definition_defn {
    std::string name;
    std::vector<std::string> params;
    ast_ptr body;

    type_env_ptr env;
    type_env_ptr var_env;
    std::set<std::string> free_variables;
    type_ptr full_type;
    type_ptr return_type;

    std::vector<instruction_ptr> instructions;

    llvm::Function* generated_function;

    definition_defn(std::string n, std::vector<std::string> p, ast_ptr b)
        : name(std::move(n)), params(std::move(p)), body(std::move(b)) {

    }

    void find_free(type_mgr& mgr, type_env_ptr& env);
    void insert_types(type_mgr& mgr);
    void typecheck(type_mgr& mgr);
    void compile();
    void declare_llvm(llvm_context& ctx);
    void generate_llvm(llvm_context& ctx);
};

using definition_defn_ptr = std::unique_ptr<definition_defn>;

struct definition_data {
    std::string name;
    std::vector<constructor_ptr> constructors;

    type_env_ptr env;

    definition_data(std::string n, std::vector<constructor_ptr> cs)
        : name(std::move(n)), constructors(std::move(cs)) {}

    void insert_types(type_mgr& mgr, type_env_ptr& env);
    void insert_constructors() const;
    void generate_llvm(llvm_context& ctx);
};

using definition_data_ptr = std::unique_ptr<definition_data>;

In definition_defn, the functions are arranged as follows:

In definition_data, the steps are significantly simpler:

While the last three methods of definition_defn remain unchanged save for the name, the implementations of the first three see some updates. First is find_free:

12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
void definition_defn::find_free(type_mgr& mgr, type_env_ptr& env) {
    this->env = env;

    var_env = type_scope(env);
    return_type = mgr.new_type();
    full_type = return_type;

    for(auto it = params.rbegin(); it != params.rend(); it++) {
        type_ptr param_type = mgr.new_type();
        full_type = type_ptr(new type_arr(param_type, full_type));
        var_env->bind(*it, param_type);
    }

    body->find_free(mgr, var_env, free_variables);
}

First, to make sure we don’t pollute the global scope with function parameters, find_free creates a new environment var_env. Then, it stores into this new environment the function parameters, ensuring that the parameters of a function aren’t marked “free”. Concurrently, find_free constructs the “general” function type (used by insert_types). Once all the arguments have been bound, definition_defn::find_free makes a call to ast::find_free, which does the work of actually finding free variables.

Since the function type is created by find_free, insert_types has very little to do:

28
29
30
void definition_defn::insert_types(type_mgr& mgr) {
    env->bind(name, full_type);
}

Finally, typecheck, which no longer has to bind the function arguments to new types, is also fairly simple:

32
33
34
35
void definition_defn::typecheck(type_mgr& mgr) {
    type_ptr body_type = body->typecheck(mgr);
    mgr.unify(return_type, body_type);
}

Let’s move on to data types. In order to implement definition_data::insert_types, we need to store somewhere a list of all the valid type names. We do this by adding a new type_names field to type_env, and implementing the corresponding methods lookup_type:

11
12
13
14
15
16
type_ptr type_env::lookup_type(const std::string& name) const {
    auto it = type_names.find(name);
    if(it != type_names.end()) return it->second;
    if(parent) return parent->lookup_type(name);
    return nullptr;
}

And bind_type:

26
27
28
29
void type_env::bind_type(const std::string& type_name, type_ptr t) {
    if(lookup_type(type_name) != nullptr) throw 0;
    type_names[type_name] = t;
}

Note in the above snippets that we disallow redeclaring type names; declaring two data types (or other types) with the same name in our language will not be valid. In insert_types, we create a new data type and store it in the environment:

59
60
61
62
void definition_data::insert_types(type_mgr& mgr, type_env_ptr& env) {
    this->env = env;
    env->bind_type(name, type_ptr(new type_data(name)));
}

We then update insert_constructors to query the environment when creating constructor types, rather than blindly using new type_base(...) like before:

64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
void definition_data::insert_constructors() const {
    type_ptr return_type = env->lookup_type(name);
    type_data* this_type = static_cast<type_data*>(return_type.get());
    int next_tag = 0;

    for(auto& constructor : constructors) {
        constructor->tag = next_tag;
        this_type->constructors[constructor->name] = { next_tag++ };

        type_ptr full_type = return_type;
        for(auto it = constructor->types.rbegin(); it != constructor->types.rend(); it++) {
            type_ptr type = env->lookup_type(*it);
            if(!type) throw 0;
            full_type = type_ptr(new type_arr(type, full_type));
        }

        env->bind(constructor->name, full_type);
    }
}

The separation of data and function definitions must be reconciled with code going back as far as the parser. While previously, we populated a single, global vector of definitions called program, we can no longer do that. Instead, we’ll split our program into two maps, one for data types and one for functions. We use maps for convenience: the groups generated by our function graph refer to functions by name, and it would be nice to quickly look up the data the names refer to. Rather than returning such maps, we change our semantic actions to simply insert new data into one of two global maps. Below is a snippet that includes all the changes:

39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
%type <std::vector<std::string>> lowercaseParams uppercaseParams
%type <std::vector<branch_ptr>> branches
%type <std::vector<constructor_ptr>> constructors
%type <ast_ptr> aAdd aMul case app appBase
%type <definition_data_ptr> data 
%type <definition_defn_ptr> defn
%type <branch_ptr> branch
%type <pattern_ptr> pattern
%type <constructor_ptr> constructor

%start program

%%

program
    : definitions { }
    ;

definitions
    : definitions definition { }
    | definition { }
    ;

definition
    : defn { auto name = $1->name; defs_defn[name] = std::move($1); }
    | data { auto name = $1->name; defs_data[name] = std::move($1); }
    ;

Note that program and definitions no longer have a type, and that data and defn have been changed to return definition_data_ptr and definition_defn_ptr, respectively. This necessitates changes to our main file. First of all, we declare the two new maps we hope to receive from Bison:

24
25
extern std::map<std::string, definition_data_ptr> defs_data;
extern std::map<std::string, definition_defn_ptr> defs_defn;

We then change all affected functions, which in many cases amounts to splitting the program parameter into defs_data and defs_defn parameters. We also make other, largely mechanical changes: code iterating over definitions now requires the use of second to refer to the value stored in the map, and LLVM generation now needs to separately process the two different types of definitions. The biggest change occurs in typecheck_program, which not only undergoes all the aforementioned modifications, but is also updated to use topological ordering:

27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
void typecheck_program(
        const std::map<std::string, definition_data_ptr>& defs_data,
        const std::map<std::string, definition_defn_ptr>& defs_defn,
        type_mgr& mgr, type_env_ptr& env) {
    type_ptr int_type = type_ptr(new type_base("Int")); 
    env->bind_type("Int", int_type);

    type_ptr binop_type = type_ptr(new type_arr(
                int_type,
                type_ptr(new type_arr(int_type, int_type))));
    env->bind("+", binop_type);
    env->bind("-", binop_type);
    env->bind("*", binop_type);
    env->bind("/", binop_type);

    for(auto& def_data : defs_data) {
        def_data.second->insert_types(mgr, env);
    }
    for(auto& def_data : defs_data) {
        def_data.second->insert_constructors();
    }

    function_graph dependency_graph;

    for(auto& def_defn : defs_defn) {
        def_defn.second->find_free(mgr, env);
        dependency_graph.add_function(def_defn.second->name);

        for(auto& dependency : def_defn.second->free_variables) {
            if(defs_defn.find(dependency) == defs_defn.end())
                throw 0;
            dependency_graph.add_edge(def_defn.second->name, dependency);
        }
    }

    std::vector<group_ptr> groups = dependency_graph.compute_order();
    for(auto it = groups.rbegin(); it != groups.rend(); it++) {
        auto& group = *it;
        for(auto& def_defnn_name : group->members) {
            auto& def_defn = defs_defn.find(def_defnn_name)->second;
            def_defn->insert_types(mgr);
        }
        for(auto& def_defnn_name : group->members) {
            auto& def_defn = defs_defn.find(def_defnn_name)->second;
            def_defn->typecheck(mgr);
        }
        for(auto& def_defnn_name : group->members) {
            env->generalize(def_defnn_name, mgr);
        }
    }

    for(auto& pair : env->names) {
        std::cout << pair.first << ": ";
        pair.second->print(mgr, std::cout);
        std::cout << std::endl;
    }
}

The above code uses the yet-unexplained generalize method. What’s going on?

Observe that the Var rule of the Hindley-Milner type system says that a variable xx can have a polytype in the environment Γ\Gamma. Our type_ptr can only represent monotypes, so we must change what type_env associates with names to a new struct for representing polytypes, which we will call type_scheme. The type_scheme struct, just like the formal definition of a polytype, contains zero or more “forall”-quantified type variables, followed by a monotype which may use these variables:

17
18
19
20
21
22
23
24
25
26

struct type_scheme {
    std::vector<std::string> forall;
    type_ptr monotype;

    type_scheme(type_ptr type) : forall(), monotype(std::move(type)) {}

    void print(const type_mgr& mgr, std::ostream& to) const;
    type_ptr instantiate(type_mgr& mgr) const;
};

The type_scheme::instantiate method is effectively an implementation of the special case of the Inst rule, in which a polytype is specialized to a monotype. Since the App and Case rules only use monotypes, we’ll be using this special case a lot. We implement this method as follows:

34
35
36
37
38
39
40
41
type_ptr type_scheme::instantiate(type_mgr& mgr) const {
    if(forall.size() == 0) return monotype;
    std::map<std::string, type_ptr> subst;
    for(auto& var : forall) {
        subst[var] = mgr.new_type();
    }
    return substitute(mgr, subst, monotype);
}

In the above code, if the type scheme represents a monotype (i.e., it has no quantified variables), we simply return that monotype. Otherwise, we must perform a substitution, replacing “forall”-quantified variables with fresh type parameters to be determined (we will never determine a single type for any of the quantified variables, since they are specifically meant to represent any type). We build a substitution map, which assigns to each quantified type variable a corresponding “fresh” type, and then create a new type with with the substitution applied using substitute, which is implemented as follows:

18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
type_ptr substitute(const type_mgr& mgr, const std::map<std::string, type_ptr>& subst, const type_ptr& t) {
    type_var* var;
    type_ptr resolved = mgr.resolve(t, var);
    if(var) {
        auto subst_it = subst.find(var->name);
        if(subst_it == subst.end()) return resolved;
        return subst_it->second;
    } else if(type_arr* arr = dynamic_cast<type_arr*>(t.get())) {
        auto left_result = substitute(mgr, subst, arr->left);
        auto right_result = substitute(mgr, subst, arr->right);
        if(left_result == arr->left && right_result == arr->right) return t;
        return type_ptr(new type_arr(left_result, right_result));
    }
    return t;
}

In principle, the function is fairly simple: if the current type is equivalent to a quantified type, we return the corresponding “fresh” type. If, on the other hand, the type represents a function, we perform a substitution in the function’s input and output types. This method avoids creating new types where possible; a new type is only created if a function’s input or output type is changed by a substitution (in which case, the function itself is changed by the substitution). In all other cases, substitution won’t do anything, so we just return the original type.

Now it is a bit more clear why we saw instantiate in a code snippet some time ago; to compute a monotype for a variable reference, we must take into account the possibility that the variable has a polymorphic type, which needs to be specialized (potentially differently in every occurrence of the variable).

When talking about our new typechecking algorithm, we mentioned using Gen to sprinkle polymorphism into our program. If it can, Gen will add free variables in a type to the “forall” quantifier at the front, making that type polymorphic. We implement this using a new generalize method added to the type_env, which (as per convention) generalizes the type of a given variable as much as possible:

31
32
33
34
35
36
37
38
39
40
41
void type_env::generalize(const std::string& name, type_mgr& mgr) {
    auto names_it = names.find(name);
    if(names_it == names.end()) throw 0;
    if(names_it->second->forall.size() > 0) throw 0;

    std::set<std::string> free_variables;
    mgr.find_free(names_it->second->monotype, free_variables);
    for(auto& free : free_variables) {
        names_it->second->forall.push_back(free);
    }
}

For now, we disallow types to be generalized twice, and we naturally disallow generalizing types of nonexistent variables. If neither of those things occurs, we find all the free variables in the variable’s current type using a new method called type_mgr::find_free, and put them into the “forall” quantifier. type_mgr::find_free is implemented as follows:

138
139
140
141
142
143
144
145
146
147
148
void type_mgr::find_free(const type_ptr& t, std::set<std::string>& into) const {
    type_var* var;
    type_ptr resolved = resolve(t, var);

    if(var) {
        into.insert(var->name);
    } else if(type_arr* arr = dynamic_cast<type_arr*>(resolved.get())) {
        find_free(arr->left, into);
        find_free(arr->right, into);
    }
}

The above code is fairly straightforward; if a type is a variable that is not yet bound to anything, it is free; if the type is a function, we search for free variables in its input and output types; otherwise, the type has no free variables.

Finally, we have made the necessary changes. Let’s test it out with the example from the beginning:

data Bool = { True, False }
defn if c t e = {
    case c of {
        True -> { t }
        False -> { e }
    }
}
defn main = { if (if True False True) 11 3 }

Running it, we get the output:

3

Hooray!

While this is a major success, we are not yet done. Although our functions can now have polymorphic types, the same cannot be said for our data types! We want to have lists of integers and lists of booleans, without having to duplicate any code! While this also falls into the category of polymorphism, this post has already gotten very long, and we will return to it in part 11. Once we’re done with that, I still intend to go over let/in expressions, lambda functions, and Input/Output together with strings.