In part 10, we managed to get our compiler to accept functions that were polymorphically typed. However, a piece of the puzzle is still missing: while our functions can handle values of different types, the same cannot be said for our data types. This means that we cannot construct data structures that can contain arbitrary types. While we can define and use a list of integers, if we want to also have a list of booleans, we must copy all of our constructors and define a new data type. Worse, not only do we have to duplicate the constructors, but also all the functions that operate on the list. As far as our compiler is concerned, a list of integers and a list of booleans are entirely different beasts, and cannot be operated on by the same code.

To make polymorphic data types possible, we must extend our language (and type system) a little. We will now allow for something like this:

data List a = { Nil, Cons a List }


In the above snippet, we are no longer declaring a single type, but a collection of related types, parameterized by a type a. Any type can take the place of a to get a list containing that type of element. Then, List Int is a type, as is List Bool and List (List Int). The constructors in the snippet also get polymorphic types:

$\text{Nil} : \forall a \; . \; \text{List} \; a \\ \text{Cons} : \forall a \; . \; a \rightarrow \text{List} \; a \rightarrow \text{List} \; a$

When you call Cons, the type of the resulting list varies with the type of element you pass in. The empty list Nil is a valid list of any type, since, well, it’s empty.

Let’s talk about List itself, now. I suggest that we ponder the following table:

$\text{List}$ $\text{Cons}$
$\text{List}$ is not a type; it must be followed up with arguments, like $\text{List} \; \text{Int}$. $\text{Cons}$ is not a list; it must be followed up with arguments, like $\text{Cons} \; 3 \; \text{Nil}$.
$\text{List} \; \text{Int}$ is in its simplest form. $\text{Cons} \; 3 \; \text{Nil}$ is in its simplest form.
$\text{List} \; \text{Int}$ is a type. $\text{Cons} \; 3 \; \text{Nil}$ is a value of type $\text{List} \; \text{Int}$.

I hope that the similarities are quite striking. I claim that List is quite similar to a constructor Cons, except that it occurs in a different context: whereas Cons is a way to create values, List is a way to create types. Indeed, while we call Cons a constructor, it’s typical to call List a type constructor. We know that Cons is a function which assigns to values (like 3 and Nil) other values (like Cons 3 Nil, or [3] for short). In a similar manner, List can be thought of as a function that assigns to types (like Int) other types (like List Int). We can even claim that it has a type:

$\text{List} : \text{Type} \rightarrow \text{Type}$

[note: When your type constructors take as input not only other types but also values such as 3, you've ventured into the territory of dependent types. This is a significant step up in complexity from what we'll be doing in this series. If you're interested, check out Idris (if you want to know about dependent types for functional programming) or Coq (to see how propositions and proofs can be encoded in a dependently typed language). ] our type constructors will only take zero or more types as input, and produce a type as output. In this case, writing $\text{Type}$ becomes quite repetitive, and we will adopt the convention of writing $*$ instead. The types of such constructors are called kinds. Let’s look at a few examples, just to make sure we’re on the same page:

• The kind of $\text{Bool}$ is $*$: it does not accept any type arguments, and is a type in its own right.
• The kind of $\text{List}$ is $*\rightarrow *$: it takes one argument (the type of the things inside the list), and creates a type from it.
• If we define a pair as data Pair a b = { MkPair a b }, then its kind is $* \rightarrow * \rightarrow *$, because it requires two parameters.

As one final observation, we note that effectively, all we’re doing is tracking the arity of the constructor type.

Let’s now enumerate all the possible forms that (mono)types can take in our system:

1. A type can be a placeholder, like $a$, $b$, etc.
2. A type can be a type constructor, applied to [note: It is convenient to treat regular types (like $\text{Bool}$) as type constructors of arity 0 (that is, type constructors with kind $*$). In effect, they take zero arguments and produce types (themselves). ] such as $\text{List} \; \text{Int}$ or $\text{Bool}$.
3. A function from one type to another, like $\text{List} \; a \rightarrow \text{Int}$.

Polytypes (type schemes) in our system can be all of the above, but may also include a “forall” quantifier at the front, generalizing the type (like $\forall a \; . \; \text{List} \; a \rightarrow \text{Int}$).

Let’s start implementing all of this. Why don’t we start with the change to the syntax of our language? We have complicated the situation quite a bit. Let’s take a look at the old grammar for data type declarations (this is going back as far as part 2). Here, $L_D$ is the nonterminal for the things that go between the curly braces of a data type declaration, $D$ is the nonterminal representing a single constructor definition, and $L_U$ is a list of zero or more uppercase variable names:

\begin{aligned} L_D & \rightarrow D \; , \; L_D \\ L_D & \rightarrow D \\ D & \rightarrow \text{upperVar} \; L_U \\ L_U & \rightarrow \text{upperVar} \; L_U \\ L_U & \rightarrow \epsilon \end{aligned}

This grammar was actually too simple even for our monomorphically typed language! Since functions are not represented using a single uppercase variable, it wasn’t possible for us to define constructors that accept as arguments anything other than integers and user-defined data types. Now, we also need to modify this grammar to allow for constructor applications (which can be nested). To do all of these things, we will define a new nonterminal, $Y$, for types:

\begin{aligned} Y & \rightarrow N \; \rightarrow" Y \\ Y & \rightarrow N \end{aligned}

We make it right-recursive (because the $\rightarrow$ operator is right-associative). Next, we define a nonterminal for all types except those constructed with the arrow, $N$.

\begin{aligned} N & \rightarrow \text{upperVar} \; L_Y \\ N & \rightarrow \text{typeVar} \\ N & \rightarrow ( Y ) \end{aligned}

The first of the above rules allows a type to be a constructor applied to zero or more arguments (generated by $L_Y$). The second rule allows a type to be a placeholder type variable. Finally, the third rule allows for any type (including functions, again) to occur between parentheses. This is so that higher-order functions, like $(a \rightarrow b) \rightarrow a \rightarrow a$, can be represented.

Unfortunately, the definition of $L_Y$ is not as straightforward as we imagine. We could define it as just a list of $Y$ nonterminals, but this would make the grammar ambigous: something like List Maybe Int could be interpreted as “List, applied to types Maybe and Int”, and “List, applied to type Maybe Int”. To avoid this, we define a “type list element” $Y’$, which does not take arguments:

\begin{aligned} Y' & \rightarrow \text{upperVar} \\ Y' & \rightarrow \text{lowerVar} \\ Y' & \rightarrow ( Y ) \end{aligned}

We then make $L_Y$ a list of $Y’$:

\begin{aligned} L_Y & \rightarrow Y' \; L_Y \\ L_Y & \rightarrow \epsilon \end{aligned}

Finally, we update the rules for the data type declaration, as well as for a single constructor. In these new rules, we use $L_T$ to mean a list of type variables. The rules are as follows:

\begin{aligned} T & \rightarrow \text{data} \; \text{upperVar} \; L_T = \{ L_D \} \\ D & \rightarrow \text{upperVar} \; L_Y \\ \end{aligned}

Those are all the changes we have to make to our grammar. Let’s now move on to implementing the corresponding data structures. We define a new family of structs, which represent types as they are received from the parser. These differ from regular types in that they do not necessarily represent valid types; validating types requires two passes, whereas parsing is done in a single pass. We can define our parsed types as follows:

#pragma once
#include
#include
#include
#include "type_env.hpp"

struct parsed_type {
virtual type_ptr to_type(
const std::set<std::string>& vars,
const type_env& env) const = 0;
};

using parsed_type_ptr = std::unique_ptr<parsed_type>;

struct parsed_type_app : parsed_type {
std::string name;
std::vector<parsed_type_ptr> arguments;

parsed_type_app(
std::string n,
std::vector<parsed_type_ptr> as)
: name(std::move(n)), arguments(std::move(as)) {}

type_ptr to_type(const std::set<std::string>& vars, const type_env& env) const;
};

struct parsed_type_var : parsed_type {
std::string var;

parsed_type_var(std::string v) : var(std::move(v)) {}

type_ptr to_type(const std::set<std::string>& vars, const type_env& env) const;
};

struct parsed_type_arr : parsed_type {
parsed_type_ptr left;
parsed_type_ptr right;

parsed_type_arr(parsed_type_ptr l, parsed_type_ptr r)
: left(std::move(l)), right(std::move(r)) {}

type_ptr to_type(const std::set<std::string>& vars, const type_env& env) const;
};


We define the conversion method to_type, which requires a set of type variables that are allowed to occur within a parsed type (which are the variables specified on the left of the = in the data type declaration syntax), and the environment in which to look up the arities of any type constructors. The implementation is as follows:

#include "parsed_type.hpp"
#include "type.hpp"
#include "type_env.hpp"

type_ptr parsed_type_app::to_type(
const std::set<std::string>& vars,
const type_env& e) const {
auto parent_type = e.lookup_type(name);
if(parent_type == nullptr) throw 0;
type_base* base_type;
if(!(base_type = dynamic_cast<type_base*>(parent_type.get()))) throw 0;
if(base_type->arity != arguments.size()) throw 0;

type_app* new_app = new type_app(std::move(parent_type));
type_ptr to_return(new_app);
for(auto& arg : arguments) {
new_app->arguments.push_back(arg->to_type(vars, e));
}
}

type_ptr parsed_type_var::to_type(
const std::set<std::string>& vars,
const type_env& e) const {
if(vars.find(var) == vars.end()) throw 0;
return type_ptr(new type_var(var));
}

type_ptr parsed_type_arr::to_type(
const std::set<std::string>& vars,
const type_env& env) const {
auto new_left = left->to_type(vars, env);
auto new_right = right->to_type(vars, env);
return type_ptr(new type_arr(std::move(new_left), std::move(new_right)));
}


Note that this definition requires a new type subclass, type_app, which represents type application. Unlike parsed_type_app, it stores a pointer to the type constructor being applied, rather than its name. This helps validate the type (by making sure the parsed type’s name refers to an existing type constructor), and lets us gather information like which constructors the resulting type has. We define this new type as follows:

 70 71 72 73 74 75 76 77 78  struct type_app : public type { type_ptr constructor; std::vector arguments; type_app(type_ptr c) : constructor(std::move(c)) {} void print(const type_mgr& mgr, std::ostream& to) const; }; 

With our new data structures in hand, we can now update the grammar in our Bison file. First things first, we’ll add the type parameters to the data type definition:

 127 128 129 130  data : DATA UID lowercaseParams EQUAL OCURLY constructors CCURLY { $$= definition_data_ptr(new definition_data(std::move(2), std::move(3), std::move(6))); } ; Next, we add the new grammar rules we came up with:  138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163  constructor : UID typeList {$$ = constructor_ptr(new constructor(std::move($1), std::move($2))); } ; type : nonArrowType ARROW type { $$= parsed_type_ptr(new parsed_type_arr(std::move(1), std::move(3))); } | nonArrowType {$$ = std::move($1); } ; nonArrowType : UID typeList { $$= parsed_type_ptr(new parsed_type_app(std::move(1), std::move(2))); } | LID {$$ = parsed_type_ptr(new parsed_type_var(std::move($1))); } | OPAREN type CPAREN { $$= std::move(2); } ; typeListElement : OPAREN type CPAREN {$$ = std::move($2); } | UID { $$= parsed_type_ptr(new parsed_type_app(std::move(1), {})); } | LID {$$ = parsed_type_ptr(new parsed_type_var(std::move($1))); } ; typeList : %empty { $$= std::vector(); } | typeList typeListElement {$$ = std::move($1);$$.push_back(std::move($2)); } ;

Note in the above rules that even for typeListElement, which can never be applied to any arguments, we still attach a parsed_type_app as the semantic value. This is for consistency; it’s easier to view all types in our system as applications to zero or more arguments, than to write coercions from non-applied types to types applied to zero arguments.

Finally, we define the types for these new rules at the top of the file:

 43 44  %type > typeList %type type nonArrowType typeListElement

This concludes our work on the parser, but opens up a whole can of worms elsewhere. First of all, now that we introduced a new type subclass, we must ensure that type unification still works as intended. We therefore have to adjust the type_mgr::unify method:

 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132  void type_mgr::unify(type_ptr l, type_ptr r) { type_var *lvar, *rvar; type_arr *larr, *rarr; type_base *lid, *rid; type_app *lapp, *rapp; l = resolve(l, lvar); r = resolve(r, rvar); if(lvar) { bind(lvar->name, r); return; } else if(rvar) { bind(rvar->name, l); return; } else if((larr = dynamic_cast(l.get())) && (rarr = dynamic_cast(r.get()))) { unify(larr->left, rarr->left); unify(larr->right, rarr->right); return; } else if((lid = dynamic_cast(l.get())) && (rid = dynamic_cast(r.get()))) { if(lid->name == rid->name && lid->arity == rid->arity) return; } else if((lapp = dynamic_cast(l.get())) && (rapp = dynamic_cast(r.get()))) { unify(lapp->constructor, rapp->constructor); auto left_it = lapp->arguments.begin(); auto right_it = rapp->arguments.begin(); while(left_it != lapp->arguments.end() && right_it != rapp->arguments.end()) { unify(*left_it, *right_it); left_it++, right_it++; } return; } throw unification_error(l, r); } 

In the above snippet, we add a new if-statement that checks whether or not both types being unified are type applications, and if so, unifies their constructors and arguments. We also extend our type equality check to ensure that both the names and arities of types match [note: This is actually a pretty silly measure. Consider the following three propositions: 1) types are only declared at the top-level scope. 2) if a type is introduced, and another type with that name already exists, we throw an error. 3) for name equality to be insufficient, we need to have two declared types with the same name. Given these propositions, it will not be possible for us to declare two types that would confuse the name equality check. However, in the near future, these propositions may not all hold: if we allow let/in expressions to contain data type definitions, it will be possible to declare two types with the same name and arity (in different scopes), which would still confuse the check. In the future, if this becomes an issue, we will likely move to unique type identifiers. ] Note also the more basic fact that we added arity to our type_base, [note: You may be wondering, why did we add arity to base types, rather than data types? Although so far, our language can only create type constructors from data type definitions, it's possible (or even likely) that we will have polymorphic built-in types, such as the IO monad. To prepare for this, we will allow our base types to be type constructors too. ]

Jut as we change type_mgr::unify, we need to change type_mgr::find_free to include the new case of type_app. The adjusted function looks as follows:

 174 175 176 177 178 179 180 181 182 183 184 185 186 187  void type_mgr::find_free(const type_ptr& t, std::set& into) const { type_var* var; type_ptr resolved = resolve(t, var); if(var) { into.insert(var->name); } else if(type_arr* arr = dynamic_cast(resolved.get())) { find_free(arr->left, into); find_free(arr->right, into); } else if(type_app* app = dynamic_cast(resolved.get())) { find_free(app->constructor, into); for(auto& arg : app->arguments) find_free(arg, into); } } 

There is another adjustment that we have to make to our type code. Recall that we had code that implemented substitutions: replacing free variables with other types to properly implement our type schemes. There was a bug in that code, which becomes much more apparent when the substitution system is put under more pressure. Specifically, the bug was in how type variables were handled.

The old substitution code, when it found that a type variable had been bound to another type, always moved on to perform a substitution in that other type. This wasn’t really a problem then, since any type variables that needed to be substituted were guaranteed to be free (that’s why they were put into the “forall” quantifier). However, with our new system, we are using user-provided type variables (usually a, b, and so on), which have likely already been used by our compiler internally, and thus have been bound to something. That something is irrelevant to us: when we perform a substitution on a user-defined data type, we know that our a is free, and should be substitited. In short, precedence should be given to substituting type variables, rather than resolving them to what they are bound to.

To make this adjustment possible, we need to make substitute a method of type_manager, since it will now require an awareness of existing type bindings. Additionally, this method will now perform its own type resolution, checking if a type variable needs to be substitited between each step. The whole code is as follows:

 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165  type_ptr type_mgr::substitute(const std::map& subst, const type_ptr& t) const { type_ptr temp = t; while(type_var* var = dynamic_cast(temp.get())) { auto subst_it = subst.find(var->name); if(subst_it != subst.end()) return subst_it->second; auto var_it = types.find(var->name); if(var_it == types.end()) return t; temp = var_it->second; } if(type_arr* arr = dynamic_cast(temp.get())) { auto left_result = substitute(subst, arr->left); auto right_result = substitute(subst, arr->right); if(left_result == arr->left && right_result == arr->right) return t; return type_ptr(new type_arr(left_result, right_result)); } else if(type_app* app = dynamic_cast(temp.get())) { auto constructor_result = substitute(subst, app->constructor); bool arg_changed = false; std::vector new_args; for(auto& arg : app->arguments) { auto arg_result = substitute(subst, arg); arg_changed |= arg_result != arg; new_args.push_back(std::move(arg_result)); } if(constructor_result == app->constructor && !arg_changed) return t; type_app* new_app = new type_app(std::move(constructor_result)); std::swap(new_app->arguments, new_args); return type_ptr(new_app); } return t; } 

That’s all for types. Definitions, though, need some work. First of all, we’ve changed our parser to feed our constructor type a vector of parsed_type_ptr, rather than std::string. We therefore have to update constructor to receive and store this new vector:

 13 14 15 16 17 18 19 20  struct constructor { std::string name; std::vector types; int8_t tag; constructor(std::string n, std::vector ts) : name(std::move(n)), types(std::move(ts)) {} }; 

Similarly, definition_data itself needs to accept the list of type variables it has:

 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70  struct definition_data { std::string name; std::vector vars; std::vector constructors; type_env_ptr env; definition_data( std::string n, std::vector vs, std::vector cs) : name(std::move(n)), vars(std::move(vs)), constructors(std::move(cs)) {} void insert_types(type_env_ptr& env); void insert_constructors() const; void generate_llvm(llvm_context& ctx); }; 

We then look at definition_data::insert_constructors, which converts constructor instances to actual constructor functions. The code, which is getting pretty complciated, is as follows:

 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92  void definition_data::insert_constructors() const { type_ptr this_type_ptr = env->lookup_type(name); type_data* this_type = static_cast(this_type_ptr.get()); int next_tag = 0; std::set var_set; type_app* return_app = new type_app(std::move(this_type_ptr)); type_ptr return_type(return_app); for(auto& var : vars) { if(var_set.find(var) != var_set.end()) throw 0; var_set.insert(var); return_app->arguments.push_back(type_ptr(new type_var(var))); } for(auto& constructor : constructors) { constructor->tag = next_tag; this_type->constructors[constructor->name] = { next_tag++ }; type_ptr full_type = return_type; for(auto it = constructor->types.rbegin(); it != constructor->types.rend(); it++) { type_ptr type = (*it)->to_type(var_set, env); full_type = type_ptr(new type_arr(type, full_type)); } type_scheme_ptr full_scheme(new type_scheme(std::move(full_type))); full_scheme->forall.insert(full_scheme->forall.begin(), vars.begin(), vars.end()); env->bind(constructor->name, full_scheme); } } 

In the above snippet, we do the following things:

1. We first create a set of type variables that can occur in this type’s constructors (the same set that’s used by the to_type method we saw earlier). While doing this, we ensure a type variable is not used twice (this is not allowed), and add each type variable to the final return type (which is something like List a), in the order they occur.
2. When the variables have been gathered into a set, we iterate over all constructors, and convert them into types by calling to_type on their arguments, then assembling the resulting argument types into a function. This is not enough, however, [note: This is also not enough because without generalization using "forall", we are risking using type variables that have already been bound, or that will be bound. Even if a has not yet been used by the typechecker, it will be once the type manager generates its first type variable, and things will go south. If we, for some reason, wanted type constructors to be monomorphic (but generic, with type variables) we'd need to internally instnatiate fresh type variables for every user-defined type variable, and substitute them appropriately. ] as we have discussed above with $\text{Nil}$ and $\text{Cons}$. To accomodate for this, we also add all type variables to the “forall” quantifier of a new type scheme, whose monotype is our newly assembled function type. This type scheme is what we store as the type of the constructor.

This was the last major change we have to perform. The rest is cleanup: we have switched our system to dealing with type applications (sometimes with zero arguments), and we must bring the rest of the compiler up to speed with this change. For instance, we update ast_int to create a reference to an existing integer type during typechecking:

 20 21 22  type_ptr ast_int::typecheck(type_mgr& mgr) { return type_ptr(new type_app(env->lookup_type("Int"))); } 

Similarly, we update our code in typecheck_program to use type applications in the type for binary operations:

 31 32 33 34 35 36 37   type_ptr int_type = type_ptr(new type_base("Int")); env->bind_type("Int", int_type); type_ptr int_type_app = type_ptr(new type_app(int_type)); type_ptr binop_type = type_ptr(new type_arr( int_type_app, type_ptr(new type_arr(int_type_app, int_type_app)))); 

Finally, we update ast_case to unwrap type applications to get the needed constructor data from type_data. This has to be done in ast_case::typecheck, as follows:

 163 164 165 166 167 168   input_type = mgr.resolve(case_type, var); type_app* app_type; if(!(app_type = dynamic_cast(input_type.get())) || !dynamic_cast(app_type->constructor.get())) { throw type_error("attempting case analysis of non-data type"); } 

Additionally, a similar change needs to be made in ast_case::compile:

 174 175   type_app* app_type = dynamic_cast(input_type.get()); type_data* type = dynamic_cast(app_type->constructor.get()); 

That should be all! Let’s try an example:

data List a = { Nil, Cons a (List a) }
data Bool = { True, False }
defn length l = {
case l of {
Nil -> { 0 }
Cons x xs -> { 1 + length xs }
}
}
defn main = { length (Cons 1 (Cons 2 (Cons 3 Nil))) + length (Cons True (Cons False (Cons True Nil))) }


The output:

Result: 6


Yay! Not only were we able to define a list of any type, but our length function correctly determined the lengths of two lists of different types! Let’s try an example with the classic fold functions:

data List a = { Nil, Cons a (List a) }

defn map f l = {
case l of {
Nil -> { Nil }
Cons x xs -> { Cons (f x) (map f xs) }
}
}

defn foldl f b l = {
case l of {
Nil -> { b }
Cons x xs -> { foldl f (f b x) xs }
}
}

defn foldr f b l = {
case l of {
Nil -> { b }
Cons x xs -> { f x (foldr f b xs) }
}
}

defn list = { Cons 1 (Cons 2 (Cons 3 (Cons 4 Nil))) }

defn add x y = { x + y }
defn sum l = { foldr add 0 l }

defn skipAdd x y = { y + 1 }
defn length l = { foldr skipAdd 0 l }

defn main = { sum list + length list }


We expect the sum of the list [1,2,3,4] to be 10, and its length to be 4, so the sum of the two should be 14. And indeed, our program agrees:

Result: 14


Let’s do one more example, to test types that take more than one type parameter:

data Pair a b = { MkPair a b }

defn fst p = {
case p of {
MkPair a b -> { a }
}
}

defn snd p = {
case p of {
MkPair a b -> { b }
}
}

defn pair = { MkPair 1 (MkPair 2 3) }

defn main = { fst pair + snd (snd pair) }


Once again, the compiled program gives the expected result:

Result: 4


This looks good! We have added support for polymorphic data types to our compiler. We are now free to move on to let/in expressions, lambda functions, and Input/Output, as promised, starting with part 12 - let/in and lambdas!