“When will you learn? When will you learn that your actions have consequences?

So far, we’ve entirely ignored the problem of memory management. Every time that we need a new node for our growing graph, we simply ask for more memory from the runtime with malloc. But selfishly, even when we no longer require the memory allocated for a particular node, when that node is no longer in use, we do not free it. In fact, our runtime currently has no idea about which nodes are needed and which ones are ready to be discarded.

To convince ourselves that this is a problem, let’s first assess the extent of the damage. Consider the program from works3.txt:

data List = { Nil, Cons Int List }
defn length l = {
case l of {
Nil -> { 0 }
Cons x xs -> { 1 + length xs }
}
}
defn main = { length (Cons 1 (Cons 2 (Cons 3 Nil))) }


Compiling and running this program through valgrind, we get the following output:

==XXXX== LEAK SUMMARY:
==XXXX==    definitely lost: 288 bytes in 12 blocks
==XXXX==    indirectly lost: 768 bytes in 34 blocks
==XXXX==      possibly lost: 0 bytes in 0 blocks
==XXXX==    still reachable: 0 bytes in 0 blocks
==XXXX==         suppressed: 0 bytes in 0 blocks


We lost 1056 bytes of memory, just to return the length of a list with 3 elements. The problem of leaking memory is very real.

How do we solve this issue? We can’t embed memory management into our language; We want to keep it pure, and managing memory is typically pretty far from that goal. Instead, we will make our runtime do the work of freeing memory. Even then, this is a nontrivial goal: our runtime manipulates graphs, each of which can be combined with others in arbitrary ways. In general, there will not always be a single node that, when freed, will guarantee that another node can be freed as well. Instead, it’s very possible in our graphs that two parent nodes both refer to a third, and only when both parents are freed can we free that third node itself. Consider, for instance, the function square as follows:

defn square x = {
x * x
}


This function will receive, on top of the stack, a single graph representing x. It will then create two applications of a global (+) function, each time to the graph of x. Thus, it will construct a tree with two App nodes, both of which [note: We later take advantage of this, by replacing the graph of x with the result of evaluating it. Since both App nodes point to the same graph, when we evaluate it once, each node observes this update, and is not required to evaluate x again. With this, we achieve lazy evaluation. ] The runtime will have to wait until both App nodes are freed before it can free the graph of x.

This seems simple enough! If there are multiple things that may reference a node in the graph, why don’t we just keep track of how many there are? Once we know that no more things are still referencing a node, we can free it. This is called reference counting. Reference counting is a valid technique, but unfortunately, it will not suit us. The reason for this is that our language may produce cyclic graphs. Consider, for example, this definition of an infinite list of the number 1:

defn ones = { Cons 1 ones }


Envisioning the graph of the tree, we can see ones as an application of the constructor Cons to two arguments, one of which is ones again. [note: Things are actually more complicated than this. In our current language, recursive definitions are only possible in function definitions (like ones). In our runtime, each time there is a reference to a function, this is done through a new node, which means that functions with recursive definitions are not represented cyclically. Therefore, reference counting would work. However, in the future, our language will have more ways of creating circular definitions, some of which will indeed create cycles in our graphs. So, to prepare for this, we will avoid the use of reference counting. ] In this case, when we compute the number of nodes that require ones, we will always find the number to be at least 1: ones needs ones, which needs ones, and so on. It will not be possible for us to free ones, then, by simply counting the number of references to it.

There’s a more powerful technique than reference counting for freeing unused memory: mark-and-sweep garbage collection. This technique is conceptually pretty simple to grasp, yet will allow us to handle cycles in our graphs. Unsurprisingly, we implement this type of garbage collection in two stages:

1. Mark: We go through every node that is still needed by the runtime, and recursively mark it, its children, and so on as “to keep”.
2. Sweep: We go through every node we haven’t yet freed, and, if it hasn’t been marked as “to keep”, we free it.

This also seems simple enough. There are two main things for us to figure out:

1. For Mark, what are the “nodes still needed by the runtime”? These are just the nodes on the various G-machine stacks. If a node is not on the stack, nor is it a child of a node that is on the stack, why should we keep it around?
2. For Sweep, how do we keep track of all the nodes we haven’t yet freed? In our case, the solution is a global list of allocated nodes, which is updated every time that a node is allocated.

Wait a minute, though. Inside of unwind in C, we only have a reference to the most recent stack. Our execution model allows for an arbitrary number of stacks: we can keep using Eval, placing the current stack on the dump, and starting a new stack from scratch to evaluate a node. How can we traverse these stacks from inside unwind? One solution could be to have each stack point to the “parent” stack. To find all the nodes on the stack, then, we’d start with the current stack, mark all the nodes on it as “required”, then move on to the parent stack, rinse and repeat. This is plausible and pretty simple, but there’s another way.

We clean up after ourselves.

### Towards a Cleaner Stack

The G-machine compilation rules Simon Peyton Jones presents are written in a particular way. Every time that a function is called, all it leaves behind on the stack is the graph node that represents the function’s output. Our own internal functions, however, have been less careful. Consider, for instance, the “binary operator” function I showed you. Its body is given by the following G-machine instructions:

instructions.push_back(instruction_ptr(new instruction_push(1)));
instructions.push_back(instruction_ptr(new instruction_eval()));
instructions.push_back(instruction_ptr(new instruction_push(1)));
instructions.push_back(instruction_ptr(new instruction_eval()));
instructions.push_back(instruction_ptr(new instruction_binop(op)));


When the function is called, there are at least 3 things on the stack:

1. The “outermost” application node, to be replaced with an indirection (to enable laziness).
2. The second argument to the binary operator.
3. The first argument to the binary operator.

Then, Push adds another node to the stack, an Eval forces its evaluation (and leaves it on the stack). This happens again with the second argument. Finally, we call BinOp, popping two values off the stack and combining them according to the binary operator. This leaves the stack with 4 things: the 3 I described above, and thew newly computed value. This is fine as far as eval is concerned: its implementation only asks for the top value on the stack after unwind finishes. But for anything more complicated, this is a very bad side effect. We want to leave the stack as clean as we found it - with one node and no garbage.

Fortunately, the way we compile functions is a good guide for how we should compile internal operators and constructors. The idea is captured by the two instructions we insert at the end of a user-defined function:

    instructions.push_back(instruction_ptr(new instruction_update(params.size())));
instructions.push_back(instruction_ptr(new instruction_pop(params.size())));


Once a result is computed, we turn the node that represented the application into an indirection, and point it to the computed result (as I said before, this enables lazy evaluation). We also pop the arguments given to the function off the stack. Let’s add these two things to the gen_llvm_internal_op function:

void gen_llvm_internal_op(llvm_context& ctx, binop op) {
auto new_function = ctx.create_custom_function(op_action(op), 2);
std::vector<instruction_ptr> instructions;
instructions.push_back(instruction_ptr(new instruction_push(1)));
instructions.push_back(instruction_ptr(new instruction_eval()));
instructions.push_back(instruction_ptr(new instruction_push(1)));
instructions.push_back(instruction_ptr(new instruction_eval()));
instructions.push_back(instruction_ptr(new instruction_binop(op)));
instructions.push_back(instruction_ptr(new instruction_update(2)));
instructions.push_back(instruction_ptr(new instruction_pop(2)));
ctx.builder.SetInsertPoint(&new_function->getEntryBlock());
for(auto& instruction : instructions) {
instruction->gen_llvm(ctx, new_function);
}
ctx.builder.CreateRetVoid();
}


Notice, in particular, the instruction_update(2) and instruction_pop(2) instructions that were recently added. A similar thing has to be done for data type constructors. The difference, though, is that Pack removes the data it packs from the stack, and thus, Pop is not needed:

void definition_data::gen_llvm_first(llvm_context& ctx) {
for(auto& constructor : constructors) {
auto new_function =
ctx.create_custom_function(constructor->name, constructor->types.size());
std::vector<instruction_ptr> instructions;
instructions.push_back(instruction_ptr(
new instruction_pack(constructor->tag, constructor->types.size())
));
instructions.push_back(instruction_ptr(new instruction_update(0)));
ctx.builder.SetInsertPoint(&new_function->getEntryBlock());
for (auto& instruction : instructions) {
instruction->gen_llvm(ctx, new_function);
}
ctx.builder.CreateRetVoid();
}
}


With this done, let’s run a quick test: let’s print the number of things on the stack at the end of an eval call (before the stack is freed, of course). We can compare the output of runtime without the fix (old) and with the fix (current):

         current                    old

Current stack size is 0  |  Current stack size: 1
Current stack size is 0  |  Current stack size: 1
Current stack size is 0  |  Current stack size: 1
Current stack size is 0  |  Current stack size: 1
Current stack size is 0  |  Current stack size: 0
Current stack size is 0  |  Current stack size: 0
Current stack size is 0  |  Current stack size: 3
Current stack size is 0  |  Current stack size: 0
Current stack size is 0  |  Current stack size: 3
Current stack size is 0  |  Current stack size: 0
Current stack size is 0  |  Current stack size: 3
Result: 3                |  Result: 3


The stack is now much cleaner! Every time eval is called, it starts with one node, and ends with one node (which is then popped).

### One Stack to Rule Them All

Wait a minute. If the stack is really always empty at the end, do we really need to construct a new stack every time? [note: There's some nuance to this. While it is true that for the most part, we can get rid of the new stacks in favor of a single one, our runtime will experience a change. The change lies in the Unwind-Global rule, which requires that the stack has as many children as the function needs arguments. Until now, there was no way for this condition to be accidentally satisfied: the function we were unwinding was the only thing on the stack. Now, though, things are different: the function being unwound may share a stack with something else, and just checking the stack size will not be sufficient. I believe that this is not a problem for us, since the compiler will only emit Eval instructions for things it knows are data types or numbers, meaning their type is not a partially applied function that is missing arguments. However, this is a nontrivial observation. ] , and Simon Peyton Jones seems to agree. In Implementing Functional Languages: a tutorial, he mentions that the dump does not need to be implemented as a real stack of stacks. So let’s try this out: instead of starting a new stack using eval, let’s use an existing one, by just calling unwind again. To do so, all we have to do is change our instruction_eval instruction. When the G-machine wants something evaluated now, it should just call unwind directly!

To make this change, we have to make unwind available to the compiler. We thus declare it in the llvm_context.cpp file:

    functions["unwind"] = Function::Create(
FunctionType::get(void_type, { gmachine_ptr_type }, false),
Function::LinkageTypes::ExternalLinkage,
"unwind",
&module
);


And even create a function to construct a call to unwind with the following signature:

    void create_unwind(llvm::Function*);


We implement it like so:

void llvm_context::create_unwind(Function* f) {
auto unwind_f = functions.at("unwind");
builder.CreateCall(unwind_f, { f->args().begin() });
}


Finally, the instruction_eval::gen_llvm method simply calls unwind:

void instruction_eval::gen_llvm(llvm_context& ctx, Function* f) const {
ctx.create_unwind(f);
}


After this change, we only call eval from main. Furthermore, since eval releases all the resources it allocates before returning, we won’t be able to [note: We were able to do this before, but that's because our runtime didn't free the nodes, ever. Now that it does, returning a node violates that node's lifetime. ] the result of the evaluation from it. Thus, we simply merge eval with main - combining the printing and the initialization / freeing code.

With this, only one stack will be allocated for the entirety of program execution. This doesn’t just help us save on memory allocations, but also solves the problem of marking valid nodes during garbage collection! Instead of traversing a dump of stacks, we can now simply traverse a single stack; all that we need is in one place.

So this takes care, more or less, of the “mark” portion of mark-and-sweep. Using the stack, we can recursively mark the nodes that we need. But what about “sweeping”? How can we possibly know of every node that we’ve allocated? There’s some more bookkeping for us to do.

### It’s All Connected

There exists a simple technique I’ve previously seen (and used) for keeping track of all the allocated memory. The technique is to turn all the allocated nodes into elements of a linked list. The general process of implementing this proceeds as follows:

1. To each node, add a “next” pointer.
2. Keep a handle to the whole node chain somewhere.
3. Add each newly allocated node to the front of the whole chain.

This “somewhere” could be a global variable. However, since we already pass a stack to almost all of our functions, it makes more sense to make the list handle a part of some data structure that will also contain the stack, and pass that around, instead. This keeps all of the G-machine data in one place, and in principle could allow for concurrent execution of more than one G-machine in a single program. Let’s call our new data structure gmachine:

struct gmachine {
struct stack stack;
struct node_base* gc_nodes;
int64_t gc_node_count;
int64_t gc_node_threshold;
};


Here, the stack field holds the G-machine stack, and the gc_nodes is the handle to the list of all the nodes we’ve allocated and not yet freed. Don’t worry about the gc_node_count and gc_threshold fields - we’ll get to them a little later.

This is going to be a significant change. First of all, since the handle won’t be global, it can’t be accessed from inside the alloc_* functions. Instead, we have to make sure to add nodes allocated through alloc_* to a G-machine somewhere wherever we call the allocators. To make it easier to add nodes to a G-machine GC handle, let’s make a new function, track:

struct node_base* gmachine_track(struct gmachine*, struct node_base*);


This function will add the given node to the G-machine’s handle, and return that same node. This way, we can wrap nodes in a call to gmachine_track. We will talk about this function’s implementation later in the post.

This doesn’t get us all the way to a working runtime, though: right now, we still pass around struct stack* instead of struct gmachine* everywhere. However, the whole point of adding the gmachine struct was to store more data in it! Surely we need that new data somewhere, and thus, we need to use the gmachine struct for some functions. What functions do need a whole gmachine*, and which ones only need a stack*?

1. [note: This might not be clear. Maybe pushing onto a stack will add a node to our GC handle, and so, we need to have access to the handle in stack_push. The underlying question is that of ownership: when we allocate a node, which part of the program does it "belong" to? The "owner" of the node should do the work of managing when to free it or keep it. Since we already agreed to create a gmachine struct to house the GC handle, it makes sense that nodes are owned by the G-machine. Thus, the assumption in functions like stack_push is that the "owner" of the node already took care of allocating and tracking it, and stack_push itself shouldn't bother. ] stack_push, stack_pop, and similar functions do not require a G-machine.
2. stack_alloc and stack_pack do need a G-machine, because they must allocate new nodes. Where the nodes are allocated, we should add them to the GC handle.
3. Since they use stack_alloc and stack_pack, generated functions also need a G-machine.
4. Since unwind calls the generated functions, it must also receive a G-machine.

As far as stack functions go, we only need to update stack_alloc and stack_pack. Everything else doesn’t require new node allocations, and thus, does not require the GC handle. However, this makes our code rather ugly: we have a set of mostly stack_* functions, followed suddenly by two gmachine_* functions. In the interest of cleanliness, let’s instead do the following:

1. Make all functions associated with G-machine rules (like Alloc, Update, and so on) require a gmachine*. This way, theres a correspondence between our code and the theory.
2. Leave the rest of the functions (stack_push, stack_pop, etc.) as is. They are not G-machine specific, and don’t require a GC handle, so there’s no need to touch them.

Let’s make this change. We end up with the following functions:

struct stack {
size_t size;
size_t count;
struct node_base** data;
};

void stack_init(struct stack* s);
void stack_free(struct stack* s);
void stack_push(struct stack* s, struct node_base* n);
struct node_base* stack_pop(struct stack* s);
struct node_base* stack_peek(struct stack* s, size_t o);
void stack_popn(struct stack* s, size_t n);

struct gmachine {
struct stack stack;
struct node_base* gc_nodes;
int64_t gc_node_count;
int64_t gc_node_threshold;
};

void gmachine_init(struct gmachine* g);
void gmachine_free(struct gmachine* g);
void gmachine_slide(struct gmachine* g, size_t n);
void gmachine_update(struct gmachine* g, size_t o);
void gmachine_alloc(struct gmachine* g, size_t o);
void gmachine_pack(struct gmachine* g, size_t n, int8_t t);
void gmachine_split(struct gmachine* g, size_t n);
struct node_base* gmachine_track(struct gmachine* g, struct node_base* b);
void gmachine_gc(struct gmachine* g);

For the majority of the changed functions, the updates are [note: We must also update the LLVM/C++ declarations of the affected functions: many of them now take a gmachine_ptr_type instead of stack_ptr_type. This change is not shown explicitly here (it is hard to do with our growing code base), but it is nonetheless significant. ] The functions that require more significant modifications are gmachine_alloc and gmachine_pack. In both, we must now make a call to gmachine_track to ensure that a newly allocated node will be garbage collected in the future. The updated code for gmachine_alloc is:

void gmachine_alloc(struct gmachine* g, size_t o) {
while(o--) {
stack_push(&g->stack,
gmachine_track(g, (struct node_base*) alloc_ind(NULL)));
}
}

Correspondingly, the updated code for gmachine_pack is:

void gmachine_pack(struct gmachine* g, size_t n, int8_t t) {
assert(g->stack.count >= n);

struct node_base** data = malloc(sizeof(*data) * (n + 1));
assert(data != NULL);
memcpy(data, &g->stack.data[g->stack.count - n], n * sizeof(*data));
data[n] = NULL;

struct node_data* new_node = (struct node_data*) alloc_node();
new_node->array = data;
new_node->base.tag = NODE_DATA;
new_node->tag = t;

stack_popn(&g->stack, n);
stack_push(&g->stack, gmachine_track(g, (struct node_base*) new_node));
}

Note that we’ve secretly made one more change. Instead of allocating sizeof(*data) * n bytes of memory for the array of packed nodes, we allocate sizeof(*data) * (n + 1), and set the last element to NULL. This will allow other functions (which we will soon write) to know how many elements are packed inside a node_data (effectively, we’ve added a NULL terminator).

We must change our compiler to keep it up to date with this change. Importantly, it must know that a G-machine struct exists. To give it this information, we add a new llvm::StructType* called gmachine_type to the llvm_context class, initialize it in the constructor, and set its body as follows:

    gmachine_type->setBody(
stack_ptr_type,
node_ptr_type,
IntegerType::getInt64Ty(ctx),
IntegerType::getInt64Ty(ctx)
);


The compiler must also know that generated functions now use the G-machine struct rather than a stack struct:

    function_type = FunctionType::get(Type::getVoidTy(ctx), { gmachine_ptr_type }, false);


Since we still use some functions that require a stack and not a G-machine, we must have a way to get the stack from a G-machine. To do this, we create a new unwrap function, which uses LLVM’s GEP instruction to get a pointer to the G-machine’s stack field:

Value* llvm_context::unwrap_gmachine_stack_ptr(Value* g) {
auto offset_0 = create_i32(0);
return builder.CreateGEP(g, { offset_0, offset_0 });
}


We use this function elsewhere, such llvm_context::create_pop:

Value* llvm_context::create_pop(Function* f) {
auto pop_f = functions.at("stack_pop");
return builder.CreateCall(pop_f, { unwrap_gmachine_stack_ptr(f->arg_begin()) });
}


Finally, we want to make sure our generated functions don’t allocate nodes without tracking them with the G-machine. To do so, we modify all the create_* methods to require the G-machine function argument, and update the functions themselves to call gmachine_track. For example, here’s llvm_context::create_num:

Value* llvm_context::create_num(Function* f, Value* v) {
auto alloc_num_f = functions.at("alloc_num");
auto alloc_num_call = builder.CreateCall(alloc_num_f, { v });
return create_track(f, alloc_num_call);
}


Of course, this requires us to add a new create_track method to the llvm_context:

Value* llvm_context::create_track(Function* f, Value* v) {
auto track_f = functions.at("gmachine_track");
return builder.CreateCall(track_f, { f->arg_begin(), v });
}


This is good. Let’s now implement the actual mark-and-sweep algorithm in gmachine_gc:

void gmachine_gc(struct gmachine* g) {
for(size_t i = 0; i < g->stack.count; i++) {
gc_visit_node(g->stack.data[i]);
}

struct node_base** head_ptr = &g->gc_nodes;
while(*head_ptr) {
if((*head_ptr)->gc_reachable) {
(*head_ptr)->gc_reachable = 0;
head_ptr = &(*head_ptr)->gc_next;
} else {
struct node_base* to_free = *head_ptr;
*head_ptr = to_free->gc_next;
free_node_direct(to_free);
free(to_free);
g->gc_node_count--;
}
}
}

In the code above, we first iterate through the stack, calling gc_visit_node on every node that we encounter. The assumption is that once gc_visit_node is done, every node that can be reached has its gc_reachable field set to 1, and all the others have it set to 0.

Once we reach the end of the stack, we continue to the “sweep” phase, iterating through the linked list of nodes (held in the G-machine GC handle gc_nodes). For each node, if its gc_reachable flag is not set, we remove it from the linked list, and call free_node_direct on it. Otherwise (that is, if the flag is set), we clear it, so that the node can potentially be garbage collected in a future invocation of gmachine_gc.

gc_visit_node recursively marks a node and its children as reachable:

void gc_visit_node(struct node_base* n) {
if(n->gc_reachable) return;
n->gc_reachable = 1;

if(n->tag == NODE_APP) {
struct node_app* app = (struct node_app*) n;
gc_visit_node(app->left);
gc_visit_node(app->right);
} if(n->tag == NODE_IND) {
struct node_ind* ind = (struct node_ind*) n;
gc_visit_node(ind->next);
} if(n->tag == NODE_DATA) {
struct node_data* data = (struct node_data*) n;
struct node_base** to_visit = data->array;
while(*to_visit) {
gc_visit_node(*to_visit);
to_visit++;
}
}
}

This is possible with the node_data nodes because of the change we made to the gmachine_pack instruction earlier: now, the last element of the “packed” array is NULL, telling gc_visit_node that it has reached the end of the list of children.

free_node_direct performs a non-recursive deallocation of all the resources held by a particular node. So far, this is only needed for node_data nodes, since the arrays holding their children are dynamically allocated. Thus, the code for the function is pretty simple:

void free_node_direct(struct node_base* n) {
if(n->tag == NODE_DATA) {
free(((struct node_data*) n)->array);
}
}

### When to Collect

When should we run garbage collection? Initially, I tried running it after every call to unwind. However, this quickly proved impractical: the performance of all the programs in the language decreased by a spectacular amount. Programs like works1.txt and works2.txt would take tens of seconds to complete.

Instead of this madness, let’s settle for an approach common to many garbage collectors. Let’s perform garbage collection every time the amount of memory we’ve allocated doubles. Tracking when the amount of allocated memory doubles is the purpose of the gc_node_count and gc_threshold fields in the gmachine struct. The former field tracks how many nodes have been tracked by the garbage collector, and the latter holds the number of nodes the G-machine must reach before triggering garbage collection.

Since the G-machine is made aware of allocations by a call to the gmachine_track function, this is where we will attempt to perform garbage collection. We end up with the following code:

struct node_base* gmachine_track(struct gmachine* g, struct node_base* b) {
g->gc_node_count++;
b->gc_next = g->gc_nodes;
g->gc_nodes = b;

if(g->gc_node_count >= g->gc_node_threshold) {
uint64_t nodes_before = g->gc_node_count;
gc_visit_node(b);
gmachine_gc(g);
g->gc_node_threshold = g->gc_node_count * 2;
}

return b;
}


When a node is added to the GC handle, we increment the gc_node_count field. If the new value of this field exceeds the threshold, we perform garbage collection. There are cases in which this is fairly dangerous: for instance, gmachine_pack first moves all packed nodes into an array, then allocates a node_data node. This means that for a brief moment, the nodes stored into the new data node are inaccessible from the stack, and thus susceptible to garbage collection. To prevent situations like this, we run gc_visit_node on the node being tracked, marking it and its children as “reachable”. Finally, we set the next “free” threshold to double the number of currently allocated nodes.

This is about as much as we need to do. The change in this post was a major one, and required updating multiple files. As always, you’re welcome to check out the compiler source code for this post. To wrap up, let’s evaluate our change.

To especially stress the compiler, I came up with a prime number generator. Since booleans are not in the standard library, and since it isn’t possible to pattern match on numbers, my only option was the use Peano encoding. This effectively means that numbers are represented as linked lists, which makes garbage collection all the more important. The program is quite long, but you can find the entire code here.

When I ran the primes program compiled using the previous version of the compiler using time, I got the following output:

Maximum resident set size (kbytes): 935764
Minor (reclaiming a frame) page faults: 233642


In contrast, here is the output of time when running the same program compiled with the new version of the compiler:

Maximum resident set size (kbytes): 7448
Minor (reclaiming a frame) page faults: 1577


We have reduced maximum memory usage by a factor of 125, and the number of page faults by a factor of 148. That seems pretty good!

With this success, we end today’s post. As I mentioned before, we’re not done. The language is still clunky to use, and can benefit from let/in expressions and lambda functions. Furthermore, our language is currently monomorphic, and would be much better with polymorphism. Finally, to make our language capable of more-than-trivial work, we may want to implement Input/Output and strings. We tackle the largest of these features, polymorphism, in Part 10.