During my last academic term, I was enrolled in a compilers course. We had a final project - develop a compiler for a basic Python subset, using LLVM. It was a little boring - virtually nothing about the compiler was not covered in class, and it felt more like putting two puzzle pieces together than building a real project.
Instead, I chose to implement a compiler for a functional programming language, based on a wonderful book by Simon Peyton Jones, Implementing functional languages: a tutorial. Since the class was requiring the use of tools based on C++, that’s what I used for my compiler. It was neat little project, and I wanted to share with everyone else how one might go about writing their own functional language.
Motivation
There are two main motivating factors for this series.
First, whenever I stumble on a compiler implementation tutorial, the language created is always imperative, inspired by C, C++, JavaScript, or Python. There are many interesting things about compiling such a language. However, I also think that the compilation of a functional language (including features like lazy evaluation) is interesting enough, and rarely covered.
Second, I’m inspired by books such as Software Foundations that use source code as their text. The entire content of Software Foundations, for instance, is written as comments in Coq source file. This means that you can not only read the book, but also run the code and interact with it. This makes it very engaging to read. Because of this, I want to provide for each post a “snapshot” of the project code. All the code in the posts will directly mirror that snapshot. The code you’ll be reading will be runnable and open.
Overview
Let’s go over some preliminary information before we embark on this journey.
The “classic” stages of a compiler
Let’s take a look at the high level overview of what a compiler does. Conceptually, the components of a compiler are pretty cleanly separated. They are as follows:
- Tokenizing / lexical analysis
- Parsing
- Analysis / optimization
- Code Generation
There are many variations on this structure. Some compilers don’t optimize at all, some translate the program text into an intermediate representation, an alternative way of representing the program that isn’t machine code. In some compilers, the stages of parsing and analysis can overlap. In short, just like the pirate’s code, it’s more of a guideline than a rule.
What we will cover
We’ll go through the stages of a compiler, starting from scratch and building up our project. We’ll cover:
- Tokenizing using regular expressions and Flex.
- Parsing using context free grammars and Bison.
- Monomorphic type checking (including typing rules).
- Evaluation using graph reduction and the G-Machine.
- Compiling G-Machine instructions to machine code using LLVM.
We’ll be creating a lazily evaluated, functional language.
What we won’t cover
Surely a guide written by one person can’t be comprehensive. Not only do I have a finite amount of time to share this informatiom with you, but I also don’t want to dilute the content of the posts in this series. Furthermore, many things that we’ll be using for this tutorial are taught by numerous other sources, and those sources do a better job than I would. So, here are some things that you might want to know for this series, which won’t be covered by the series itself:
- Theory of computation, or, more specifically, automata theory. Deterministic and nondeterministic finite automata are briefly mentioned during tokenizing, and context free grammars are used in our parser. However, I don’t provide a real explanation for either of those things.
- Functional programming, with a touch of lambda calculus. We jump straight into implementing the concepts from these.
- C++. I do my best to use correct and modern C++, but I’m not an expert. I will not explain the syntax / semantics of C++ code included in these posts, but I will explain what my code does in the context of compilers.
The syntax of our language
Simon Peyton Jones, in his two works regarding compiling functional languages, remarks that most functional languages are very similar, and vary largely in syntax. That’s our main degree of freedom. We want to represent the following things, for sure:
- Defining functions
- Applying functions
- Arithmetic
- Algebraic data types (to represent lists, pairs, and the like)
- Pattern matching (to operate on data types)
We can additionally support anonymous (lambda) functions, but compiling those
is actually a bit trickier, so we will skip those for now. Arithmetic is the simplest to
define - let’s define it as we would expect: 3
is a number, 3+2*6
evaluates to 15.
Function application isn’t much more difficult - f x
means “apply f to x”, and
f x + g x
means sum the result of applying f to x and g to x. That is, function
application has higher precedence, or binds tighter than binary operators like plus.
Next, let’s define the syntax for declaring a function. Why not:
defn f x = { x + x }
As for declaring data types:
data List = { Nil, Cons Int List }
Notice that we are avoiding polymorphism here.
Let’s also define a syntax for pattern matching:
case l of {
Nil -> { 0 }
Cons x xs -> { x }
}
The above means “if the list l
is Nil
, then return 0, otherwise, if it’s
constructed from an integer and another list (as defined in our data
example),
return the integer”.
That’s it for the introduction! In the next post, we’ll cover tokenizng, which is the first step in converting source code into an executable program.
Navigation
Here are the posts that I’ve written so far for this series: