In the previous post, we covered tokenizing. We learned how to convert an input string into logical segments, and even wrote up a tokenizer to do it according to the rules of our language. Now, it’s time to make sense of the tokens, and parse our language.

The Theory

The rules to parse a language are more complicated than the rules for recognizing tokens. For instance, consider a simple language of a matching number of open and closed parentheses, like () and ((())). You can’t write a regular expression for it! We resort to a wider class of languages, called context free languages. These languages are ones that are matched by context free grammars. A context free grammar is a list of rules in the form of \(S \rightarrow \alpha\), where \(S\) is a nonterminal (conceptualy, a thing that expands into other things), and \(\alpha\) is a sequence of nonterminals and terminals (a terminal is a thing that doesn’t expand into other things; for us, this is a token).

Let’s write a context free grammar (CFG from now on) to match our parenthesis language:

$$ \begin{align} S & \rightarrow ( S ) \\
S & \rightarrow () \end{align} $$

So, how does this work? We start with a “start symbol” nonterminal, which we usually denote as \(S\). Then, to get a desired string, we replace a nonterminal with the sequence of terminals and nonterminals on the right of one of its rules. For instance, to get (), we start with \(S\) and replace it with the body of the second one of its rules. This gives us () right away. To get ((())), we have to do a little more work:

$$ S \rightarrow (S) \rightarrow ((S)) \rightarrow ((())) $$

In practice, there are many ways of using a CFG to parse a programming language. Various parsing algorithms support various subsets of context free languages. For instance, top down parsers follow nearly exactly the structure that we had. They try to parse a nonterminal by trying to match each symbol in its body. In the rule \(S \rightarrow \alpha \beta \gamma\), it will first try to match \(\alpha\), then \(\beta\), and so on. If one of the three contains a nonterminal, it will attempt to parse that nonterminal following the same strategy. However, this leaves a flaw - For instance, consider the grammar $$ \begin{align} S & \rightarrow Sa \\
S & \rightarrow a \end{align} $$ A top down parser will start with \(S\). It will then try the first rule, which starts with \(S\). So, dutifully, it will try to parse that \(S\). And to do that, it will once again try the first rule, and find that it starts with another \(S\)… This will never end, and the parser will get stuck. A grammar in which a nonterminal can appear in the beginning of one of its rules left recursive, and top-down parsers aren’t able to handle those grammars.

We could rewrite our grammar without using left-recursion, but we don’t want to. Instead, we’ll use a bottom up parser, using specifically the LALR(1) parsing algorithm. Here’s an example of how it works, using our left-recursive grammar. We start with our goal string, and a “dot” indicating where we are. At first, the dot is behind all the characters: $$ .aaa $$ We see nothing interesting on the left side of the dot, so we move (or shift) the dot forward by one character: $$ a.aa $$ Now, on the left side of the dot, we see something! In particular, we see the body of one of the rules for \(S\) (the second one). So we reduce the thing on the left side of the dot, by replacing it with the left hand side of the rule (\(S\)): $$ S.aa $$ There’s nothing else we can do with the left side, so we shift again: $$ Sa.a $$ Great, we see another body on the left of the dot. We reduce it: $$ S.a $$ Just like before, we shift over the dot, and again, we reduce. We end up with our start symbol, and nothing on the right of the dot, so we’re done!

The Practice

In practice, we don’t want to just match a grammar. That would be like saying “yup, this is our language”. Instead, we want to create something called an abstract syntax tree, or AST for short. This tree captures the structure of our language, and is easier to work with than its textual representation. The structure of the tree we build will often mimic the structure of our grammar: a rule in the form \(S \rightarrow A B\) will result in a tree named “S”, with two children corresponding the trees built for A and B. Since an AST captures the structure of the language, we’ll be able to toss away some punctuation like , and (. These tokens will appear in our grammar, but we will tweak our parser to simply throw them away. Additionally, we will write our grammar ignoring whitespace, since our tokenizer conveniently throws that into the trash.

The grammar for arithmetic actually involves more effort than it would appear at first. We want to make sure that our parser respects the order of operations. This way, when we have our tree, it will immediately have the structure in which multiplication is done before addition. We do this by creating separate “levels” in our grammar, with one nonterminal matching addition and subtraction, while another nonterminal matches multiplication and division. We want the operation that has the least precedence to be higher in our tree than one of higher precedence. For instance, for 3+2*6, we want our tree to have + as the root, 3 as the left child, and the tree for 2*6 as the right child. Why? Because this tree represents “the addition of 3 and the result of multiplying 2 by 6”. If we had * be the root, we’d have a tree representing “the multiplication of the result of adding 3 to 2 and 6”, which is not what our expression means.

So, with this in mind, we want our rule for addition (represented with the nonterminal \(A_{add}\), to be matched first, and for its children to be trees created by the multiplication rule, \(A_{mult}\). So we write the following rules: $$ \begin{align} A_{add} & \rightarrow A_{add}+A_{mult} \\
A_{add} & \rightarrow A_{add}-A_{mult} \\
A_{add} & \rightarrow A_{mult} \end{align} $$ The first rule matches another addition, added to the result of a multiplication. Similarly, the second rule matches another addition, from which the result of a multiplication is then subtracted. We use the \(A_{add}\) on the left side of \(+\) and \(-\) in the body because we want to be able to parse strings like 1+2+3+4, which we want to view as ((1+2)+3)+4 (mostly because subtraction is left-associative). So, we want the top level of the tree to be the rightmost + or -, since that means it will be the “last” operation. You may be asking,

You define addition in terms of addition; how will parsing end? What if there’s no addition at all, like 2*6?

This is the purpose of the third rule, which serves to say “an addition expression can just be a multiplication, without any plusses or minuses.” Our rules for multiplication are very similar: $$ \begin{align} A_{mult} & \rightarrow A_{mult}*P \\
A_{mult} & \rightarrow A_{mult}/P \\
A_{mult} & \rightarrow P \end{align} $$

P, in this case, is an aplication (remember, application has higher precedence than any binary operator). Once again, if there’s no * or \, we simply fall through to a \(P\) nonterminal, representing application.

Application is refreshingly simple: $$ \begin{align} P & \rightarrow P B \\
P & \rightarrow B \end{align} $$ An application is either only one “thing” (represented with \(B\), for base), such as a number or an identifier, or another application followed by a thing.

We now need to define what a “thing” is. As we said before, it’s a number, or an identifier. We also make a parenthesized arithmetic expression a “thing”, allowing us to wrap right back around and recognize anything inside parentheses: $$ \begin{align} B & \rightarrow \text{num} \\
B & \rightarrow \text{lowerVar} \\
B & \rightarrow \text{upperVar} \\
B & \rightarrow ( A_{add} ) \\
B & \rightarrow C \end{align} $$ What’s the last \(C\)? We also want a “thing” to be a case expression. Here are the rules for that: $$ \begin{align} C & \rightarrow \text{case} \; A_{add} \; \text{of} \; \{ L_B\} \\
L_B & \rightarrow R \; L_B \\
L_B & \rightarrow R \\
R & \rightarrow N \; \text{arrow} \; \{ A_{add} \} \\
N & \rightarrow \text{lowerVar} \\
N & \rightarrow \text{upperVar} \; L_L \\
L_L & \rightarrow \text{lowerVar} \; L_L \\
L_L & \rightarrow \epsilon \end{align} $$ \(L_B\) is the list of branches in our case expression. \(R\) is a single branch, which is in the form Pattern -> Expression. \(N\) is a pattern, which we will for now define to be either a variable name (\(\text{lowerVar}\)), or a constructor with some arguments. The arguments of a constructor will be lowercase names, and a list of those arguments will be represented with \(L_L\). One of the bodies of this nonterminal is just the character \(\epsilon\), which just means “nothing”. We use this because a constructor can have no arguments (like Nil).

We can use these grammar rules to represent any expression we want. For instance, let’s try 3+(multiply 2 6), where multiply is a function that, well, multiplies. We start with \(A_{add}\): $$ \begin{align} & A_{add} \\
& \rightarrow A_{add} + A_{mult} \\
& \rightarrow A_{mult} + A_{mult} \\
& \rightarrow P + A_{mult} \\
& \rightarrow B + A_{mult} \\
& \rightarrow \text{num(3)} + A_{mult} \\
& \rightarrow \text{num(3)} + P \\
& \rightarrow \text{num(3)} + B \\
& \rightarrow \text{num(3)} + (A_{add}) \\
& \rightarrow \text{num(3)} + (A_{mult}) \\
& \rightarrow \text{num(3)} + (P) \\
& \rightarrow \text{num(3)} + (P \; \text{num(6)}) \\
& \rightarrow \text{num(3)} + (P \; \text{num(2)} \; \text{num(6)}) \\
& \rightarrow \text{num(3)} + (\text{lowerVar(multiply)} \; \text{num(2)} \; \text{num(6)}) \\
\end{align} $$

We’re almost there. We now want a rule for a “something that can appear at the top level of a program”, like a function or data type declaration. We make a new set of rules: $$ \begin{align} T & \rightarrow \text{defn} \; \text{lowerVar} \; L_P =\{ A_{add} \} \\
T & \rightarrow \text{data} \; \text{upperVar} = \{ L_D \} \\
L_D & \rightarrow D \; , \; L_D \\
L_D & \rightarrow D \\
L_P & \rightarrow \text{lowerVar} \; L_P \\
L_P & \rightarrow \epsilon \\
D & \rightarrow \text{upperVar} \; L_U \\
L_U & \rightarrow \text{upperVar} \; L_U \\
L_U & \rightarrow \epsilon \end{align} $$ That’s a lot of rules! \(T\) is the “top-level declaration rule. It matches either a function or a data definition. A function definition consists of the keyword “defn”, followed by a function name (starting with a lowercase letter), followed by a list of parameters, represented by \(L_P\).

A data type definition consists of the name of the data type (starting with an uppercase letter), and a list \(L_D\) of data constructors \(D\). There must be at least one data constructor in this list, so we don’t use the empty string rule here. A data constructor is simply an uppercase variable representing a constructor of the data type, followed by a list \(L_U\) of zero or more uppercase variables (representing the types of the arguments of the constructor).

Finally, we want one or more of these declarations in a valid program: $$ \begin{align} G & \rightarrow T \; G \\
G & \rightarrow T \end{align} $$

Just like with tokenizing, there exists a piece of software that will generate a bottom-up parser for us, given our grammar. It’s called Bison, and it is frequently used with Flex. Before we get to bison, though, we need to pay a debt we’ve already incurred - the implementation of our AST. Such a tree is language-specific, so Bison doesn’t generate it for us. Here’s what I came up with:

#pragma once
#include <memory>
#include <vector>

struct ast {
    virtual ~ast() = default;
};

using ast_ptr = std::unique_ptr<ast>;

struct pattern {
    virtual ~pattern() = default;
};

using pattern_ptr = std::unique_ptr<pattern>;

struct branch {
    pattern_ptr pat;
    ast_ptr expr;

    branch(pattern_ptr p, ast_ptr a)
        : pat(std::move(p)), expr(std::move(a)) {}
};

using branch_ptr = std::unique_ptr<branch>;

struct constructor {
    std::string name;
    std::vector<std::string> types;

    constructor(std::string n, std::vector<std::string> ts)
        : name(std::move(n)), types(std::move(ts)) {}
};

using constructor_ptr = std::unique_ptr<constructor>;

struct definition {
    virtual ~definition() = default;
};

using definition_ptr = std::unique_ptr<definition>;

enum binop {
    PLUS,
    MINUS,
    TIMES,
    DIVIDE
};

struct ast_int : public ast {
    int value;

    explicit ast_int(int v)
        : value(v) {}
};

struct ast_lid : public ast {
    std::string id;

    explicit ast_lid(std::string i)
        : id(std::move(i)) {}
};

struct ast_uid : public ast {
    std::string id;

    explicit ast_uid(std::string i)
        : id(std::move(i)) {}
};

struct ast_binop : public ast {
    binop op;  
    ast_ptr left;
    ast_ptr right;

    ast_binop(binop o, ast_ptr l, ast_ptr r)
        : op(o), left(std::move(l)), right(std::move(r)) {}
};

struct ast_app : public ast {
    ast_ptr left;
    ast_ptr right;

    ast_app(ast_ptr l, ast_ptr r)
        : left(std::move(l)), right(std::move(r)) {}
};

struct ast_case : public ast {
    ast_ptr of;
    std::vector<branch_ptr> branches;

    ast_case(ast_ptr o, std::vector<branch_ptr> b)
        : of(std::move(o)), branches(std::move(b)) {}
};

struct pattern_var : public pattern {
    std::string var;

    pattern_var(std::string v)
        : var(std::move(v)) {}
};

struct pattern_constr : public pattern {
    std::string constr;
    std::vector<std::string> params;

    pattern_constr(std::string c, std::vector<std::string> p)
        : constr(std::move(c)), params(std::move(p)) {}
};

struct definition_defn : public definition {
    std::string name;
    std::vector<std::string> params;
    ast_ptr body;

    definition_defn(std::string n, std::vector<std::string> p, ast_ptr b)
        : name(std::move(n)), params(std::move(p)), body(std::move(b)) {

    }
};

struct definition_data : public definition {
    std::string name;
    std::vector<constructor_ptr> constructors;

    definition_data(std::string n, std::vector<constructor_ptr> cs)
        : name(std::move(n)), constructors(std::move(cs)) {}
};
We create a base class for an expression tree, which we call ast. Then, for each possible syntactic construct in our language (a number, a variable, a binary operation, a case expression) we create a subclass of ast. The ast_case subclass is the most complex, since it must contain a list of case expression branches, which are a combination of a pattern and another expression.

Finally, we get to writing our Bison file, parser.y. Here’s what I come up with:

%{
#include <string>
#include <iostream>
#include "ast.hpp"
#include "parser.hpp"

std::vector<definition_ptr> program;
extern yy::parser::symbol_type yylex();

%}

%token PLUS
%token TIMES
%token MINUS
%token DIVIDE
%token <int> INT
%token DEFN
%token DATA
%token CASE
%token OF
%token OCURLY
%token CCURLY
%token OPAREN
%token CPAREN
%token COMMA
%token ARROW
%token EQUAL
%token <std::string> LID
%token <std::string> UID

%language "c++"
%define api.value.type variant
%define api.token.constructor

%type <std::vector<std::string>> lowercaseParams uppercaseParams
%type <std::vector<definition_ptr>> program definitions
%type <std::vector<branch_ptr>> branches
%type <std::vector<constructor_ptr>> constructors
%type <ast_ptr> aAdd aMul case app appBase
%type <definition_ptr> definition defn data 
%type <branch_ptr> branch
%type <pattern_ptr> pattern
%type <constructor_ptr> constructor

%start program

%%

program
    : definitions { program = std::move($1); }
    ;

definitions
    : definitions definition { $$ = std::move($1); $$.push_back(std::move($2)); }
    | definition { $$ = std::vector<definition_ptr>(); $$.push_back(std::move($1)); }
    ;

definition
    : defn { $$ = std::move($1); }
    | data { $$ = std::move($1); }
    ;

defn
    : DEFN LID lowercaseParams EQUAL OCURLY aAdd CCURLY
        { $$ = definition_ptr(
            new definition_defn(std::move($2), std::move($3), std::move($6))); }
    ;

lowercaseParams
    : %empty { $$ = std::vector<std::string>(); }
    | lowercaseParams LID { $$ = std::move($1); $$.push_back(std::move($2)); }
    ;

uppercaseParams
    : %empty { $$ = std::vector<std::string>(); }
    | uppercaseParams UID { $$ = std::move($1); $$.push_back(std::move($2)); }
    ;

aAdd
    : aAdd PLUS aMul { $$ = ast_ptr(new ast_binop(PLUS, std::move($1), std::move($3))); }
    | aAdd MINUS aMul { $$ = ast_ptr(new ast_binop(MINUS, std::move($1), std::move($3))); }
    | aMul { $$ = std::move($1); }
    ;

aMul
    : aMul TIMES app { $$ = ast_ptr(new ast_binop(TIMES, std::move($1), std::move($3))); }
    | aMul DIVIDE app { $$ = ast_ptr(new ast_binop(DIVIDE, std::move($1), std::move($3))); }
    | app { $$ = std::move($1); }
    ;

app
    : app appBase { $$ = ast_ptr(new ast_app(std::move($1), std::move($2))); }
    | appBase { $$ = std::move($1); }
    ;

appBase
    : INT { $$ = ast_ptr(new ast_int($1)); }
    | LID { $$ = ast_ptr(new ast_lid(std::move($1))); }
    | UID { $$ = ast_ptr(new ast_uid(std::move($1))); }
    | OPAREN aAdd CPAREN { $$ = std::move($2); }
    | case { $$ = std::move($1); }
    ;

case
    : CASE aAdd OF OCURLY branches CCURLY 
        { $$ = ast_ptr(new ast_case(std::move($2), std::move($5))); }
    ;

branches
    : branches branch { $$ = std::move($1); $1.push_back(std::move($2)); }
    | branch { $$ = std::vector<branch_ptr>(); $$.push_back(std::move($1));}
    ;

branch
    : pattern ARROW OCURLY aAdd CCURLY
        { $$ = branch_ptr(new branch(std::move($1), std::move($4))); }
    ;

pattern
    : LID { $$ = pattern_ptr(new pattern_var(std::move($1))); }
    | UID lowercaseParams
        { $$ = pattern_ptr(new pattern_constr(std::move($1), std::move($2))); }
    ;

data
    : DATA UID EQUAL OCURLY constructors CCURLY
        { $$ = definition_ptr(new definition_data(std::move($2), std::move($5))); }
    ;

constructors
    : constructors COMMA constructor { $$ = std::move($1); $$.push_back(std::move($3)); }
    | constructor
        { $$ = std::vector<constructor_ptr>(); $$.push_back(std::move($1)); }
    ;

constructor
    : UID uppercaseParams
        { $$ = constructor_ptr(new constructor(std::move($1), std::move($2))); }
    ;

There’s a few things to note here. First of all, the parser is the “source of truth” regarding what tokens exist in our language. We have a list of %token declarations, each of which corresponds to a regular expression in our scanner.

Next, observe that there’s a certain symmetry between our parser and our scanner. In our scanner, we mixed the theoretical idea of a regular expression with an action, a C++ code snippet to be executed when a regular expression is matched. This same idea is present in the parser, too. Each rule can produce a value, which we call a semantic value. For type safety, we allow each nonterminal and terminal to produce only one type of semantic value. For instance, all rules for \(A_{add}\) must produce an expression. We specify the type of each nonterminal and using %type directives. The types of terminals are specified when they’re declared.

Next, we must recognize that Bison was originally made for C, rather than C++. In order to allow the parser to store and operate on semantic values of various types, the canonical solution back in those times was to use a C union. Unions are great, but for C++, they’re more trouble than they’re worth: unions don’t allow for non-trivial constructors! This means that stuff like std::unique_ptr and std::string is off limits as a semantic value. But we’d really much rather use them! The solution is to:

  1. Specify the language to be C++, rather than C.
  2. Enable the variant API feature, which uses a lightweight std::variant alternative in place of a union.
  3. Enable the creation of token constructors, which we will use in Flex.

In order to be able to use the variant-based API, we also need to change the Flex yylex function to return yy::parser::symbol_type. You can see it in our forward declaration of yylex.

Now that we made these changes, it’s time to hook up Flex to all this. Here’s a new version of the Flex scanner, with all necessary modifications:

%option noyywrap

%{
#include <iostream>
#include "ast.hpp"
#include "parser.hpp"

#define YY_DECL yy::parser::symbol_type yylex()

%}

%%

[ \n]+ {}
\+ { return yy::parser::make_PLUS(); }
\* { return yy::parser::make_TIMES(); }
- { return yy::parser::make_MINUS(); }
\/ { return yy::parser::make_DIVIDE(); }
[0-9]+ { return yy::parser::make_INT(atoi(yytext)); }
defn { return yy::parser::make_DEFN(); }
data { return yy::parser::make_DATA(); }
case { return yy::parser::make_CASE(); }
of { return yy::parser::make_OF(); }
\{ { return yy::parser::make_OCURLY(); }
\} { return yy::parser::make_CCURLY(); }
\( { return yy::parser::make_OPAREN(); }
\) { return yy::parser::make_CPAREN(); }
,  { return yy::parser::make_COMMA(); }
-> { return yy::parser::make_ARROW(); }
= { return yy::parser::make_EQUAL(); }
[a-z][a-zA-Z]* { return yy::parser::make_LID(std::string(yytext)); }
[A-Z][a-zA-Z]* { return yy::parser::make_UID(std::string(yytext)); }

%%

The key two ideas are that we overrode the default signature of yylex by changing the YY_DECL preprocessor variable, and used the yy::parser::make_<TOKEN> functions to return the symbol_type rather than int.

Finally, let’s get a main function so that we can at least check for segmentation faults and other obvious mistakes:

#include "ast.hpp"
#include "parser.hpp"

void yy::parser::error(const std::string& msg) {
    std::cout << "An error occured: " << msg << std::endl;
}

extern std::vector<definition_ptr> program;

int main() {
    yy::parser parser;
    parser.parse();
    std::cout << program.size() << std::endl;
}

Now, we can compile and run the code:

bison -o parser.cpp -d parser.y
flex -o scanner.cpp scanner.l
g++ -c -o scanner.o scanner.cpp
g++ -c -o parser.o parser.cpp
g++ main.cpp parser.o scanner.o
We used the -d option for Bison to generate the compiler_parser.hpp header file, which exports our token declarations and token creation functions, allowing us to use them in Flex.

At last, we can feed some code to the parser from stdin. Let’s try it:

./a.out
defn main = { add 320 6 }
defn add x y = { x + y }

The program prints 2, indicating two declarations were made. Let’s try something obviously wrong:

./a.out
}{

We are told an error occured. Excellent! We’re not really sure what our tree looks like, though. We just know there’s stuff in the list of definitions. Having read our source code into a structure we’re more equipped to handle, we can now try to verify that the code makes sense in Part 3 - Type Checking.