A Typesafe Representation of an Imperative Language

Table of Contents

A recent homework assignment for my university’s programming languages course was to encode the abstract syntax for a small imperative language into Haskell data types. The language consisted of very few constructs, and was very much a “toy”. On the expression side of things, it had three registers (A, B, and R), numbers, addition, comparison using “less than”, and logical negation. It also included a statement for storing the result of an expression into a register, if/else, and an infinite loop construct with an associated break operation. A sample program in the language which computes the product of two numbers is as follows:

A := 7
B := 9
R := 0
do
  if A <= 0 then
    break
  else
    R := R + B;
    A := A + -1;
  end
end

The homework notes that type errors may arise in the little imperative language. We could, for instance, try to add a boolean to a number: 3 + (1 < 2). Alternatively, we could try use a number in the condition of an if/else expression. A “naive” encoding of the abstract syntax would allow for such errors.

However, assuming that registers could only store integers and not booleans, it is fairly easy to separate the expression grammar into two nonterminals, yielding boolean and integer expressions respectively. Since registers can only store integers, the (:=) operation will always require an integer expression, and an if/else statement will always require a boolean expression. A matching Haskell encoding would not allow “invalid” programs to compile. That is, the programs would be type-correct by construction.

Then, a question arose in the ensuing discussion: what if registers could contain booleans? It would be impossible to create such a “correct-by-construction” representation then, wouldn’t it? Although I don’t know about Haskell, [note: I am pretty certain that a similar encoding in Haskell is possible. However, Haskell wasn't originally created for that kind of abuse of its type system, so it would probably not look very good. ] I am sure that it is possible to do this in Idris, a dependently typed programming language. In this post I will talk about how to do that.

Registers and Expressions

Let’s start by encoding registers. Since we only have three registers, we can encode them using a simple data type declaration, much the same as we would in Haskell:

From TypesafeImp.idr, line 1

1

data Reg = A | B | R

Now that registers can store either integers or booleans (and only those two), we need to know which one is which. For this purpose, we can declare another data type:

From TypesafeImp.idr, line 3

3

data Ty = IntTy | BoolTy

At any point in the (hypothetical) execution of our program, each of the registers will have a type, either boolean or integer. The combined state of the three registers would then be the combination of these three states; we can represent this using a 3-tuple:

From TypesafeImp.idr, lines 5 through 6

5
6

TypeState : Type
TypeState = (Ty, Ty, Ty)

Let’s say that the first element of the tuple will be the type of the register A, the second the type of B, and the third the type of R. Then, we can define two helper functions, one for retrieving the type of a register, and one for changing it:

From TypesafeImp.idr, lines 8 through 16



 8
 9
10
11
12
13
14
15
16



getRegTy : Reg -> TypeState -> Ty
getRegTy A (a, _, _) = a
getRegTy B (_, b, _) = b
getRegTy R (_, _, r) = r

setRegTy : Reg -> Ty -> TypeState -> TypeState
setRegTy A a (_, b, r) = (a, b, r)
setRegTy B b (a, _, r) = (a, b, r)
setRegTy R r (a, b, _) = (a, b, r)

Now, it’s time to talk about expressions. We know now that an expression can evaluate to either a boolean or an integer value (because a register can contain either of those types of values). Perhaps we can specify the type that an expression evaluates to in the expression’s own type: Expr IntTy would evaluate to integers, and Expr BoolTy would evaluate to booleans. Then, we could have constructors as follows:

Lit : Int -> Expr IntTy
Not : Expr BoolTy -> Expr BoolTy

Sounds good! But what about loading a register?

Load : Reg -> Expr IntTy -- no; what if the register is a boolean?
Load : Reg -> Expr BoolTy -- no; what if the register is an integer?
Load : Reg -> Expr a -- no; a register access can't be either!

The type of an expression that loads a register depends on the current state of the program! If we last stored an integer into a register, then loading from that register would give us an integer. But if we last stored a boolean into a register, then reading from it would give us a boolean. Our expressions need to be aware of the current types of each register. To do so, we add the state as a parameter to our Expr type. This would lead to types like the following:

-- An expression that produces a boolean when all the registers
-- are integers.
Expr (IntTy, IntTy, IntTy) BoolTy

-- An expression that produces an integer when A and B are integers,
-- and R is a boolean.
Expr (IntTy, IntTy, BoolTy) IntTy

In Idris, the whole definition becomes:

From TypesafeImp.idr, lines 18 through 23



18
19
20
21
22
23



data Expr : TypeState -> Ty -> Type where
  Lit : Int -> Expr s IntTy
  Load : (r : Reg) -> Expr s (getRegTy r s)
  Add : Expr s IntTy -> Expr s IntTy -> Expr s IntTy
  Leq : Expr s IntTy -> Expr s IntTy -> Expr s BoolTy
  Not : Expr s BoolTy -> Expr s BoolTy

The only “interesting” constructor is Load, which, given a register r, creates an expression that has r’s type in the current state s.

Statements

Statements are a bit different. Unlike expressions, they don’t evaluate to anything; rather, they do something. That “something” may very well be changing the current state. We could, for instance, set A to be a boolean, while it was previously an integer. This suggests equipping our Stmt type with two arguments: the initial state (before the statement’s execution), and the final state (after the statement’s execution). This would lead to types like this:

-- Statement that, when run while all registers contain integers,
-- terminates with registers B and R having been assigned boolean values.
Stmt (IntTy, IntTy, IntTy) (IntTy, BoolTy, BoolTy)

However, there’s a problem with loop and break. When we run a loop, we will require that the state at the end of one iteration is the same as the state at its beginning. Otherwise, it would be possible for a loop to keep changing the types of registers every iteration, and it would become impossible for us to infer the final state without actually running the program. In itself, this restriction isn’t a problem; most static type systems require both branches of an if/else expression to be of the same type for a similar reason. The problem comes from the interaction with break.

By itself, the would-be type of break seems innocent enough. It doesn’t change any registers, so we could call it Stmt s s. But consider the following program:

A := 0
B := 0
R := 0
do
  if 5 <= A then
    B := 1 <= 1
    break
    B := 0
  else
    A := A + 1
  end
end

The loop starts with all registers having integer values. As per our aforementioned loop requirement, the body of the loop must terminate with all registers still having integer values. For the first five iterations that’s exactly what will happen. However, after we increment A the fifth time, we will set B to a boolean value – using a valid statement – and then break. The break statement will be accepted by the typechecker, and so will the whole then branch. After all, it seems as though we reset B back to an integer value. But that won’t be the case. We will have jumped to the end of the loop, where we are expected to have an all-integer type, which we will not have.

The solution I came up with to address this issue was to add a third argument to Stmt, which contains the “context” type. That is, it contains the type of the innermost loop surrounding the statement. A break statement would only be permissible if the current type matches the loop type. With this, we finally write down a definition of Stmt:

From TypesafeImp.idr, lines 26 through 30



26
27
28
29
30



  data Stmt : TypeState -> TypeState -> TypeState -> Type where
    Store : (r : Reg) -> Expr s t -> Stmt l s (setRegTy r t s)
    If : Expr s BoolTy -> Prog l s n -> Prog l s n -> Stmt l s n
    Loop : Prog s s s -> Stmt l s s
    Break : Stmt s s s

The Store constructor takes a register r and an expression producing some type t in state s. From these, it creates a statement that starts in s, and finishes in a state similar to s, but with r now having type t. The loop type l remains unaffected and unused; we are free to assign any register any value.

The If constructor takes a condition Expr, which starts in state s and must produce a boolean. It also takes two programs (sequences of statements), each of which starts in s and finishes in another state n. This results in a statement that starts in state s, and finishes in state n. Conceptually, each branch of the if/else statement must result in the same final state (in terms of types); otherwise, we wouldn’t know which of the states to pick when deciding the final state of the If itself. As with Store, the loop type l is untouched here. Individual statements are free to modify the state however they wish.

The Loop constructor is very restrictive. It takes a single program (the sequence of instructions that it will be repeating). As we discussed above, this program must start and end in the same state s. Furthermore, this program’s loop type must also be s, since the loop we’re constructing will be surrounding the program. The resulting loop itself still has an arbitrary loop type l, since it doesn’t surround itself.

Finally, Break can only be constructed when the loop state matches the current state. Since we’ll be jumping to the end of the innermost loop, the final state is also the same as the loop state.

These are all the constructors we’ll be needing. It’s time to move on to whole programs!

Programs

A program is simply a list of statements. However, we can’t use a regular Idris list, because a regular list wouldn’t be able to represent the relationship between each successive statement. In our program, we want the final state of one statement to be the initial state of the following one, since they’ll be executed in sequence. To represent this, we have to define our own list-like GADT. The definition of the type turns out fairly straightforward:

From TypesafeImp.idr, lines 32 through 34

32
33
34

  data Prog : TypeState -> TypeState -> TypeState -> Type where
    Nil : Prog l s s
    (::) : Stmt l s n -> Prog l n m -> Prog l s m

The Nil constructor represents an empty program (much like the built-in Nil represents an empty list). Since no actions are done, it creates a Prog that starts and ends in the same state: s. The (::) constructor, much like the built-in (::) constructor, takes a statement and another program. The statement begins in state s and ends in state n; the program after that statement must then start in state n, and end in some other state m. The combination of the statement and the program starts in state s, and finishes in state m. Thus, (::) yields Prog s m. None of the constructors affect the loop type l: we are free to sequence any statements that we want, and it is impossible for us to construct statements using l that cause runtime errors.

This should be all! Let’s try out some programs.

Trying it Out

The following (type-correct) program compiles just fine:

From TypesafeImp.idr, lines 36 through 47



36
37
38
39
40
41
42
43
44
45
46
47



initialState : TypeState
initialState = (IntTy, IntTy, IntTy)

testProg : Prog Main.initialState Main.initialState Main.initialState
testProg =
  [ Store A (Lit 1 `Leq` Lit 2)
  , If (Load A)
    [ Store A (Lit 1) ]
    [ Store A (Lit 2) ]
  , Store B (Lit 2)
  , Store R (Add (Load A) (Load B))
  ]

First, it loads a boolean into register A; then, inside the if/else statement, it stores an integer into A. Finally, it stores another integer into B, and adds them into R. Even though A was a boolean at first, the type checker can deduce that it was reset back to an integer after the if/else, and the program is accepted. On the other hand, had we forgotten to set A to a boolean first:

  [ If (Load A)
    [ Store A (Lit 1) ]
    [ Store A (Lit 2) ]
  , Store B (Lit 2)
  , Store R (Add (Load A) (Load B))
  ]

We would get a type error:

Type mismatch between getRegTy A (IntTy, IntTy, IntTy) and BoolTy

The type of register A (that is, IntTy) is incompatible with BoolTy. Our initialState says that A starts out as an integer, so it can’t be used in an if/else right away! Similar errors occur if we make one of the branches of the if/else empty, or if we set B to a boolean.

We can also encode the example program from the beginning of this post:

From TypesafeImp.idr, lines 49 through 61



49
50
51
52
53
54
55
56
57
58
59
60
61



prodProg : Prog Main.initialState Main.initialState Main.initialState
prodProg =
  [ Store A (Lit 7)
  , Store B (Lit 9)
  , Store R (Lit 0)
  , Loop
    [ If (Load A `Leq` Lit 0)
      [ Break ]
      [ Store R (Load R `Add` Load B)
      , Store A (Load A `Add` Lit (-1))
      ]
    ]
  ]

This program compiles just fine, too! It is a little reminiscent of the program we used to demonstrate how break could break things if we weren’t careful. So, let’s go ahead and try break in an invalid state:

  [ Store A (Lit 7)
  , Store B (Lit 9)
  , Store R (Lit 0)
  , Loop
    [ If (Load A `Leq` Lit 0)
      [ Store B (Lit 1 `Leq` Lit 1), Break, Store B (Lit 0) ]
      [ Store R (Load R `Add` Load B)
      , Store A (Load A `Add` Lit (-1))
      ]
    ]
  ]

Again, the type checker complains:

Type mismatch between IntTy and BoolTy

And so, we have an encoding of our language that allows registers to be either integers or booleans, while still preventing type-incorrect programs!

Building an Interpreter

A good test of such an encoding is the implementation of an interpreter. It should be possible to convince the typechecker that our interpreter doesn’t need to handle type errors in the toy language, since they cannot occur.

Let’s start with something simple. First of all, suppose we have an expression of type Expr InTy. In our toy language, it produces an integer. Our interpreter, then, will probably want to use Idris’ type Int. Similarly, an expression of type Expr BoolTy will produce a boolean in our toy language, which in Idris we can represent using the built-in Bool type. Despite the similar naming, though, there’s no connection between Idris’ Bool and our own BoolTy. We need to define a conversion from our own types – which are values of type Ty – into Idris types that result from evaluating expressions. We do so as follows:

From TypesafeImp.idr, lines 63 through 65

63
64
65

repr : Ty -> Type
repr IntTy = Int
repr BoolTy = Bool

Similarly, we want to convert our TypeState (a tuple describing the types of our registers) into a tuple that actually holds the values of each register, which we will call State. The value of each register at any point depends on its type. My first thought was to define State as a function from TypeState to an Idris Type:

State : TypeState -> Type
State (a, b, c) = (repr a, repr b, repr c)

Unfortunately, this doesn’t quite cut it. The problem is that this function technically doesn’t give Idris any guarantees that State will be a tuple. The most Idris knows is that State will be some Type, which could be Int, Bool, or anything else! This becomes a problem when we try to pattern match on states to get the contents of a particular register. Instead, we have to define a new data type:

From TypesafeImp.idr, lines 67 through 68

67
68

data State : TypeState -> Type where
  MkState : (repr a, repr b, repr c) -> State (a, b, c)

In this snippet, State is still a (type level) function from TypeState to Type. However, by using a GADT, we guarantee that there’s only one way to construct a State (a,b,c): using a corresponding tuple. Now, Idris will accept our pattern matching:

From TypesafeImp.idr, lines 70 through 78



70
71
72
73
74
75
76
77
78



getReg : (r : Reg) -> State s -> repr (getRegTy r s)
getReg A (MkState (a, _, _)) = a
getReg B (MkState (_, b, _)) = b
getReg R (MkState (_, _, r)) = r

setReg : (r : Reg) -> repr t -> State s -> State (setRegTy r t s)
setReg A a (MkState (_, b, r)) = MkState (a, b, r)
setReg B b (MkState (a, _, r)) = MkState (a, b, r)
setReg R r (MkState (a, b, _)) = MkState (a, b, r)

The getReg function will take out the value of the corresponding register, returning Int or Bool depending on the TypeState. What’s important is that if the TypeState is known, then so is the type of getReg: no Either is involved here, and we can directly use the integer or boolean stored in the register. This is exactly what we do:

From TypesafeImp.idr, lines 80 through 85



80
81
82
83
84
85



expr : Expr s t -> State s -> repr t
expr (Lit i) _ = i
expr (Load r) s = getReg r s
expr (Add l r) s = expr l s + expr r s
expr (Leq l r) s = expr l s <= expr r s
expr (Not e) s = not $ expr e s

This is pretty concise. Idris knows that Lit i is of type Expr IntTy, and it knows that repr IntTy = Int, so it also knows that eval (Lit i) produces an Int. Similarly, we wrote Reg r to have type Expr s (getRegTy r s). Since getReg returns repr (getRegTy r s), things check out here, too. A similar logic applies to the rest of the cases.

The situation with statements is somewhat different. As we said, a statement doesn’t return a value; it changes state. A good initial guess would be that to evaluate a statement that starts in state s and terminates in state n, we would take as input State s and return State n. However, things are not quite as simple, thanks to Break. Not only does Break require special case logic to return control to the end of the Loop, but it also requires some additional consideration: in a statement of type Stmt l s n, evaluating Break can return State l.

To implement this, we’ll use the Either type. The Left constructor will be contain the state at the time of evaluating a Break, and will indicate to the interpreter that we’re breaking out of a loop. On the other hand, the Right constructor will contain the state as produced by all other statements, which would be considered the “normal” case. [note: We use Left for the "abnormal" case because of Idris' (and Haskell's) convention to use it as such. For instance, the two languages define a Monad instance for Either a where (>>=) behaves very much like it does for Maybe, with Left being treated as Nothing, and Right as Just. We will use this instance to clean up some of our computations. ] Note that this doesn’t indicate an error: we need to represent the two states (breaking out of a loop and normal execution) to define our language’s semantics.

From TypesafeImp.idr, lines 88 through 95



88
89
90
91
92
93
94
95



  stmt : Stmt l s n -> State s -> Either (State l) (State n)
  stmt (Store r e) s = Right $ setReg r (expr e s) s
  stmt (If c t e) s = if expr c s then prog t s else prog e s
  stmt (Loop p) s =
    case prog p s >>= stmt (Loop p) of
      Right s => Right s
      Left s => Right s
  stmt Break s = Left s

First, note the type. We return an Either value, which will contain State l (in the Left constructor) if a Break was evaluated, and State n (in the Right constructor) if execution went on without breaking.

The Store case is rather simple. We use setReg to update the result of the register r with the result of evaluating e. Because a store doesn’t cause us to start breaking out of a loop, we use Right to wrap the new state.

The If case is also rather simple. Its condition is guaranteed to evaluate to a boolean, so it’s sufficient for us to use Idris’ if expression. We use the prog function here, which implements the evaluation of a whole program. We’ll get to it momentarily.

Loop is the most interesting case. We start by evaluating the program p serving as the loop body. The result of this computation will be either a state after a break, held in Left, or as the normal execution state, held in Right. The (>>=) operation will do nothing in the first case, and feed the resulting (normal) state to stmt (Loop p) in the second case. This is exactly what we want: if we broke out of the current iteration of the loop, we shouldn’t continue on to the next iteration. At the end of evaluating both p and the recursive call to stmt, we’ll either have exited normally, or via breaking out. We don’t want to continue breaking out further, so we return the final state wrapped in Right in both cases. Finally, Break returns the current state wrapped in Left, beginning the process of breaking out.

The task of prog is simply to sequence several statements together. The monadic bind operator, (>>=), is again perfect for this task, since it “stops” when it hits a Left, but continues otherwise. This is the implementation:

From TypesafeImp.idr, lines 97 through 99

97
98
99

  prog : Prog l s n -> State s -> Either (State l) (State n)
  prog Nil s = Right s
  prog (st::p) s = stmt st s >>= prog p

Awesome! Let’s try it out, shall we? I defined a quick run function as follows:

From TypesafeImp.idr, lines 101 through 102

101
102

run : Prog l s l -> State s -> State l
run p s = either id id $ prog p s

We then have:

*TypesafeImp> run prodProg (MkState (0,0,0))
MkState (0, 9, 63) : State (IntTy, IntTy, IntTy)

This seems correct! The program multiplies seven by nine, and stops when register A reaches zero. Our test program runs, too:

*TypesafeImp> run testProg (MkState (0,0,0))
MkState (1, 2, 3) : State (IntTy, IntTy, IntTy)

This is the correct answer: A ends up being set to 1 in the then branch of the conditional statement, B is set to 2 right after, and R, the sum of A and B, is rightly 3.

As you can see, we didn’t have to write any error handling code! This is because the typechecker knows that type errors aren’t possible: our programs are guaranteed to be type correct. [note: Our programs aren't guaranteed to terminate: we're lucky that Idris' totality checker is turned off by default. ] This was a fun exercise, and I hope you enjoyed reading along! I hope to see you in my future posts.

Daniel's Blog