On a rainy Oregon day, I was walking between classes with a group of friends. We were discussing the various ways to obfuscate solutions to the weekly homework assignments in our Algorithms course: replace every if with a ternary expression, use single variable names, put everything on one line. I said:

The [note: This is in reference to a meme, Virgin vs Chad. A "chad" characteristic is masculine or "alpha" to the point of absurdity. ] move would be to make your own, different language for every homework assignment.

It was required of us to use [note: A friend suggested making a Haskell program that generates Python-based interpreters for languages. While that would be truly absurd, I'll leave this challenge for another day. ] for our solutions, so that was the first limitation on this challenge. Someone suggested to write the languages in Haskell, since that’s what we used in our Programming Languages class. So the final goal ended up:

It may not be worth it to create a whole [note: A general purpose language is one that's designed to be used in various domains. For instance, C++ is a general-purpose language because it can be used for embedded systems, GUI programs, and pretty much anything else. This is in contrast to a domain-specific language, such as Game Maker Language, which is aimed at a much narrower set of uses. ] language for each problem, but nowhere in the challenge did we say that it had to be general-purpose. In fact, some interesting design thinking can go into designing a domain-specific language for a particular assignment. So let’s jump right into it, and make a language for the first homework assignment.

Homework 1

There are two problems in Homework 1. Here they are, verbatim:

   Quickselect with Randomized Pivot (CLRS Ch. 9.2).

   >>> from qselect import *
   >>> qselect(2, [3, 10, 4, 7, 19])
   4
   >>> qselect(4, [11, 2, 8, 3])
   11

And the second:

   In the slides we showed a buggy version of qsort which is weird in an interesting way:
   it actually returns a binary search tree for the given array, rooted at the pivot:

   >>> from qsort import *
   >>> tree = sort([4,2,6,3,5,7,1,9])
   >>> tree
   [[[[], 1, []], 2, [[], 3, []]], 4, [[[], 5, []], 6, [[], 7, [[], 9, []]]]]

   which encodes a binary search tree:

                      4
                    /   \
                  2       6
                 / \     / \
                1   3   5   7
                             \
                              9
   
   Now on top of that piece of code, add three functions: 
   * sorted(t): returns the sorted order (infix traversal)
   * search(t, x): returns whether x is in t
   * insert(t, x): inserts x into t (in-place) if it is missing, otherwise does nothing.

We want to make a language specifically for these two tasks (one of which is split into many tasks). What common things can we isolate? I see two:

First, all the problems deal with lists. This may seem like a trivial observation, but these two problems are the only thing we use our language for. We have list access, [note: Quickselect is a variation on quicksort, which itself finds all the "lesser" and "greater" elements in the input array. ] and list creation. That should serve as a good base!

If you squint a little bit, all the problems are recursive with the same base case. Consider the first few lines of search, implemented naively:

def search(xs, k):
    if xs == []:
        return false

How about sorted? Take a look:

def sorted(xs):
    if xs == []:
        return []

I’m sure you see the picture. But it will take some real mental gymnastics to twist the rest of the problems into this shape. What about qselect, for instance? There’s two cases for what it may return:

The test cases never provide a concrete example of what should be returned from qselect in the first case, so we’ll interpret it like [note: For a quick sidenote about undefined behavior, check out how C++ optimizes the Collatz Conjecture function. Clang doesn't know whether or not the function will terminate (whether the Collatz Conjecture function terminates is an unsolved problem), but functions that don't terminate are undefined behavior. There's only one other way the function returns, and that's with "1". Thus, clang optimizes the entire function to a single "return 1" call. ] in C++: we can do whatever we want. So, let’s allow it to return [] in the None case. This makes this base case valid:

def qselect(xs, k):
    if xs == []:
        return []

“Oh yeah, now it’s all coming together.” With one more observation (which will come from a piece I haven’t yet shown you!), we’ll be able to generalize this base case.

The observation is this section in the assignment:

   Hint: both search and insert should depend on a helper function _search(tree, x) which 
   returns the subtree (a list) rooted at x when x is found, or the [] where x should 
   be inserted.

   e.g., 
   >>> tree = sort([4,2,6,3,5,7,1,9])        # starting from the initial tree
   >>> _search(tree, 3)
   [[], 3, []]
   >>> _search(tree, 0)
   []
   >>> _search(tree, 6.5)
   []
   >>> _search(tree, 0) is _search(tree, 6.5)
   False
   >>> _search(tree, 0) == _search(tree, 6.5)
   True

The real key is the part about “returning the [] where x should be inserted”. It so happens that when the list given to the function is empty, the number should be inserted precisely into that list. Thus:

def _search(xs, k):
    if xs == []:
        return xs

The same works for qselect:

def qselect(xs, k):
    if xs == []:
        return xs

And for sorted, too:

def sorted(xs):
    if xs == []:
        return xs

There are some functions that are exceptions, though:

def insert(xs, k):
    # We can't return early here!
    # If we do, we'll never insert anything.

Also:

def search(xs, k):
    # We have to return true or false, never
    # an empty list.

So, whenever we don’t return a list, we don’t want to add a special case. We arrive at the following common base case: whenever a function returns a list, if its first argument is the empty list, the first argument is immediately returned.

We’ve largely exhasuted the conclusiosn we can draw from these problems. Let’s get to designing a language.

A Silly Language

Let’s start by visualizing our goals. Without base cases, the solution to _search would be something like this:

_search(xs, k) =
    if xs[1] == k then xs
    else if xs[1] > k then _search(xs[0], k)
    else _search(xs[2], k);

Here we have an if-expression. It has to have an else, and evaluates to the value of the chosen branch. That is, if true then 0 else 1 evaluates to 0, while if false then 0 else 1 evaluates to 1. Otherwise, we follow the binary tree search algorithm faithfully.

Using this definition of _search, we can define search pretty easily:

search(xs, k) = |_search(xs, k)| != 0;

Let’s use Haskell’s (++) operator for concatentation. This will help us understand when the user is operating on lists, and when they’re not. With this, sorted becomes:

sorted(xs) = sorted(xs[0]) ++ [xs[1]] ++ sorted(xs[2]);

Let’s go for qselect now. We’ll introduce a very silly language feature for this problem: [note: You've probably never heard of list selectors, and for a good reason: this is a terrible language feature. I'll go in more detail later, but I wanted to make this clear right away. ] . We observe that qselect aims to partition the list into other lists. We thus add the following pieces of syntax:

~xs -> {
    pivot <- xs[rand]!
    left <- xs[#0 <= pivot]
    ...
} -> ...

There are three new things here.

  1. The actual “list selector”: ~xs -> { .. } -> .... Between the curly braces are branches which select parts of the list and assign them to new variables. Thus, pivot <- xs[rand]! assigns the element at a random index to the variable pivot. the ! at the end means “after taking this out of xs, delete it from xs”. The syntax [note: An observant reader will note that there's no need for the "xs" after the "~". The idea was to add a special case syntax to reference the "selected list", but I ended up not bothering. So in fact, this part of the syntax is useless. ] to make it easier to parse.
  2. The rand list access syntax. xs[rand] is a special case that picks a random element from xs.
  3. The xs[#0 <= pivot] syntax. This is another special case that selects all elements from xs that match the given predicate (where #0 is replaced with each element in xs).

The big part of qselect is to not evaluate right unless you have to. So, we shouldn’t eagerly evaluate the list selector. We also don’t want something like right[|right|-1] to evaluate right twice. So we settle on [note: Lazy evaluation means only evaluating an expression when we need to. Thus, although we might encounter the expression for right, we only evaluate it when the time comes. Lazy evaluation, at least the way that Haskell has it, is more specific: an expression is evaluated only once, or not at all. ] . Ah, but the ! marker introduces [note: A side effect is a term frequently used when talking about functional programming. Evaluating the expression xs[rand]! doesn't just get a random element, it also changes something else. In this case, that something else is the xs list. ] . So we can’t just evaluate these things all willy-nilly. So, let’s make it so that each expression in the selector list requires the ones above it. Thus, left will require pivot, and right will require left and pivot. So, lazily evaluated, ordered expressions. The whole qselect becomes:

qselect(xs,k) =
    ~xs -> {
        pivot <- xs[0]!
        left <- xs[#0 <= pivot]
        right <- xs[#0 > pivot]
    } ->
        if k > |left| + 1 then qselect(right, k - |left| - 1)
        else if k == |left| + 1 then [pivot]
        else qselect(left, k);

We’ve now figured out all the language constructs. Let’s start working on some implementation!

Implementation

It would be silly of me to explain every detail of creating a language in Haskell in this post; this is neither the purpose of the post, nor is it plausible to do this without covering monads, parser combinators, grammars, abstract syntax trees, and more. So, instead, I’ll discuss the interesting parts of the implementation.

Temporary Variables

Our language is expression-based, yes. A function is a single, arbitrarily complex expression (involving if/else, list selectors, and more). So it would make sense to translate a function to a single, arbitrarily complex Python expression. However, the way we’ve designed our language makes it not-so-suitable for converting to a single expression! For instance, consider xs[rand]. We need to compute the list, get its length, generate a random number, and then access the corresponding element in the list. We use the list here twice, and simply repeating the expression would not be very smart: we’d be evaluating twice. So instead, we’ll use a variable, assign the list to that variable, and then access that variable multiple times.

To be extra safe, let’s use a fresh temporary variable every time we need to store something. The simplest way is to simply maintain a counter of how many temporary variables we’ve already used, and generate a new variable by prepending the word “temp” to that number. We start with temp0, then temp1, and so on. To keep a counter, we can use a state monad:

type Translator = Control.Monad.State.State (Map.Map String [String], Int)

Don’t worry about the Map.Map String [String], we’ll get to that in a bit. For now, all we have to worry about is the second element of the tuple, the integer counting how many temporary variables we’ve used. We can get the current temporary variable as follows:

currentTemp :: Translator String
currentTemp = do
    t <- gets snd
    return $ "temp" ++ show t

We can also get a fresh temporary variable like this:

incrementTemp :: Translator String
incrementTemp = do
    modify (second (+1))
    currentTemp

Now, the [note: Since we are translating an expression, we must have the result of the translation yield an Python expression we can use in generating larger Python expressions. However, as we've seen, we occasionally have to use statements. Thus, the translateExpr function returns a Translator ([Py.PyStmt], Py.PyExpr). ] for generating a random list access looks like [note: The Py.* constructors are a part of a Python AST module I quickly threw together. I won't showcase it here, but you can always look at the source code for the blog (which includes this project) here. ]

translateExpr (Access e Random m) = do
    temp <- incrementTemp
    (sts, ce) <- translateExpr e
    let lenExpr = Py.FunctionCall (Py.Var "len") [Py.Var temp]
    let randExpr = Py.FunctionCall (Py.Var "randint") [ Py.IntLiteral 0,  lenExpr ]
    return (sts, singleAccess ce randExpr m)
Implementing “lazy evaluation”

Lazy evaluation in functional programs usually arises from [note: Graph reduction, more specifically the Spineless, Tagless G-machine is at the core of the Glasgow Haskell Compiler (GHC). Simon Peyton Jones' earlier book, Implementing Functional Languages: a tutorial details an earlier version of the G-machine. ] . However, Python is neither functional nor graph-based, and we only lazily evaluate list selectors. Thus, we’ll have to do some work to get our lazy evaluation to work as we desire. Here’s what I came up with:

  1. It’s difficult to insert Python statements where they are needed: we’d have to figure out in which scope each variable has already been declared, and in which scope it’s yet to be assigned.
  2. Instead, we can use a Python dictionary, called cache, and store computed versions of each variable in the cache.
  3. It’s pretty difficult to check if a variable is in the cache, compute it if not, and then return the result of the computation, in one expression. This is true, unless that single expression is a function call, and we have a dedicated function that takes no arguments, computes the expression if needed, and uses the cache otherwise. We choose this route.
  4. We have already promised that we’d evaluate all the selected variables above a given variable before evaluating the variable itself. So, each function will first call (and therefore [note: Forcing, in this case, comes from the context of lazy evaluation. To force a variable or an expression is to tell the program to compute its value, even though it may have been putting it off. ] ) the functions generated for variables declared above the function’s own variable.
  5. To keep track of all of this, we use the already-existing state monad as a reader monad (that is, we clear the changes we make to the monad after we’re done translating the list selector). This is where the Map.Map String [String] comes from.

The Map.Map String [String] keeps track of variables that will be lazily computed, and also of the dependencies of each variable (the variables that need to be access before the variable itself). We compute such a map for each selector as follows:

    let prereqs = snd $ foldl (\(ds, m) (Selector n es) -> (n:ds, Map.insert n ds m)) ([], Map.empty) ss

We update the existing map using Map.union:

    modify $ first $ Map.union prereqs

And, after we’re done generating expressions in the body of this selector, we clear it to its previous value vs:

    modify $ first $ const vs

We generate a single selector as follows:

translateSelector :: Selector -> Translator Py.PyStmt
translateSelector (Selector n e) =
    let
        cacheCheck = Py.NotIn (Py.StrLiteral n) (Py.Var "cache")
        cacheAccess = Py.Access (Py.Var "cache") [Py.StrLiteral n]
        cacheSet = Py.Assign (Py.AccessPat (Py.Var "cache") [Py.StrLiteral n])
        body e' = [ Py.IfElse cacheCheck [cacheSet e'] [] Nothing, Py.Return cacheAccess]
    in
        do
            (ss, e') <- translateExpr e
            vs <- gets fst
            let callPrereq p = Py.Standalone $ Py.FunctionCall (Py.Var p) []
            let prereqs = maybe [] (map callPrereq) $ Map.lookup n vs
            return $ Py.FunctionDef n [] $ ss ++ prereqs ++ body e'

This generates a function definition statement, which we will examine in generated Python code later on.

Solving the problem this way also introduces another gotcha: sometimes, a variable is produced by a function call, and other times the variable is just a Python variable. We write this as follows:

translateExpr :: Expr -> Translator ([Py.PyStmt], Py.PyExpr)
translateExpr (Var s) = do
    vs <- gets fst
    let sVar = Py.Var s
    let expr = if Map.member s vs then Py.FunctionCall sVar [] else sVar
    return ([], expr) 
Special Case Insertion

This is a silly language for a single homework assignment. I’m not planning to implement Hindley-Milner type inference, or anything of that sort. For the purpose of this language, things will be either a list, or not a list. And as long as a function can return a list, it can also return the list from its base case. Thus, that’s all we will try to figure out. The checking code is so short that we can include the whole snippet at once:

getPossibleType :: String -> Expr -> PossibleType
getPossibleType s (Var s') = if s == s' then List else Any
getPossibleType _ (ListLiteral _) = List
getPossibleType s (Split _ _ e) = getPossibleType s e
getPossibleType s (IfElse i t e) =
    foldl1 mergePossibleType $ map (getPossibleType s) [i, t, e]
getPossibleType _ (BinOp Insert _ _) = List
getPossibleType _ (BinOp Concat _ _) = List
getPossibleType _ _ = Any

mergePossibleType [note: An observant reader will note that this is just a logical OR function. It's not, however, good practice to use booleans for types that have two constructors with no arguments. Check out this Elm-based article about this, which the author calls the Boolean Identity Crisis. ] , given two possible types for an expression, the final type for the expression.

There’s only one real trick to this. Sometimes, like in _search, the only time we return something known to be a list, that something is xs. Since we’re making a list manipulation language, let’s assume the first argument to the function is a list, and use this information to determine expression types. We guess types in a very basic manner otherwise: If you use the concatenation operator, or a list literal, then obviously we’re working on a list. If you’re returning the first argument of the function, that’s also a list. Otherwise, it could be anything.

My Haskell linter actually suggested a pretty clever way of writing the whole “add a base case if this function returns a list” code. Check it out:

translateFunction :: Function -> Translator [Py.PyStmt]
translateFunction (Function n ps ex) = do
    let createIf p = Py.BinOp Py.Equal (Py.Var p) (Py.ListLiteral [])
    let createReturn p = Py.IfElse (createIf p) [Py.Return (Py.Var p)] [] Nothing
    let fastReturn = [createReturn p | p <- take 1 ps, getPossibleType p ex == List]
    (ss, e) <- translateExpr ex
    return $ return $ Py.FunctionDef n ps $ fastReturn ++ ss ++ [Py.Return e]

Specifically, look at the line with let fastReturn = .... It uses a list comprehension: we take a parameter p from the list of parameter ps, but only produce the statements for the base case if the possible type computed using p is List.

The Output

What kind of beast have we created? Take a look for yourself:

def qselect(xs,k):
    if xs==[]:
        return xs
    cache = {}
    def pivot():
        if ("pivot") not in (cache):
            cache["pivot"] = xs.pop(0)
        return cache["pivot"]
    def left():
        def temp2(arg):
            out = []
            for arg0 in arg:
                if arg0<=pivot():
                    out.append(arg0)
            return out
        pivot()
        if ("left") not in (cache):
            cache["left"] = temp2(xs)
        return cache["left"]
    def right():
        def temp3(arg):
            out = []
            for arg0 in arg:
                if arg0>pivot():
                    out.append(arg0)
            return out
        left()
        pivot()
        if ("right") not in (cache):
            cache["right"] = temp3(xs)
        return cache["right"]
    if k>(len(left())+1):
        temp4 = qselect(right(), k-len(left())-1)
    else:
        if k==(len(left())+1):
            temp5 = [pivot()]
        else:
            temp5 = qselect(left(), k)
        temp4 = temp5
    return temp4
def _search(xs,k):
    if xs==[]:
        return xs
    if xs[1]==k:
        temp6 = xs
    else:
        if xs[1]>k:
            temp8 = _search(xs[0], k)
        else:
            temp8 = _search(xs[2], k)
        temp6 = temp8
    return temp6
def sorted(xs):
    if xs==[]:
        return xs
    return sorted(xs[0])+[xs[1]]+sorted(xs[2])
def search(xs,k):
    return len(_search(xs, k))!=0
def insert(xs,k):
    return _insert(k, _search(xs, k))
def _insert(k,xs):
    if k==[]:
        return k
    if len(xs)==0:
        temp16 = xs
        temp16.append([])
        temp17 = temp16
        temp17.append(k)
        temp18 = temp17
        temp18.append([])
        temp15 = temp18
    else:
        temp15 = xs
    return temp15

It’s…horrible! All the tempX variables, three layers of nested function declarations, hardcoded cache access. This is not something you’d ever want to write. Even to get this code, I had to come up with hacks in a language I created. The first is the hack is to make the qselect function use the xs == [] base case. This doesn’t happen by default, because qselect doesn’t return a list! To “fix” this, I made qselect return the number it found, wrapped in a list literal. This is not up to spec, and would require another function to unwrap this list.

While qselect was struggling with not having the base case, insert had a base case it didn’t need: insert shouldn’t return the list itself when it’s empty, it should insert into it! However, when we use the << list insertion operator, the language infers insert to be a list-returning function itself, inserting into an empty list will always fail. So, we make a function _insert, which takes the arguments in reverse. The base case will still be generated, but the first argument (against which the base case is checked) will be a number, so the k == [] check will always fail.

That concludes this post. I’ll be working on more solutions to homework assignments in self-made languages, so keep an eye out!