Caiman is a language designed for heterogeneous programming that formalizes the communication between devices, allocations, and computations performed. This post details the current state of the Caiman Frontend and includes a brief overview of the implementation of some features such as lowering to continuation passing style and the e-graph-based type inference.
In heterogeneous programming, communication and synchronization between devices is extremely important for performance, but it’s not at all obvious what the optimal communication strategy should be. Take the following simple C program for example:
int selectSum(int * a, int * b) {
    int sum_a = sum(a);
    if (sum_a > 0) {
        return sum(b);
    } else {
        return sum_a;
    }
}
Instead of performing sum locally, a user might want to execute it on another device, perhaps a GPU. This might lead to another implementation that looks like this:
int selectSumGPU(int * a, int * b) {
    int sum_a;
    sumGPU(a, &sum_a);
    barrier();
    if (sum_a > 0) {
        int sum_b;
        sumGPU(b, &sum_b);
        barrier();
        return sum_b;
    } else {
        return sum_a;
    }
}
But wait, if we’re going to run these computations on the GPU, why don’t we run them in parallel? Let’s trade extra work for increased parallelism and have an implementation like this:
int selectSumGPU2(int * a, int * b) {
    int sum_a, sum_b;
    sumGPU(a, &sum_a);
    sumGPU(b, &sum_b);
    barrier();
    if (sum_a > 0) {
        return sum_b;
    } else {
        return sum_a;
    }
}
Or perhaps we perform one computation on the GPU and the other locally, still making the two calls to sum in parallel but reducing the amount of data transferred between the two devices:
int selectSumGPU3(int * a, int * b) {
    int sum_a;
    sumGPU(a, &sum_a);
    int sum_b = sum(b);
    barrier();
    if (sum_a > 0) {
        return sum_b;
    } else {
        return sum_a;
    }
}
We’ve now quickly come up with four different implementations of the same function, so which one is fastest? Well, that depends. If it’s very common for sum_a to be positive, then perhaps selectSumGPU2 or selectSumGPU3 would be ideal. But if the sizes of a and b are too small, then perhaps selectSum would be ideal since it doesn’t have the overhead of device synchronization and communication. The only real way to know what’s fastest for a given use case is to time them all.
If we were to write a function foo that makes two calls to selectSum, then with 4 implementation choices, that gives us 16 ways to implement foo. If another function bar calls foo twice, then we have 256 ways of implementing it just by changing which selectSum implementation we use. So determining the fastest implementation for a given workload involves exploring an exponentially large search space.
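The combinatorics above are easy to sanity-check. A minimal sketch (the `choices` helper is hypothetical, just exponentiation over interchangeable implementations):

```python
# With 4 interchangeable implementations of selectSum, a function
# making k calls to it has 4**k possible implementations.
def choices(per_call: int, calls: int) -> int:
    return per_call ** calls

foo = choices(4, 2)    # foo calls selectSum twice
bar = choices(foo, 2)  # bar calls foo twice
print(foo, bar)        # 16 256
```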
While this example might not seem that motivating, many important algorithms are composed of basic building blocks similar in spirit to selectSum.
Caiman solves this program space exploration problem in two ways.
First, it provides specification languages to specify what something computes (the value language), the synchronization and communication involved in the computation (the timeline language), and the allocations and deallocations (the spatial language). Notice in the selectSum example that each of the implementations computed the same thing and none of them had any allocations or deallocations. In Caiman, they would all have identical value and spatial specifications, but different timeline specs. This allows you to write implementations more easily by declaring which specs they implement and getting a guarantee that the implementation adheres to them. In the selectSum example, this would prevent us from implementing a selectSum that doesn’t compute the correct result.
Secondly, Caiman introduces an explicator that can use the specifications to wholly or partially generate implementations.
My work on this project has been developing a front-end user-facing language for Caiman.
When I came into this project, there was no way for someone to actually use Caiman. For testing, programs could be written in Caiman Assembly, a textual format that is a thin wrapper over Caiman’s intermediate representation (IR). Programs written in this form are extremely verbose and difficult to write. Writing a simple program to compute the minimum of two integers could take nearly 100 lines of code and a few hours to days to implement correctly by hand.
For my project, I designed and implemented a user-level language that someone could use to write Caiman programs. Some of my contributions are described in the sections that follow.
One of the things that makes writing programs in Caiman Assembly so difficult is the fact that the IR represents control flow in continuation passing style (CPS). In CPS, control flow is made explicit by each function invoking a continuation to pass control to the next function. These continuations can be thought of as “the rest of the program”. In CPS, an if would be implemented as calling one of two functions to pass control flow to.

This representation is convenient for the compiler, but unintuitive for a human to write. For example, a Rust function like this:
fn foo(x: i32) -> i32 {
    let c = if x < 0 {
        10
    } else {
        20
    };
    if x & 1 == 0 {
        c
    } else {
        c * 2
    }
}
would end up looking something like this in CPS:
let foo = |x: i32, k2: &Fn(i32)| {
    (|x: i32, k1: &Fn(i32, i32)| {
        let cond = x < 0;
        (|k3: &Fn(i32), k4: &Fn(i32)| { if cond { k3(x) } else { k4(x) } })
            (&|x: i32| k1(x, 10), &|x: i32| k1(x, 20))
    })(x, &|x: i32, c: i32| {
        let cond2 = x & 1 == 0;
        (|k5: &Fn(i32), k6: &Fn(i32)| { if cond2 { k5(c) } else { k6(c) } })
            (&|c: i32| k2(c), &|c: i32| k2(c * 2))
    })
};
There is a close correspondence between the CPS and control flow graph (CFG) representation of a program, as seen by the following control flow graph for this example program:
In the CPS example, the starting block (which I’m calling k0) uses cond to select between invoking k3 and k4, both of which pass their result to k1. Then k1 uses cond2 to select between invoking k5 and k6, and both of these pass their result to k2. This is the same structure depicted in the CFG, so it’s fairly easy to convert between CFG and CPS representations.
By using a control flow graph as an intermediary step, we can use standard compiler techniques to transform the input Caiman program into a CFG, and then transform the CFG into the CPS form of the Caiman assembly, which is the target language of the Caiman Frontend.
A typical live variables analysis over the CFG representation of the program is then used to determine the input variables to each continuation. In the example, this can be seen with how x is passed from k0 to k3 and k4, and then to k1, and how c is passed from k1 to k5 and k6, and then to k2.
The Caiman Frontend provides the standard arithmetic operators and nested expressions programmers are accustomed to from most programming languages; for example, 3 + 4 * (x / 3) performs three operations in one expression. Flattening expression trees is a standard thing for compiler frontends to do, but in Caiman, all operations must be implemented as functions on a particular device. So we lower operators to calls to standard library functions that run on the local device.
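The flattening step can be sketched as a small recursive walk. This is a hedged illustration, not the Frontend’s code: the operator names (`add`, `mul`, `div`) and temporary naming scheme are hypothetical stand-ins for the device-local standard library calls.

```python
import itertools

counter = itertools.count()

def flatten(expr, out):
    """expr is a leaf (name or constant) or a tuple (op, lhs, rhs).
    Appends one single-operation call per tree node to `out`."""
    if not isinstance(expr, tuple):
        return expr
    op, lhs, rhs = expr
    a = flatten(lhs, out)
    b = flatten(rhs, out)
    tmp = f"_t{next(counter)}"
    out.append((tmp, op, a, b))  # tmp = op(a, b)
    return tmp

# 3 + 4 * (x / 3)
prog = []
result = flatten(("add", 3, ("mul", 4, ("div", "x", 3))), prog)
for dst, op, a, b in prog:
    print(f"{dst} = {op}({a}, {b})")
```

Each line of output is one call to a (hypothetical) standard library function, with nested subexpressions replaced by temporaries.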
The Caiman Frontend also enables recursive functions. In Caiman Assembly, recursion needs yield points, which give control back to the user. The operational model of Caiman is that the user writes a pipeline, which may not have cycles. To implement loops, that pipeline would be invoked repeatedly. So a yield point gives control back to the user so they can invoke the pipeline again or get the result when the recursion stops.
The Caiman Frontend will construct a call graph and automatically insert yield points where recursion is detected, so a user doesn’t have to worry about the underlying technical details that enable recursion.
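The detection step can be sketched as cycle-finding over the call graph: any function that participates in a cycle needs a yield point. A minimal illustration (the function names and `functions_on_cycles` helper are hypothetical, not the Frontend’s actual pass):

```python
def functions_on_cycles(call_graph):
    on_cycle = set()

    def dfs(node, stack, visited):
        if node in stack:
            # everything from the first occurrence of `node` onward
            # participates in the cycle
            on_cycle.update(stack[stack.index(node):])
            return
        if node in visited:
            return
        visited.add(node)
        for callee in call_graph.get(node, []):
            dfs(callee, stack + [node], visited)

    for f in call_graph:
        dfs(f, [], set())
    return on_cycle

cg = {"main": ["rec_sum_impl"], "rec_sum_impl": ["rec_sum_impl"]}
print(functions_on_cycles(cg))  # {'rec_sum_impl'}
```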
For example, this is how one could implement a recursive sum that computes the sum from a to b in Caiman:
#version 0.1.0

// timeline spec, no synchronizations so the spec is trivial
tmln time(e: Event) -> Event { returns e }

// spatial spec, no allocations so the spec is trivial
sptl space(s: BufferSpace) -> BufferSpace { returns s }

// value spec, basically describes the actual computation being done in a way
// that's agnostic to allocations and synchronizations
val rec_sum(a: i64, b: i64) -> i64 {
    // observe that specifications invoke other specifications,
    // NOT implementations, allowing a single spec to have
    // many different implementations. So we use `rec_sum`
    // here but `rec_sum_impl` in the implementation.
    // Specifically, a specification will call a function equivalence
    // class. Any spec that isn't a member of an explicit user-defined
    // equivalence class is a member of its own equivalence class
    // which shares its name
    returns 0 if a > b else
            a if a == b else
            rec_sum(a + 1, b) + a
}

// implementation (aka the schedule); `impls` lists the specs the schedule implements
fn rec_sum_impl(a: i64, b: i64) -> i64 impls rec_sum, time, space {
    // no return keyword, the final expression is the return value
    if a > b {
        0
    } else if a == b {
        a
    } else {
        rec_sum_impl(a + 1, b) + a
    }
}

// the main pipeline is essentially the entry point to the program
pipeline main { rec_sum_impl }
Excluding whitespace and comments, this implementation takes 18 lines of code, compared to the 131 lines of code required to implement the same thing in the textual representation of the Caiman Assembly.
As this example shows, this work enables nested control flow in the specifications and implementations, as well as nested expressions such as rec_sum_impl(a + 1, b) + a, which would otherwise need to be implemented as a series of commands in Caiman Assembly.
In Caiman, if something is explicitly mentioned in the spec, then it must occur in the implementation. For example, consider this function which returns 2 in a convoluted way:
val min() -> i64 {
    returns (2 if 2 < 1 else 1) + 1
}

fn min_impl() -> i64 impls min, time, space
{
    // declare a constant
    let a = 2;
    // The Caiman Frontend allows if branches to return a value and store
    // it in a variable in the parent scope. This works like Rust
    let r = if 2 < 1 {
        a
    } else {
        1
    };
    // the return value is the final expression (no terminating semicolon)
    // again, this is the same as Rust
    r + 1
}
We could not have an implementation that just returns 2, even though that’s essentially what happens. Every operation that we have in the spec must also occur in the implementation. Moreover, this implementation would also not work:
fn min_impl() -> i64 impls min, time, space
{
    let a = 2;
    let r = if a < 1 {
        a
    } else {
        1
    };
    r + 1
}
That’s because we specified two distinct uses of the constant 2 but only have one shared use in this implementation. The Caiman Frontend could generate code in a way that would make this work; however, we didn’t want any pass to change what the user specified.
As a more complicated example, let’s see our motivating example of select sum in Caiman. For simplicity, we’re going to focus on computing summations between two bounds.
#version 0.1.0

tmln time(e: Event) -> Event { returns e }
sptl space(s: BufferSpace) -> BufferSpace { returns s }

// function equivalence class, declaring that these
// are all equivalent
feq rec_sum {
    extern(cpu) pure rec_sum_cpu(i32, i32) -> i32

    extern(gpu) rec_sum_gpu(i32, i32) -> i32 {
        path: "rec_sum_gpu.comp",
        // the number of dimensions is the dimensions of the grid
        // to dispatch this kernel to
        dimensions: 3,
        // ...
        // some details about the external
        // GPU kernel we're calling omitted
    }

    // rec sum spec from before
    val rec_sum(a: i32, b: i32) -> i32 {
        returns 0 if a > b else
                a if a == b else
                // special _dim constants are available for specifications
                // (and schedules that implement them) which are in the same
                // function class as a GPU extern
                // this is a call to the `rec_sum` function equivalence class
                rec_sum'<_dim0, _dim1, _dim2>(a + 1, b) + a
    }
}

// rec sum implementation from before, now just with
// dimension constants since the specification `rec_sum` is a member of
// the `rec_sum` equivalence class which contains a GPU extern
// that takes 3 dimensional arguments
fn rec_sum_local(a: i32, b: i32) -> i32 impls rec_sum, time, space {
    if a > b {
        0
    } else if a == b {
        a
    } else {
        rec_sum_local'<_dim0, _dim1, _dim2>(a + 1, b) + a
    }
}

// value spec for select sum
val select_sum(a: i32, b: i32, c: i32) -> i32 {
    // the specification invokes the equivalence class
    // `feq rec_sum`, and an implementation can choose to invoke
    // `rec_sum_cpu`, `rec_sum_gpu`, or an implementation
    // of the spec `rec_sum`
    // as such, we must specify the grid size since some
    // implementation may choose to invoke the GPU version
    x :- rec_sum'<1, 1, 1>(a, b)
    returns rec_sum'<1, 1, 1>(b, c) if x < 0 else x
}

// `selectSum` in our motivating example
fn select_sum_local(a: i32, b: i32, c: i32) -> i32 impls select_sum, time, space {
    let x = rec_sum_local'<1, 1, 1>(a, b);
    if x < 0 {
        rec_sum_local'<1, 1, 1>(b, c)
    } else {
        x
    }
}
// timeline spec for a select sum with two encodings, submissions
// and synchronizations in parallel on every control path
tmln double_sync(e: Event) -> Event {
    // encode_event, submit_event, and sync_event
    // are special functions provided in the timeline language
    // the first argument and first return of a timeline spec
    // are implicit events. This can be thought of as tracking the
    // latest thing that the coordinator knows about.
    loc1, rem1 :- encode_event(e)
    sub :- submit_event(rem1)
    loc2, rem2 :- encode_event(loc1, sub)
    sub2 :- submit_event(rem2)
    f1 :- sync_event(loc2, sub)
    f2 :- sync_event(f1, sub2)
    returns f1
}
// `selectSumGPU`
fn select_sum_gpu(a: i32, b: i32, c: i32) -> i32
    // we still implement `select_sum` and `space`, but
    // now we implement `double_sync` instead of `time`
    impls select_sum, double_sync, space
{
    // encoding requires we supply a reference, and we cannot
    // take references of constants so we copy the variables here
    var a = a;
    var b = b;
    var c = c;
    // start encoding a command to the GPU
    let e = encode-begin gpu;
    // encode a copy command which will copy a reference to a local variable
    // to the device we are encoding commands to
    encode e.copy[a_gpu <- &a];
    encode e.copy[b_gpu <- &b];
    // encode the invocation to the `rec_sum_gpu` kernel
    // this kernel has 3 dimensions, x, y, and z. The
    // '<1, 1, 1> here is setting the grid dimensions
    // of the kernel dispatch to be a 1x1x1 grid
    // Basically, we're computing this on the GPU but not
    // actually using the data parallelism of the GPU.
    encode e.call[x_gpu <- rec_sum_gpu'<1, 1, 1>(a_gpu, b_gpu)];
    // dispatch the buffered commands and get a fence that we can
    // wait on to get the result
    // the commands we encode don't take effect until they're
    // submitted. Submitted commands are executed concurrently
    let f1 = submit e;
    // make another encoding
    // each encoder has its own scope, so `b_gpu` doesn't conflict with
    // `b_gpu` from the earlier encoder
    let e2 = encode-begin gpu;
    encode e2.copy[b_gpu <- &b];
    encode e2.copy[c_gpu <- &c];
    encode e2.call[x_gpu <- rec_sum_gpu'<1, 1, 1>(b_gpu, c_gpu)];
    let f2 = submit e2;
    // synchronize on the first result and copy `x_gpu` from the
    // GPU back to the local device
    let x = (await f1).x_gpu;
    // synchronize on the second result
    let r = (await f2).x_gpu;
    if x < 0 {
        r
    } else {
        x
    }
}

pipeline main { select_sum_gpu }
Caiman uses a single quote followed by angled brackets for setting the grid and block sizes of a kernel dispatch. This syntax came about since it’s not ambiguous like C++’s template argument syntax and is (in my opinion) less ugly than CUDA’s triple angle brackets. Although not shown here, the single quote followed by angled brackets or a single quote followed by a type is the general syntax approach taken for things that can be thought of as “template arguments”.
GPU variables are syntactically and semantically in another scope. This prevents a user from mixing up local and encoded variables. For example, in select_sum_gpu, a user could declare a new variable a_gpu, which would not shadow the existing a_gpu that is encoded. Furthermore, b_gpu from e2 doesn’t conflict with b_gpu from encoder e, since the two encoders effectively have different scopes. The idea here is to make the encoded commands almost like a tiny DSL embedded into Caiman. The slightly different syntax emphasizes the different semantics of encoding a command to be executed later on another device versus executing something immediately on the local device.
One design decision I made is that an encoded call must use arguments that have already been encoded into the command buffer with e.copy[], and expressions on the right-hand side of <- cannot be nested. This isn’t strictly necessary, as the compiler could insert copies for you, but I didn’t like the implicit nature of that. For example, suppose the compiler did insert copies as needed; then we wouldn’t need an encode copy construct at all. Looking at something like encode e[y <- kernel'<x, y, z>(local_buffer, g + y)], a user might not realize that local_buffer is copied over to another device. More importantly, it becomes unclear whether g and y are copied over and then added on the other device, or whether g + y is computed locally and then the result is copied.

For e.copy, nested expressions are allowed on the right-hand side of <-, in which case they are flattened on the local device and the final result is copied. I found this acceptable since the idea of a copy is taking something from the local device and putting it into something on the remote device.
A timeline spec has three different types: encoders, fences, and events. Events are essentially local events that the coordinator is aware of, encoders are used to encode commands to a remote device, and fences are single-use synchronization points between two devices.
When writing select_sum_gpu in this example, I actually made a mistake and switched the branches of the if, which was quickly caught by the compiler, which provides linting to VS Code via a custom language extension.
The Caiman Frontend uses immutability as the default, but it also provides references and variables that allow mutation. These features hide the allocation of temporaries that is made explicit in the Caiman Assembly.
Variables in Caiman are mutable values, but since they are implemented as a location on the stack, a dereference pass is used to insert loads and stores to the variable’s location on the stack as needed. This is done via a dataflow analysis which inserts a load right before the first use of a variable since it was last written.
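The load-insertion idea can be sketched over a single straight-line block: walk the instructions and insert a load of the stack-allocated variable right before its first use since it was last stored. The instruction shapes here are hypothetical stand-ins, not the real HIR.

```python
def insert_loads(block, var):
    out = []
    needs_load = True  # no up-to-date copy of `var` in a register yet
    for instr in block:
        kind = instr[0]
        if kind == "use" and var in instr[1:]:
            if needs_load:
                out.append(("load", var))  # first use since the last store
                needs_load = False
        out.append(instr)
        if kind == "store" and instr[1] == var:
            needs_load = True  # any loaded copy is stale after a store
    return out

block = [("store", "v"), ("use", "v"), ("use", "v"), ("store", "v"), ("use", "v")]
for i in insert_loads(block, "v"):
    print(i)
```

Note the second consecutive use needs no load: the value is already available. The real pass does this as a dataflow analysis over the whole CFG rather than one block.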
Using variables, we can compute the minimum of two numbers like so:
val min(a: i64, b: i64) -> i64 {
    returns a if a < b else b
}

fn min_impl(a: i64, b: i64) -> i64
    impls min, time, space
{
    var v;
    if a < b {
        v = a;
    } else {
        v = b;
    }
    v
}
The Frontend supports taking references of variables and uses C/C++/CUDA syntax, as this is likely what a hypothetical Caiman user would be very familiar with:
val main() -> i64 {
    c :- 50
    returns one(c)
}

val one(a: i64) -> i64 {
    returns a + 1
}

fn main_func() -> i64 impls main, time, space
{
    var x = 50;
    *one_func(&x)
}

fn one_func(a: &i64) -> &i64 impls one, time, space
{
    // load the value from `a`, add 1,
    // then store it back into `a`
    *a = *a + 1;
    // return the reference
    a
}
One thing to note is that the specification one doesn’t know anything about the fact that the implementation, one_func, uses references. We could have written an equally good implementation without references. This still type-checks because &i64 refines the i64 specified in the specification.
The Caiman type system does not allow a reference to be captured across a function call. For example, the following implementation would not be allowed:
fn foo() -> i64 impls // ...
{
    let x: &i64 = // ...
    bar();
    *x + 1
}
Even though variables are references under the hood, the Caiman Frontend does let you do this with variables. To support this, the Frontend inserts loads and stores so that a value, not a reference, is captured. So under the hood, we convert code like this:
fn foo() -> i64 impls // ...
{
    var x = // ...
    bar();
    x + 1
}
to
fn foo() -> i64 impls // ...
{
    var x = // ...
    let _x = x;
    // _x, which is a value, is captured across the function call
    bar();
    var x = _x;
    x + 1
}
As alluded to earlier, one part of this work was designing a new higher-level intermediate representation (HIR). This IR served as a stepping stone between the frontend abstract syntax tree (AST), and the Caiman Assembly.
I chose a CFG and three-address-code representation, a common design where the program is split into basic blocks of instructions; each instruction performs a single operation (typically having two arguments and a destination, hence the three addresses in three-address code), and control flow appears only as the last instruction of a block. Control flow between basic blocks is represented as edges in a control flow graph.
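The shape of such an IR can be sketched with a few small data structures. Field names here are illustrative, not the actual Caiman HIR; the hand-lowered blocks model the earlier min example.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    dest: str     # destination "address"
    op: str       # a single operation per instruction
    args: tuple   # at most two argument "addresses"

@dataclass
class BasicBlock:
    instrs: list       # straight-line three-address instructions
    terminator: tuple  # e.g. ("br", cond, then_id, else_id) or ("ret", val)

# min lowered by hand: compute a < b, branch, assign v in each arm, join
cfg_blocks = {
    0: BasicBlock([Instr("c", "lt", ("a", "b"))], ("br", "c", 1, 2)),
    1: BasicBlock([Instr("v", "copy", ("a",))], ("jmp", 3)),
    2: BasicBlock([Instr("v", "copy", ("b",))], ("jmp", 3)),
    3: BasicBlock([], ("ret", "v")),
}
```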
Like the Caiman Assembly, this IR makes loads and stores explicit, which enables the aforementioned implicit load insertion pass. It is also in static single assignment (SSA) form, where each variable is assigned only once.
The HIR also includes a generalized dataflow solver, which can be parameterized on various dataflow analyses.
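Such a solver can be sketched as a worklist algorithm parameterized on a transfer function and a meet operator. This is a hedged illustration (a forward analysis with hypothetical signatures), not the HIR’s actual solver.

```python
def solve_forward(cfg, transfer, meet, bottom):
    """cfg maps each block to its successor list; returns the out-fact
    of each block at the fixed point."""
    preds = {b: [] for b in cfg}
    for b, succs in cfg.items():
        for s in succs:
            preds[s].append(b)
    out_facts = {b: bottom for b in cfg}
    worklist = list(cfg)
    while worklist:
        b = worklist.pop()
        # meet the facts flowing in from all predecessors
        in_fact = meet([out_facts[p] for p in preds[b]]) if preds[b] else bottom
        new_out = transfer(b, in_fact)
        if new_out != out_facts[b]:
            out_facts[b] = new_out
            worklist.extend(cfg[b])  # successors must be revisited
    return out_facts

# Instantiating it as a reaching-definitions-style analysis:
defs = {"entry": {"x"}, "then": {"y"}, "exit": set()}
cfg = {"entry": ["then", "exit"], "then": ["exit"], "exit": []}
out = solve_forward(cfg,
                    transfer=lambda b, f: f | defs[b],
                    meet=lambda fs: set().union(*fs),
                    bottom=set())
print(sorted(out["exit"]))  # ['x', 'y']
```

Swapping in a different transfer/meet pair yields a different analysis (e.g. live variables runs the same loop backward), which is the point of parameterizing the solver.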
A major feature that makes writing Caiman programs bearable is type inference. So far, all of the examples we’ve seen have used type inference to avoid writing down the verbose types in Caiman. A type in a Caiman specification is just the base type (what most programmers would think of when they hear “type”), but a type in a Caiman implementation includes these components:
- The base type, such as i64, &i32, or bool. Base types can also carry flags: i64'<storage, map_read> denotes an i64 that is held on the GPU and mapped for reading from the local device.
- The quotient, which associates a piece of the implementation with a variable in a specification. For example, node(val.x) ties a piece of the implementation to the variable x in the value specification, while input(tmln.y) ties a piece of the implementation to the input variable y in the timeline specification.
- The flow, which describes how a resource may be used: usable denotes that a resource can be consumed, while dead denotes that the resource is empty.

The tags of a variable can change as it is assigned to or passed between
basic blocks. Furthermore, a tag is not just applied to variables but is also used to “type” control flow such as ifs and function calls. For example, here is the min example from earlier with explicit type annotations for everything (tags are denoted after the @ symbol):
val min(a: i64, b: i64) -> out: i64 {
    c: bool :- a < b
    ret: i64 :- a if c else b
    returns ret
}

fn min_impl(
    a: i64 @ [input(val.a)-usable, none(tmln)-usable, none(sptl)-usable],
    b: i64 @ [input(val.b)-usable, none(tmln)-usable, none(sptl)-usable]) ->
    i64 @ [node(val.out), none(tmln)-usable, none(sptl)-usable]
    impls min, time, space
{
    var v: i64 @ [none(val)-dead, none(tmln)-usable, none(sptl)-saved];
    let c: bool @ [node(val.c)-usable, none(tmln)-usable, none(sptl)-usable]
        = a < b;
    if @ [node(val.ret), none(sptl), none(tmln)] c {
        v = a;
        @out { v: [node(val.a)-usable, none(tmln)-usable, none(sptl)-saved] };
    } else {
        v = b;
        @out { v: [node(val.b)-usable, none(tmln)-usable, none(sptl)-saved] };
    }
    @in { v: [node(val.ret)-usable, none(tmln)-usable, none(sptl)-saved] };
    v
}
You don’t have to understand what these types mean; the key thing to note is that they are time-consuming to write by hand and somewhat non-trivial: in this example, v changes its quotient and flow three times.
The type inference algorithm I implemented is basically Hindley-Milner. My implementation is a tree matching over an abstract set of type nodes. Like most things in PL, we can think of a type as a tree. For example, the type i32 could be represented as a tree with a root node for int and a child leaf for the width of the integer:
int
|
|
i32
The only difference between working with base types and working with quotients is the set of vertices. So, by abstracting away this set we can have the same code work for both data type deduction and quotient deduction. Flows are deduced separately, via a dataflow analysis. Certain instructions change the flow predictably, and the new flow stays until a later instruction in the program changes the flow again.
Deducing quotients is a bit more involved, and since the base type inference is a strict subset of the quotient deduction, we’ll continue our discussion with the quotient deduction.
Specifically, the implementation of the type inference algorithm is based on e-graphs. We have a union-find data structure where the canonical representative of an equivalence class is mapped to an e-class id, in this case, these IDs will be variables in a specification.
Since tags can change as we write to references or update variables, we first convert the program to a reference SSA form, where variables are assigned and references stored, only once. This gives every value of a variable a distinct name we can use to refer to it during type inference. Then we scan through each spec building up an e-graph. Next, we scan through the implementations and build up a disconnected e-graph. Eventually, something will ensure that a part of the implementation matches a part of the spec and we’ll unify the two graphs.
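The unification at the heart of this process can be sketched with a classic recursive matcher over type trees. This is an illustration under simplifying assumptions, not the Frontend’s e-graph implementation: nodes are (label, children...) tuples, strings starting with "?" are unification variables standing in for not-yet-known pieces, and the e-class bookkeeping and occurs check are omitted.

```python
def resolve(t, subst):
    while isinstance(t, str) and t.startswith("?") and t in subst:
        t = subst[t]
    return t

def unify(a, b, subst):
    a, b = resolve(a, subst), resolve(b, subst)
    if a == b:
        return True
    if isinstance(a, str) and a.startswith("?"):
        subst[a] = b
        return True
    if isinstance(b, str) and b.startswith("?"):
        subst[b] = a
        return True
    if isinstance(a, tuple) and isinstance(b, tuple) \
            and a[0] == b[0] and len(a) == len(b):
        # same label: recursively unify corresponding children
        return all(unify(x, y, subst) for x, y in zip(a, b))
    return False  # mismatched labels, e.g. Input(x) vs Mul(Input(x), 2)

# spec side: ret :- left if d else right, with the pieces still unknown
spec = ("Return", ("Select", "?d", "?left", "?right"))
impl = ("Return", ("Select", ("Lt", ("Input", "x"), ("Lit", 0)),
                   ("Mul", ("Input", "x"), ("Lit", 2)),
                   ("Sub", ("Input", "x"), ("Lit", 2))))
subst = {}
assert unify(spec, impl, subst)
print(subst["?d"])  # ('Lt', ('Input', 'x'), ('Lit', 0))
```

Unifying the two Return nodes forces their Select children to unify, which in turn resolves the unknowns, mirroring how unifying the spec and implementation e-graphs propagates downward.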
Consider the following pseudo-Caiman program:
val foo(x: i64) -> out: i64 {
    d :- x < 0
    left :- x * 2
    right :- x - 2
    ret :- left if d else right
    returns ret
}

fn foo_impl(x: i64) -> i64 {
    let d = x < 0;
    if d {
        let left = x * 2;
        let _out.0 = left;
    } else {
        let right = x - 2;
        let _out.1 = right;
    }
    // phi nodes are not something a user can use,
    // they exist only in the HIR
    let ret = phi _out.0 _out.1;
    let _out.2 = ret;
    _out.2
}
Going line by line, if we look at the line d :- x < 0 in the spec, we know that the spec variable $d is going to map to an expression like Lt(x, 0). The dollar sign here denotes the id of an equivalence class, whereas a name without a dollar sign just represents a variable. We also know that $x is the equivalence class id of the input variable x in the spec. This gives the following graph, depicting two equivalence classes, where dotted lines represent nodes in the same equivalence class and solid edges connect parent nodes in the type expression tree to their children:
Now when we look at the child of Lt(x, 0), namely x, we can unify these two equivalence classes to produce a graph like so:
Now the variable x is equivalent to the node Input(x) with the e-class id of $x, so we can represent this as so:
Continuing this process for the spec and implementation, we reach the following e-graph where the spec is depicted on the left and the implementation on the right and class ids are denoted with a “$” whereas variables are denoted with a “%”.
At this point, we know that the variable %_out.2 must be equivalent to the spec variable $out since %_out.2 is returned from the implementation and $out is the spec’s return value. So we can unify the two classes as shown:
But once we do that, we know that the Return type expression of the implementation is equivalent to the Return type expression in the specification (depicted by the existence of a path between these two nodes consisting solely of dotted edges). So, we can unify the corresponding children of these nodes, in this case the Select nodes:
We can keep recursively applying this transformation until we get the following graph:
Now, if we were to look up the class id of the implementation variable %d, we see that it is in an equivalence class with the id $d. Thus, the quotient for %d must be node(val.d). Unification can fail if we attempt to unify two nodes that do not match, for example, if we try to unify Input(x) with Mul(Input(x), 2). If this happens, we give the user an error; otherwise, we continue and use the e-graph to fill in annotations in our program. This makes the tags easily accessible for the final lowering pass into Caiman Assembly, which doesn’t happen while the program is in SSA form.
Now consider a specification like this:
val foo() -> i64 {
    res1, res2 :- bar()
    returns res1
}
and imagine an implementation which together with the above spec generates the following e-graph:
After unifying the return values, we might get something that looks like this:
Notice how the implementation variable %b doesn’t match up with anything in the specification. However, we can also observe that the specification contains only one subtree of the form Extract(1, bar()), which is in the class with id $res2. So, if we come across a matching subtree in the implementation, then since there is a unique, isomorphic tree in the specification, we know that those subtrees have to match. This automatically gives us that %b is equivalent to $res2, by the logic that there is nothing else it could possibly be equivalent to.
This unification strategy is a more bottom-up approach as it starts with nodes that must be equivalent and uses that equivalence to build up equivalences between larger trees. The earlier approach I described is a more top-down unification style as it starts with two trees that must be equivalent and uses that to infer equivalences of subtrees.
The Caiman Frontend does both. It starts with a top-down unification pass, after which it continually unifies subtrees that have unique matches in a bottom-up style until every implementation variable either belongs to an e-class or has multiple potential matches. This allows an implementation of the previous value spec to be written without any annotations.
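The bottom-up step can be sketched as a uniqueness check: if an implementation subtree has exactly one structurally matching subtree anywhere in the spec, the two must be equivalent. Trees here are (label, children...) tuples, and the class table and helper names are hypothetical.

```python
def subtrees(t):
    yield t
    if isinstance(t, tuple):
        for child in t[1:]:
            yield from subtrees(child)

def unique_match(spec_exprs, spec_classes, impl_subtree):
    matches = [s for e in spec_exprs for s in subtrees(e)
               if s == impl_subtree]
    if len(matches) == 1:
        return spec_classes[matches[0]]
    return None  # zero or several candidates: no safe inference

# res1, res2 :- bar() gives the spec Extract(0, ...) and Extract(1, ...) nodes
spec_exprs = [("Extract", 0, ("Call", "bar")),
              ("Extract", 1, ("Call", "bar"))]
spec_classes = {spec_exprs[0]: "$res1", spec_exprs[1]: "$res2"}
# the implementation's %b only matches Extract(1, bar())
print(unique_match(spec_exprs, spec_classes, ("Extract", 1, ("Call", "bar"))))
```

When several spec subtrees match (as with the three 1 constants in the example further below), the helper returns nothing and the user must annotate, matching the Frontend’s refusal to make arbitrary choices.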
A similar process happens for type-checking base types and flags, except they don’t need e-class ids. Moreover, since the base types and flags of a variable cannot change, this form of type inference is performed at the AST level, before the program is lowered into the HIR. Since passes like quotient deduction and load insertion require knowing every variable’s base type, the data type inference needs to happen sooner. While it could be done on the HIR, doing it at the AST level allows stripping away data type annotations when lowering the AST to the HIR, simplifying the HIR.
There are a few notable situations where a user must supply type annotations; they boil down to cases where the compiler would otherwise have to guess, such as a function call with no matching call in the spec, or isomorphic unconstrained subtrees in the spec. The examples below walk through these.
These restrictions prevent type inference from becoming a backtracking algorithm that must search through an exponentially large search space. Further, they also enable a type inference algorithm that won’t make arbitrary choices that a user might not desire. Other than these situations, a user can get away with letting the compiler figure out the types for everything else that’s currently implemented in the language.
Regarding a function call that doesn’t have a matching call in the spec, here’s an example of implementing one spec across two implementation functions:
// ...
val foo(x: i64) -> i64 {
    g :- x + 10
    h :- g * 2
    k :- h + g
    returns k / 2
}

fn foo1(z: i64) -> i64 impls foo, time, space {
    // annotation of result of function call required here,
    // note there is no call in the spec of `foo`
    let g @ node(val.g), h @ node(val.h) = foo2(z);
    (h + g) / 2
}

fn foo2(a: i64 @ node(val)) ->
    (i64 @ node(val.g), i64 @ node(val.h))
    impls foo, time, space
{
    let g = a + 10;
    (g, g * 2)
}

pipeline main { foo1 }
Another form of this is how one could implement a function that uses output parameters:
val main() -> i64 {
    returns one()
}

val one() -> i64 {
    returns 10
}

fn main_func() -> i64 impls main, time, space
{
    var x;
    *one_func(&x)
}

// the spec `one` doesn't have an argument that `a` corresponds to,
// thus its quotient is `none`. Its flow is `dead` because there is no
// value contained within the variable when it's passed to the function
fn one_func(a: &i64 @ none(val)-dead) -> &i64
    impls one, time, space
{
    *a = 10;
    a
}
Notice how the call to one_func
from main_func
doesn’t need any annotations,
despite the fact that we are trying to unify one_func(%x)
with
one()
. The classic unification algorithm would fail here, but we support this.
If a call in the schedule is unifying with a call in the spec that has `n` fewer arguments, we assume that the first `n` arguments that are unconstrained during the top-down pass are `none`, and drop them during unification. This may seem strange, but note that in the top-down pass, the only way for the arguments to be unified before the function call is if they have user-supplied annotations or are used in something that was given a type annotation by the user. So this system allows a user to precisely control this behavior by providing type annotations.
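As a sketch of this rule, here is a minimal Python model (not the real compiler, which operates on the HIR) of dropping the first `n` unconstrained arguments during call unification. The `constraints` map, the string-valued quotients, and `unify_call` itself are hypothetical names for illustration.

```python
def unify_call(sched_args, spec_args, constraints):
    """Unify a schedule call's arguments with a spec call that has
    fewer arguments. `constraints` maps schedule argument names to
    their quotient; None means unconstrained after the top-down pass.
    Toy model for illustration only."""
    n_extra = len(sched_args) - len(spec_args)
    kept, dropped = [], []
    for arg in sched_args:
        # The first n_extra still-unconstrained arguments are assumed
        # to have quotient `none` and are dropped from unification.
        if len(dropped) < n_extra and constraints.get(arg) is None:
            constraints[arg] = "none"
            dropped.append(arg)
        else:
            kept.append(arg)
    if len(kept) != len(spec_args):
        raise TypeError("cannot reconcile call arities")
    # Pairwise unification of the remaining arguments.
    for arg, spec in zip(kept, spec_args):
        if constraints.get(arg) not in (None, spec):
            raise TypeError(f"conflict on {arg}")
        constraints[arg] = spec
    return constraints

# `one_func(&x)` unifying with `one()`: the lone argument is dropped.
c = unify_call(["a"], [], {})
# c == {"a": "none"}
```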
Here’s an example of a function that has isomorphic unconstrained subtrees:
val foo(x: i64) -> i64 {
a :- 1
b :- 1
c :- 1
returns x
}
fn foo_impl(z: i64) -> i64 impls foo, time, space {
// no way to tell which `1` we mean, so we need annotations
// type inference will not make arbitrary choices
// (except for assuming that inputs and outputs match up with the spec
// unless otherwise specified)
let x @ node(val.a) = 1;
let y = 1 @ node(val.b);
let h = 1 @ node(val.c);
z
}
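To make the ambiguity concrete, here is a toy Python model of why inference refuses to choose: several structurally identical spec nodes match the same schedule literal, so there is no principled way to pick one without an annotation. The `ambiguous_matches` helper and the tuple encoding of spec nodes are invented for illustration; the real implementation works over e-classes.

```python
def ambiguous_matches(spec_nodes, literal):
    """Return the names of spec nodes whose structure matches a
    schedule literal. More than one match means the schedule needs
    an annotation: inference will not pick arbitrarily."""
    return [name for name, node in spec_nodes.items() if node == literal]

# the spec of `foo` above: three structurally identical nodes
spec = {"a": ("const", 1), "b": ("const", 1), "c": ("const", 1)}
matches = ambiguous_matches(spec, ("const", 1))
# three isomorphic candidates, so `let x = 1;` alone is ambiguous
```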
Let’s look at a simple program that uses the timeline spec:
extern(gpu) simple(x : i32) -> out: i32
{
dimensions: 3,
// ..
}
extern(gpu) simple2(x : i32) -> out: i32
{
dimensions: 3,
// ..
}
// name the output parameter so
// we can refer to it in a type annotation
// v
val foo(x: i32, b: bool) -> out: i32
{
y :- simple'<1, 1, 1>(x)
y2 :- simple2'<1, 1, 1>(x)
ret :- y if b else y2
returns ret
}
tmln foo_time(e: Event) -> out: Event {
loc1, rem1 :- encode_event(e)
sub :- submit_event(rem1)
snc :- sync_event(loc1, sub)
returns snc
}
sptl space(bs: BufferSpace) -> out: BufferSpace {
returns bs
}
fn foo_impl(x: &i32, b: bool) -> i32 impls space, foo_time, foo {
let e = encode-begin gpu;
encode e.copy[x_gpu <- x];
let f = if b {
encode e.call[y_gpu <- simple'<1, 1, 1>(x_gpu)];
submit e
} else {
encode e.call[y_gpu <- simple2'<1, 1, 1>(x_gpu)];
submit e
};
(await f).y_gpu
}
You might be wondering what the type of `f` is, or how we know that when we synchronize on `f`, we can read from the variable `y_gpu`. Well, the Caiman Frontend has record types! Although, admittedly, the only way to construct a record is by encoding commands. The base type of `e` is actually `Encoder'{y_gpu: i32'<storage, map_read>, x_gpu: i32'<storage, copy_dst>}`, the base type of `f` is `Future'{y_gpu: i32'<storage, map_read>, x_gpu: i32'<storage, copy_dst>}`, and the type of the await expression is `{y_gpu: i32}`. Recall that the syntax `Future'R` in Caiman is essentially like calling a type “function” with the type argument `R` in a higher-kinded type system, or, in C++ and Java, something like `Future<R>`. Further note that `i32'<storage, map_read>` is syntax denoting an `i32` with the flags `storage` and `map_read`. These flags control where the memory allocation is placed. For example, the flag `map_read` indicates the buffer is readable by the local device.
Let’s just look at these three statements for now:
let e = encode-begin gpu;
encode e.copy[x_gpu <- x];
// ..
encode e.call[y_gpu <- simple'<1, 1, 1>(x_gpu)];
// ..
First observe that records, along with their regular subtype relation ($\preceq$) define a complete lattice. Thus, given two records, $A$ and $B$, we can compute their meet, $C = A \sqcap B$, and we know that $C \preceq A$ and $C \preceq B$. The meet of $A$ and $B$ will be a new record type that has the fields of both $A$ and $B$, provided that any shared field has the same type in both records. By the regular subsumption rule, if we can show that an expression, $e$ has type $\tau$, and that $\tau \preceq \tau’$, then $e$ has type $\tau’$.
So looking at `let e = encode-begin gpu;`, we start by generating the constraint that `e` has type `Encoder'{}` (an encoder of the empty record). Then looking at `encode e.copy[x_gpu <- x];`, we also get the constraint (ignoring flags) that `e` has type `Encoder'{x_gpu: i32}`. Let’s call $\tau_1 = \{\}$ and $\tau_2 = \{x\_gpu: i32\}$. Now, the standard unification algorithm would fail when unifying $\tau_1$ with $\tau_2$, but our implementation will instead meet the two constraints to produce the type `Encoder'{x_gpu: i32}`. By subsumption, if `e` has type `Encoder'{x_gpu: i32}`, then it also has type `Encoder'{}`, satisfying all constraints. We can do this again for `encode e.call[y_gpu <- simple'<1, 1, 1>(x_gpu)];` to get that the inferred type of `e` is `Encoder'{x_gpu: i32, y_gpu: i32}`, ignoring flags.
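The meet operation described above can be sketched in a few lines of Python, assuming records are modeled as plain dicts from field names to types (a simplification of the real e-graph-based implementation):

```python
def meet(a, b):
    """Meet of two record types under width subtyping, where a record
    with more fields is the subtype. Shared fields must agree on
    their type; the result has the union of all fields."""
    out = dict(a)
    for field, ty in b.items():
        if field in out and out[field] != ty:
            raise TypeError(f"field {field}: {out[field]} vs {ty}")
        out[field] = ty
    return out

# the constraints generated for `e` in the example, ignoring flags:
t1 = {}                 # from `let e = encode-begin gpu;`
t2 = {"x_gpu": "i32"}   # from `encode e.copy[x_gpu <- x];`
t3 = {"y_gpu": "i32"}   # from `encode e.call[y_gpu <- ...];`
inferred = meet(meet(t1, t2), t3)
# inferred == {"x_gpu": "i32", "y_gpu": "i32"}
```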
Now this game of meeting constraints doesn’t work in all situations. Consider this function:
type R = { v: i32'<storage, map_read> };
// records in type signatures must be annotated, i.e., we can't just
// say `Encoder`. Furthermore, record fields are passed to and returned
// from a function in the order they are defined in the type definition.
fn bar_impl(e: Encoder'R) -> i32 impls bar_time, bar, space
{
let f = submit e;
f.v
}
Consider that instead of accessing `f.v`, we access `f.g`. This would generate the constraint that `f` has type `Future'{g: i32}`. Since `f` is the future returned by submitting `e`, we also get the constraint that `e` has type `Encoder'{g: i32}`. Now, based on the function signature, we already have the constraint that `e` has type `Encoder'{v: i32}`, so if we do the meet trick again, we’d get that `e` has type `Encoder'{v: i32, g: i32}`. But since the function signature accepts an `Encoder'{v: i32}`, a user might not give us an `Encoder'{v: i32, g: i32}`! The problem here is that function arguments are contravariant. In other words, the lowest type in the lattice that `e` could be is `Encoder'{v: i32}`. Our implementation uses a special flag to denote contravariant constraints that cannot be unified to a lower type in the lattice.
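A minimal Python sketch of this contravariant flag, again modeling records as dicts: a constraint originating from a function parameter is treated as rigid, and meeting it with a constraint that demands extra fields is an error rather than a lowering. The name `meet_with_rigid` and the dict representation are assumptions for illustration.

```python
def meet_with_rigid(rigid, other):
    """Meet where `rigid` comes from a contravariant position (a
    function parameter) and may not be lowered in the lattice. Every
    field `other` demands must already be present in `rigid` with the
    same type; otherwise a caller might pass a value lacking it."""
    for field, ty in other.items():
        if rigid.get(field) != ty:
            raise TypeError(f"parameter type lacks field {field}: {ty}")
    return rigid

sig = {"v": "i32"}  # `e: Encoder'{v: i32}` from the function signature
meet_with_rigid(sig, {"v": "i32"})   # accessing f.v is fine
# meet_with_rigid(sig, {"g": "i32"}) would raise: f.g is unjustified
```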
This system allows us to pass subtypes to functions and return subtypes from functions. For example:
type R = { foo: i32'<storage, map_read>, baz: i32'<storage, map_read> };
type T = { foo: i32 };
fn foo_impl(x: &i32) -> T impls foo_time, foo, space {
let e = encode-begin gpu;
encode e.copy[x_gpu <- x];
encode e.call[foo <- simple'<1, 1, 1>(x_gpu)];
encode e.call[baz <- simple'<1, 1, 1>(x_gpu)];
// ignoring flags, e has type Encoder'{x_gpu: i32, foo: i32, baz: i32}
// and we can pass it to `bar_impl` which accepts types of
// Encoder'{foo: i32, baz: i32}
combine_impl(await bar_impl(e))
}
fn bar_impl(e: Encoder'R) -> Future'R impls bar_time, bar, space
{
submit e
}
fn combine_impl(r: R) -> T impls combine_time, space, combine {
// we get a {foo: i32, baz: i32}
// and we can use it to return a {foo: i32}
r
}
One (perhaps questionable) design decision is that the flags are only relevant when the record type is part of an `Encoder` or `Future`. So in `combine_impl`, there’s no difference between `R` and `{foo: i32, baz: i32}`.
Speaking of flags, what I described is a slightly simplified model of the real implementation. In truth, each `Encoder` and `Future` has separate record types for the defined variables and for each flag. So `e` would actually have a type represented by something like:
Encoder'{
defined: {foo: i32, baz: i32, x_gpu: i32},
storage: {foo: i32, baz: i32, x_gpu: i32},
map_read: {foo: i32, baz: i32},
copy_dst: {x_gpu: i32},
map_write: {},
// ...
}
The algorithm keeps track of a list of side conditions which enforces that
the defined
record of every encoder and future is a subtype of the records
for all the flags. These side conditions are checked when
a new constraint is added. This catches errors like reading from a variable
before it was defined.
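These side conditions can be sketched as follows, modeling an encoder’s type as a dict from `defined` and each flag to a record, as in the example above. The helper `check_side_conditions` is a hypothetical name; the real compiler re-checks its side conditions whenever a constraint is added.

```python
def check_side_conditions(enc):
    """Check that the `defined` record of an encoder/future is a
    subtype of each flag record: every field a flag mentions must
    already be defined with the same type. Toy model of the side
    conditions described above."""
    defined = enc["defined"]
    for flag, record in enc.items():
        if flag == "defined":
            continue
        for field, ty in record.items():
            if defined.get(field) != ty:
                raise TypeError(
                    f"{field} used with flag {flag} before it is defined")

e = {
    "defined":  {"foo": "i32", "baz": "i32", "x_gpu": "i32"},
    "storage":  {"foo": "i32", "baz": "i32", "x_gpu": "i32"},
    "map_read": {"foo": "i32", "baz": "i32"},
    "copy_dst": {"x_gpu": "i32"},
}
check_side_conditions(e)  # passes: every flagged field is defined
```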
Taking a step back, internally, this system represents the type of a record as a set of constraints, and we unify the constraints to produce the most general type that satisfies all of them. Without putting too much thought into it, it seems that a similar system could potentially work for a language that has the look and feel of an untyped language like Python, but provides static typing guarantees by inferring the set of constraints on each variable. For example, some constraint might be that $\tau$ has the operator $+ : \tau \times \tau \rightarrow \tau’$ defined on it. Such information could probably be useful for a JIT compiler as well.
Overall, this project has made it possible for a user to actually write Caiman programs. This has allowed greater potential for future use and improved test coverage of all parts of the compiler, exposing multiple bugs.
Although the assembly was not designed to be regularly written by hand, the following chart compares the lines of code required for a frontend and assembly implementation of the same exact test program:
While the frontend has come a long way, there are still things that can be done:
Even more far-off ideas: