Caiman is a language designed for heterogeneous programming that formalizes the communication between devices, allocations, and computations performed. This post details the current state of the Caiman Frontend and includes a brief overview of the implementation of some features such as lowering to continuation passing style and the e-graph-based type inference.
In heterogeneous programming, communication and synchronization between devices is extremely important for performance, but it’s not at all obvious what the optimal communication strategy should be. Take the following simple C program for example:
int selectSum(int * a, int * b) {
    int sum_a = sum(a);
    if (sum_a > 0) {
        return sum(b);
    } else {
        return sum_a;
    }
}
Instead of performing sum locally, a user might want to execute it on another device, perhaps a GPU. This might lead to another implementation that looks like this:
int selectSumGPU(int * a, int * b) {
    int sum_a;
    sumGPU(a, &sum_a);
    barrier();
    if (sum_a > 0) {
        int sum_b;
        sumGPU(b, &sum_b);
        barrier();
        return sum_b;
    } else {
        return sum_a;
    }
}
But wait, if we’re going to run these computations on the GPU, why don’t we run them in parallel? Let’s trade extra work for increased parallelism and have an implementation like this:
int selectSumGPU2(int * a, int * b) {
    int sum_a, sum_b;
    sumGPU(a, &sum_a);
    sumGPU(b, &sum_b);
    barrier();
    if (sum_a > 0) {
        return sum_b;
    } else {
        return sum_a;
    }
}
Or perhaps we perform one computation on the GPU and the other locally, still making the two calls to sum in parallel but reducing the amount of data transferred between the two devices:
int selectSumGPU3(int * a, int * b) {
    int sum_a;
    sumGPU(a, &sum_a);
    int sum_b = sum(b);
    barrier();
    if (sum_a > 0) {
        return sum_b;
    } else {
        return sum_a;
    }
}
We’ve now quickly come up with four different implementations of the same function, so which one is fastest? Well, that depends. If it’s very common for sum_a to be positive, then perhaps selectSumGPU2 or selectSumGPU3 would be ideal. But if the sizes of a and b are too small, then perhaps selectSum would be ideal since it doesn’t have the overhead of device synchronization and communication. The only real way to know what’s fastest for a given use case is to time them all.
If we were to write a function foo that makes two calls to selectSum, then with 4 implementation choices, that gives us 16 ways to implement foo. If another function bar calls foo twice, then we have 256 ways of implementing it just by changing which selectSum implementation we use. So determining the fastest implementation for a given workload involves exploring an exponentially large search space.
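The combinatorics above are easy to sanity-check. A minimal sketch (the `choices` helper is hypothetical, just exponentiation over interchangeable implementations):

```python
# With 4 interchangeable implementations of selectSum, a function
# making k calls to it has 4**k possible implementations.
def choices(per_call: int, calls: int) -> int:
    return per_call ** calls

foo = choices(4, 2)    # foo calls selectSum twice
bar = choices(foo, 2)  # bar calls foo twice
print(foo, bar)        # 16 256
```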
While this example might not seem that motivating, many important algorithms are composed of basic building blocks similar in spirit to selectSum.
Caiman solves this program space exploration problem in two ways.
First, it provides specification languages to specify what something computes (the value language), the synchronization and communication involved in the computation (the timeline language), and the allocations and deallocations (the spatial language). Notice in the selectSum example that each of the implementations computed the same thing and none of them had any allocations or deallocations. In Caiman, they would all have identical value and spatial specifications, but different timeline specs. This allows you to write implementations more easily by declaring which specs they implement and getting a guarantee that the implementation adheres to them. In the selectSum example, this would prevent us from implementing a selectSum that doesn’t compute the correct result.
Secondly, Caiman introduces an explicator that can use the specifications to wholly or partially generate implementations.
My work on this project has been developing a front-end user-facing language for Caiman.
When I came into this project, there was no way for someone to actually use Caiman. For testing, programs could be written in Caiman Assembly, a textual format that is a thin wrapper over Caiman’s intermediate representation (IR). Programs written in this form are extremely verbose and difficult to write. Writing a simple program to compute the minimum of two integers could take nearly 100 lines of code and a few hours to days to implement correctly by hand.
For my project, I designed and implemented a user-level language that someone could use to write Caiman programs. Some of my contributions are described in the sections that follow.
One of the things that makes writing programs in Caiman Assembly so difficult is the fact that the IR represents control flow in continuation passing style (CPS). In CPS, control flow is made explicit by each function invoking a continuation to pass control to the next function. These continuations can be thought of as “the rest of the program”. In CPS, an if would be implemented as calling one of two functions to pass control flow to.

This representation is convenient for the compiler, but unintuitive for a human to write. For example, a Rust function like this:
fn foo(x: i32) -> i32 {
    let c = if x < 0 {
        10
    } else {
        20
    };
    if x & 1 == 0 {
        c
    } else {
        c * 2
    }
}
would end up looking something like this in CPS:
let foo = |x: i32, k2: &Fn(i32)| {
    (|x: i32, k1: &Fn(i32, i32)| {
        let cond = x < 0;
        (|k3: &Fn(i32), k4: &Fn(i32)| { if cond { k3(x) } else { k4(x) } })
            (&|x: i32| k1(x, 10), &|x: i32| k1(x, 20))
    })(x, &|x: i32, c: i32| {
        let cond2 = x & 1 == 0;
        (|k5: &Fn(i32), k6: &Fn(i32)| { if cond2 { k5(c) } else { k6(c) } })
            (&|c: i32| k2(c), &|c: i32| k2(c * 2))
    })
};
There is a close correspondence between the CPS and control flow graph (CFG) representation of a program, as seen by the following control flow graph for this example program:
In the CPS example, the starting block (which I’m calling k0) uses cond to select between invoking k3 and k4, both of which pass their result to k1. Then k1 uses cond2 to select between invoking k5 and k6, and both of these pass their result to k2. This is the same structure depicted in the CFG, so it’s fairly easy to convert between CFG and CPS representations.
By using a control flow graph as an intermediary step, we can use standard compiler techniques to transform the input Caiman program into a CFG, and then transform the CFG into the CPS form of the Caiman assembly, which is the target language of the Caiman Frontend.
A typical live variables analysis over the CFG representation of the program is then used to determine the input variables to each continuation. In the example, this can be seen with how x is passed from k0 to k3 and k4, and then to k1, and how c is passed from k1 to k5 and k6, and then to k2.
The Caiman Frontend provides the standard arithmetic operators and nested expressions programmers are accustomed to from most programming languages; for example, 3 + 4 * (x / 3) performs three operations in one expression. Flattening expression trees is a standard thing for compiler frontends to do, but in Caiman, all operations must be implemented as functions on a particular device. So we lower operators to calls to standard library functions that run on the local device.
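The flattening step can be sketched as a small recursive walk. This is a hedged illustration, not the Frontend’s code: the operator names (`add`, `mul`, `div`) and temporary naming scheme are hypothetical stand-ins for the device-local standard library calls.

```python
import itertools

counter = itertools.count()

def flatten(expr, out):
    """expr is a leaf (name or constant) or a tuple (op, lhs, rhs).
    Appends one single-operation call per tree node to `out`."""
    if not isinstance(expr, tuple):
        return expr
    op, lhs, rhs = expr
    a = flatten(lhs, out)
    b = flatten(rhs, out)
    tmp = f"_t{next(counter)}"
    out.append((tmp, op, a, b))  # tmp = op(a, b)
    return tmp

# 3 + 4 * (x / 3)
prog = []
result = flatten(("add", 3, ("mul", 4, ("div", "x", 3))), prog)
for dst, op, a, b in prog:
    print(f"{dst} = {op}({a}, {b})")
```

Each line of output is one call to a (hypothetical) standard library function, with nested subexpressions replaced by temporaries.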
The Caiman Frontend also enables recursive functions. In Caiman Assembly, recursion needs yield points, which give control back to the user. The operational model of Caiman is that the user writes a pipeline, which may not have cycles. To implement loops, that pipeline would be invoked repeatedly. So a yield point gives control back to the user so they can invoke the pipeline again or get the result when the recursion stops.
The Caiman Frontend will construct a call graph and automatically insert yield points where recursion is detected, so a user doesn’t have to worry about the underlying technical details that enable recursion.
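The detection step can be sketched as cycle-finding over the call graph: any function that participates in a cycle needs a yield point. A minimal illustration (the function names and `functions_on_cycles` helper are hypothetical, not the Frontend’s actual pass):

```python
def functions_on_cycles(call_graph):
    on_cycle = set()

    def dfs(node, stack, visited):
        if node in stack:
            # everything from the first occurrence of `node` onward
            # participates in the cycle
            on_cycle.update(stack[stack.index(node):])
            return
        if node in visited:
            return
        visited.add(node)
        for callee in call_graph.get(node, []):
            dfs(callee, stack + [node], visited)

    for f in call_graph:
        dfs(f, [], set())
    return on_cycle

cg = {"main": ["rec_sum_impl"], "rec_sum_impl": ["rec_sum_impl"]}
print(functions_on_cycles(cg))  # {'rec_sum_impl'}
```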
For example, this is how one could implement a recursive sum that computes the sum from a to b in Caiman:
#version 0.1.0

// timeline spec, no synchronizations so the spec is trivial
tmln time(e: Event) -> Event { returns e }

// spatial spec, no allocations so the spec is trivial
sptl space(s: BufferSpace) -> BufferSpace { returns s }

// value spec, basically describes the actual computation being done in a way
// that's agnostic to allocations and synchronizations
val rec_sum(a: i64, b: i64) -> i64 {
    // observe that specifications invoke other specifications,
    // NOT implementations, allowing a single spec to have
    // many different implementations. So we use `rec_sum`
    // here but `rec_sum_impl` in the implementation.
    // Specifically, a specification will call a function equivalence
    // class. Any spec that isn't a member of an explicit user-defined
    // equivalence class is a member of its own equivalence class
    // which shares its name
    returns 0 if a > b else
            a if a == b else
            rec_sum(a + 1, b) + a
}

// implementation (aka the schedule); `impls` lists the specs the schedule implements
fn rec_sum_impl(a: i64, b: i64) -> i64 impls rec_sum, time, space {
    // no return keyword, the final expression is the return value
    if a > b {
        0
    } else if a == b {
        a
    } else {
        rec_sum_impl(a + 1, b) + a
    }
}

// the main pipeline is essentially the entry point to the program
pipeline main { rec_sum_impl }
Excluding whitespace and comments, this implementation takes 18 lines of code, compared to the 131 lines of code required to implement the same thing in the textual representation of the Caiman Assembly.
As this example shows, this work enables nested control flow in the specifications and implementations, as well as nested expressions such as rec_sum_impl(a + 1, b) + a, which would otherwise need to be implemented as a series of commands in Caiman Assembly.
In Caiman, if something is explicitly mentioned in the spec, then it must occur in the implementation. For example, consider this function which returns 2 in a convoluted way:
val min() -> i64 {
    returns (2 if 2 < 1 else 1) + 1
}

fn min_impl() -> i64 impls min, time, space
{
    // declare a constant
    let a = 2;
    // The Caiman Frontend allows if branches to return a value and store
    // it in a variable in the parent scope. This works like Rust
    let r = if 2 < 1 {
        a
    } else {
        1
    };
    // the return value is the final expression (no terminating semicolon)
    // again, this is the same as Rust
    r + 1
}
We could not have an implementation that just returns 2, even though that’s essentially what happens. Every operation that we have in the spec must also occur in the implementation. Moreover, this implementation would also not work:
fn min_impl() -> i64 impls min, time, space
{
    let a = 2;
    let r = if a < 1 {
        a
    } else {
        1
    };
    r + 1
}
That’s because we specified two distinct uses of the constant 2 but only have one shared use in this implementation. The Caiman Frontend could generate code in a way that would make this work; however, we didn’t want any pass to change what the user specified.
As a more complicated example, let’s see our motivating example of select sum in Caiman. For simplicity, we’re going to focus on computing summations between two bounds.
#version 0.1.0

tmln time(e: Event) -> Event { returns e }
sptl space(s: BufferSpace) -> BufferSpace { returns s }

// function equivalence class, declaring that these
// are all equivalent
feq rec_sum {
    extern(cpu) pure rec_sum_cpu(i32, i32) -> i32

    extern(gpu) rec_sum_gpu(i32, i32) -> i32 {
        path: "rec_sum_gpu.comp",
        // the number of dimensions is the dimensions of the grid
        // to dispatch this kernel to
        dimensions: 3,
        // ...
        // some details about the external
        // GPU kernel we're calling omitted
    }

    // rec sum spec from before
    val rec_sum(a: i32, b: i32) -> i32 {
        returns 0 if a > b else
                a if a == b else
                // special _dim constants are available for specifications
                // (and schedules that implement them) which are in the same
                // function class as a GPU extern
                // this is a call to the `rec_sum` function equivalence class
                rec_sum'<_dim0, _dim1, _dim2>(a + 1, b) + a
    }
}

// rec sum implementation from before, now just with
// dimension constants since the specification `rec_sum` is a member of
// the `rec_sum` equivalence class which contains a GPU extern
// that takes 3 dimensional arguments
fn rec_sum_local(a: i32, b: i32) -> i32 impls rec_sum, time, space {
    if a > b {
        0
    } else if a == b {
        a
    } else {
        rec_sum_local'<_dim0, _dim1, _dim2>(a + 1, b) + a
    }
}

// value spec for select sum
val select_sum(a: i32, b: i32, c: i32) -> i32 {
    // the specification invokes the equivalence class
    // `feq rec_sum`, and an implementation can choose to invoke
    // `rec_sum_cpu`, `rec_sum_gpu`, or an implementation
    // of the spec `rec_sum`
    // as such, we must specify the grid size since some
    // implementation may choose to invoke the GPU version
    x :- rec_sum'<1, 1, 1>(a, b)
    returns rec_sum'<1, 1, 1>(b, c) if x < 0 else x
}

// `selectSum` in our motivating example
fn select_sum_local(a: i32, b: i32, c: i32) -> i32 impls select_sum, time, space {
    let x = rec_sum_local'<1, 1, 1>(a, b);
    if x < 0 {
        rec_sum_local'<1, 1, 1>(b, c)
    } else {
        x
    }
}
// timeline spec for a select sum with two encodings, submissions
// and synchronizations in parallel on every control path
tmln double_sync(e: Event) -> Event {
    // encode_event, submit_event, and sync_event
    // are special functions provided in the timeline language
    // the first argument and first return of a timeline spec
    // are implicit events. This can be thought of as tracking the
    // latest thing that the coordinator knows about.
    loc1, rem1 :- encode_event(e)
    sub :- submit_event(rem1)
    loc2, rem2 :- encode_event(loc1, sub)
    sub2 :- submit_event(rem2)
    f1 :- sync_event(loc2, sub)
    f2 :- sync_event(f1, sub2)
    returns f1
}
// `selectSumGPU`
fn select_sum_gpu(a: i32, b: i32, c: i32) -> i32
    // we still implement `select_sum` and `space`, but
    // now we implement `double_sync` instead of `time`
    impls select_sum, double_sync, space
{
    // encoding requires we supply a reference, and we cannot
    // take references of constants so we copy the variables here
    var a = a;
    var b = b;
    var c = c;
    // start encoding a command to the GPU
    let e = encode-begin gpu;
    // encode a copy command which will copy a reference to a local variable
    // to the device we are encoding commands to
    encode e.copy[a_gpu <- &a];
    encode e.copy[b_gpu <- &b];
    // encode the invocation to the `rec_sum_gpu` kernel
    // this kernel has 3 dimensions, x, y, and z. The
    // '<1, 1, 1> here is setting the grid dimensions
    // of the kernel dispatch to be a 1x1x1 grid
    // Basically, we're computing this on the GPU but not
    // actually using the data parallelism of the GPU.
    encode e.call[x_gpu <- rec_sum_gpu'<1, 1, 1>(a_gpu, b_gpu)];
    // dispatch the buffered commands and get a fence that we can
    // wait on to get the result
    // the commands we encode don't take effect until they're
    // submitted. Submitted commands are executed concurrently
    let f1 = submit e;
    // make another encoding
    // each encoder has its own scope, so `b_gpu` doesn't conflict with
    // `b_gpu` from the earlier encoder
    let e2 = encode-begin gpu;
    encode e2.copy[b_gpu <- &b];
    encode e2.copy[c_gpu <- &c];
    encode e2.call[x_gpu <- rec_sum_gpu'<1, 1, 1>(b_gpu, c_gpu)];
    let f2 = submit e2;
    // synchronize on the first result and copy `x_gpu` from the
    // GPU back to the local device
    let x = (await f1).x_gpu;
    // synchronize on the second result
    let r = (await f2).x_gpu;
    if x < 0 {
        r
    } else {
        x
    }
}

pipeline main { select_sum_gpu }
Caiman uses a single quote followed by angled brackets for setting the grid and block sizes of a kernel dispatch. This syntax came about since it’s not ambiguous like C++’s template argument syntax and is (in my opinion) less ugly than CUDA’s triple angle brackets. Although not shown here, the single quote followed by angled brackets or a single quote followed by a type is the general syntax approach taken for things that can be thought of as “template arguments”.
GPU variables are syntactically and semantically in another scope. This prevents a user from mixing up local and encoded variables. For example, in select_sum_gpu, a user could declare a new variable a_gpu, which would not shadow the existing a_gpu that is encoded. Furthermore, b_gpu from e2 doesn’t conflict with b_gpu from encoder e, since the two encoders effectively have different scopes. The idea here is to make the encoded commands almost like a tiny DSL embedded into Caiman. The slightly different syntax emphasizes the different semantics of encoding a command to be executed later on another device versus executing something immediately on the local device.
One design decision I made is that an encoded call must use arguments that have already been encoded into the command buffer with e.copy[], and expressions on the right-hand side of <- cannot be nested. This isn’t strictly necessary, as the compiler could insert copies for you, but I didn’t like the implicit nature of that. For example, suppose the compiler did insert copies as needed; then we wouldn’t need an encode copy construct at all. Looking at something like encode e[y <- kernel'<x, y, z>(local_buffer, g + y)], a user might not realize that local_buffer is copied over to another device. More importantly, it becomes unclear whether g and y are copied over and then added on the other device, or whether g + y is computed locally and then the result is copied.

For e.copy, nested expressions are allowed on the right-hand side of <-, in which case they are flattened on the local device and the final result is copied. I found this acceptable since the idea of a copy is taking something from the local device and putting it into something on the remote device.
A timeline spec has three different types: encoders, fences, and events. Events are essentially local events that the coordinator is aware of, encoders are used to encode commands to a remote device, and fences are single-use synchronization points between two devices.
When writing select_sum_gpu in this example, I actually made a mistake and switched the branches of the if, which was quickly caught by the compiler, which provides linting to VS Code via a custom language extension.
The Caiman Frontend uses immutability as the default, but it also provides references and variables that allow mutation. These features hide the allocation of temporaries that is made explicit in the Caiman Assembly.
Variables in Caiman are mutable values, but since they are implemented as a location on the stack, a dereference pass is used to insert loads and stores to the variable’s location on the stack as needed. This is done via a dataflow analysis which inserts a load right before the first use of a variable since it was last written.
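The load-insertion idea can be sketched over a single straight-line block: walk the instructions and insert a load of the stack-allocated variable right before its first use since it was last stored. The instruction shapes here are hypothetical stand-ins, not the real HIR.

```python
def insert_loads(block, var):
    out = []
    needs_load = True  # no up-to-date copy of `var` in a register yet
    for instr in block:
        kind = instr[0]
        if kind == "use" and var in instr[1:]:
            if needs_load:
                out.append(("load", var))  # first use since the last store
                needs_load = False
        out.append(instr)
        if kind == "store" and instr[1] == var:
            needs_load = True  # any loaded copy is stale after a store
    return out

block = [("store", "v"), ("use", "v"), ("use", "v"), ("store", "v"), ("use", "v")]
for i in insert_loads(block, "v"):
    print(i)
```

Note the second consecutive use needs no load: the value is already available. The real pass does this as a dataflow analysis over the whole CFG rather than one block.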
Using variables, we can compute the minimum of two numbers like so:
val min(a: i64, b: i64) -> i64 {
    returns a if a < b else b
}

fn min_impl(a: i64, b: i64) -> i64
    impls min, time, space
{
    var v;
    if a < b {
        v = a;
    } else {
        v = b;
    }
    v
}
The Frontend supports taking references of variables and uses C/C++/CUDA syntax, as this is likely what a hypothetical Caiman user would be very familiar with:
val main() -> i64 {
    c :- 50
    returns one(c)
}

val one(a: i64) -> i64 {
    returns a + 1
}

fn main_func() -> i64 impls main, time, space
{
    var x = 50;
    *one_func(&x)
}

fn one_func(a: &i64) -> &i64 impls one, time, space
{
    // load the value from `a`, add 1,
    // then store it back into `a`
    *a = *a + 1;
    // return the reference
    a
}
One thing to note is that the specification one doesn’t know anything about the fact that the implementation, one_func, uses references. We could have written an equally good implementation without references. This still type-checks because &i64 refines the i64 specified in the specification.
The Caiman type system does not allow a reference to be captured across a function call. For example, the following implementation would not be allowed:
fn foo() -> i64 impls // ...
{
    let x: &i64 = // ...
    bar();
    *x + 1
}
Even though variables are references under the hood, the Caiman Frontend does let you do this with variables. To support this, the Frontend inserts loads and stores so that a value, not a reference, is captured. So under the hood, we convert code like this:
fn foo() -> i64 impls // ...
{
    var x = // ...
    bar();
    x + 1
}
to
fn foo() -> i64 impls // ...
{
    var x = // ...
    let _x = x;
    // _x, which is a value, is captured across the function call
    bar();
    var x = _x;
    x + 1
}
As alluded to earlier, one part of this work was designing a new higher-level intermediate representation (HIR). This IR served as a stepping stone between the frontend abstract syntax tree (AST), and the Caiman Assembly.
I chose a CFG and three-address-code representation, a common design where the program is split into basic blocks of instructions; each instruction performs a single operation (typically having two arguments and a destination, hence the three addresses in three-address code), and control flow appears only as the last instruction of a block. Control flow between basic blocks is represented as edges in a control flow graph.
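The shape of such an IR can be sketched with a few small data structures. Field names here are illustrative, not the actual Caiman HIR; the hand-lowered blocks model the earlier min example.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    dest: str     # destination "address"
    op: str       # a single operation per instruction
    args: tuple   # at most two argument "addresses"

@dataclass
class BasicBlock:
    instrs: list       # straight-line three-address instructions
    terminator: tuple  # e.g. ("br", cond, then_id, else_id) or ("ret", val)

# min lowered by hand: compute a < b, branch, assign v in each arm, join
cfg_blocks = {
    0: BasicBlock([Instr("c", "lt", ("a", "b"))], ("br", "c", 1, 2)),
    1: BasicBlock([Instr("v", "copy", ("a",))], ("jmp", 3)),
    2: BasicBlock([Instr("v", "copy", ("b",))], ("jmp", 3)),
    3: BasicBlock([], ("ret", "v")),
}
```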
Like the Caiman Assembly, this IR makes loads and stores explicit, which enables the aforementioned implicit load insertion pass. It is also in static single assignment (SSA) form, where each variable is assigned only once.
The HIR also includes a generalized dataflow solver, which can be parameterized on various dataflow analyses.
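Such a solver can be sketched as a worklist algorithm parameterized on a transfer function and a meet operator. This is a hedged illustration (a forward analysis with hypothetical signatures), not the HIR’s actual solver.

```python
def solve_forward(cfg, transfer, meet, bottom):
    """cfg maps each block to its successor list; returns the out-fact
    of each block at the fixed point."""
    preds = {b: [] for b in cfg}
    for b, succs in cfg.items():
        for s in succs:
            preds[s].append(b)
    out_facts = {b: bottom for b in cfg}
    worklist = list(cfg)
    while worklist:
        b = worklist.pop()
        # meet the facts flowing in from all predecessors
        in_fact = meet([out_facts[p] for p in preds[b]]) if preds[b] else bottom
        new_out = transfer(b, in_fact)
        if new_out != out_facts[b]:
            out_facts[b] = new_out
            worklist.extend(cfg[b])  # successors must be revisited
    return out_facts

# Instantiating it as a reaching-definitions-style analysis:
defs = {"entry": {"x"}, "then": {"y"}, "exit": set()}
cfg = {"entry": ["then", "exit"], "then": ["exit"], "exit": []}
out = solve_forward(cfg,
                    transfer=lambda b, f: f | defs[b],
                    meet=lambda fs: set().union(*fs),
                    bottom=set())
print(sorted(out["exit"]))  # ['x', 'y']
```

Swapping in a different transfer/meet pair yields a different analysis (e.g. live variables runs the same loop backward), which is the point of parameterizing the solver.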
A major feature that makes writing Caiman programs bearable is type inference. So far, all of the examples we’ve seen have used type inference to avoid writing down the verbose types in Caiman. A type in a Caiman specification is just the base type (what most programmers would think of when they hear “type”), but a type in a Caiman implementation includes these components:
- The base type, such as i64, &i32, or bool. Base types can also carry flags: i64'<storage, map_read> denotes an i64 that is held on the GPU and mapped for reading from the local device.
- The quotient, which associates a piece of the implementation with a variable in a specification. For example, node(val.x) ties a piece of the implementation to the variable x in the value specification, while input(tmln.y) ties a piece of the implementation to the input variable y in the timeline specification.
- The flow, which describes how a resource may be used: usable denotes that a resource can be consumed, while dead denotes that the resource is empty.

The tags of a variable can change as it is assigned to or passed between
basic blocks. Furthermore, a tag is not just applied to variables but is also used to “type” control flow such as ifs and function calls. For example, here is the min example from earlier with explicit type annotations for everything (tags are denoted after the @ symbol):
val min(a: i64, b: i64) -> out: i64 {
    c: bool :- a < b
    ret: i64 :- a if c else b
    returns ret
}

fn min_impl(
    a: i64 @ [input(val.a)-usable, none(tmln)-usable, none(sptl)-usable],
    b: i64 @ [input(val.b)-usable, none(tmln)-usable, none(sptl)-usable]) ->
    i64 @ [node(val.out), none(tmln)-usable, none(sptl)-usable]
    impls min, time, space
{
    var v: i64 @ [none(val)-dead, none(tmln)-usable, none(sptl)-saved];
    let c: bool @ [node(val.c)-usable, none(tmln)-usable, none(sptl)-usable]
        = a < b;
    if @ [node(val.ret), none(sptl), none(tmln)] c {
        v = a;
        @out { v: [node(val.a)-usable, none(tmln)-usable, none(sptl)-saved] };
    } else {
        v = b;
        @out { v: [node(val.b)-usable, none(tmln)-usable, none(sptl)-saved] };
    }
    @in { v: [node(val.ret)-usable, none(tmln)-usable, none(sptl)-saved] };
    v
}
You don’t have to understand what these types mean; the key thing to note is that they are time-consuming to write by hand and somewhat non-trivial: in this example, v changes its quotient and flow three times.
The type inference algorithm I implemented is basically Hindley-Milner. My implementation is a tree matching over an abstract set of type nodes. Like most things in PL, we can think of a type as a tree. For example, the type i32 could be represented as a tree with a root node for int and a child leaf for the width of the integer:
int
|
|
i32
The only difference between working with base types and working with quotients is the set of vertices. So, by abstracting away this set we can have the same code work for both data type deduction and quotient deduction. Flows are deduced separately, via a dataflow analysis. Certain instructions change the flow predictably, and the new flow stays until a later instruction in the program changes the flow again.
Deducing quotients is a bit more involved, and since the base type inference is a strict subset of the quotient deduction, we’ll continue our discussion with the quotient deduction.
Specifically, the implementation of the type inference algorithm is based on e-graphs. We have a union-find data structure where the canonical representative of an equivalence class is mapped to an e-class id, in this case, these IDs will be variables in a specification.
Since tags can change as we write to references or update variables, we first convert the program to a reference SSA form, where variables are assigned and references stored, only once. This gives every value of a variable a distinct name we can use to refer to it during type inference. Then we scan through each spec building up an e-graph. Next, we scan through the implementations and build up a disconnected e-graph. Eventually, something will ensure that a part of the implementation matches a part of the spec and we’ll unify the two graphs.
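The unification at the heart of this process can be sketched with a classic recursive matcher over type trees. This is an illustration under simplifying assumptions, not the Frontend’s e-graph implementation: nodes are (label, children...) tuples, strings starting with "?" are unification variables standing in for not-yet-known pieces, and the e-class bookkeeping and occurs check are omitted.

```python
def resolve(t, subst):
    while isinstance(t, str) and t.startswith("?") and t in subst:
        t = subst[t]
    return t

def unify(a, b, subst):
    a, b = resolve(a, subst), resolve(b, subst)
    if a == b:
        return True
    if isinstance(a, str) and a.startswith("?"):
        subst[a] = b
        return True
    if isinstance(b, str) and b.startswith("?"):
        subst[b] = a
        return True
    if isinstance(a, tuple) and isinstance(b, tuple) \
            and a[0] == b[0] and len(a) == len(b):
        # same label: recursively unify corresponding children
        return all(unify(x, y, subst) for x, y in zip(a, b))
    return False  # mismatched labels, e.g. Input(x) vs Mul(Input(x), 2)

# spec side: ret :- left if d else right, with the pieces still unknown
spec = ("Return", ("Select", "?d", "?left", "?right"))
impl = ("Return", ("Select", ("Lt", ("Input", "x"), ("Lit", 0)),
                   ("Mul", ("Input", "x"), ("Lit", 2)),
                   ("Sub", ("Input", "x"), ("Lit", 2))))
subst = {}
assert unify(spec, impl, subst)
print(subst["?d"])  # ('Lt', ('Input', 'x'), ('Lit', 0))
```

Unifying the two Return nodes forces their Select children to unify, which in turn resolves the unknowns, mirroring how unifying the spec and implementation e-graphs propagates downward.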
Consider the following pseudo-Caiman program:
val foo(x: i64) -> out: i64 {
    d :- x < 0
    left :- x * 2
    right :- x - 2
    ret :- left if d else right
    returns ret
}

fn foo_impl(x: i64) -> i64 {
    let d = x < 0;
    if d {
        let left = x * 2;
        let _out.0 = left;
    } else {
        let right = x - 2;
        let _out.1 = right;
    }
    // phi nodes are not something a user can use,
    // they exist only in the HIR
    let ret = phi _out.0 _out.1;
    let _out.2 = ret;
    _out.2
}
Going line by line, if we look at the line d :- x < 0 in the spec, we know that the spec variable $d is going to map to an expression like Lt(x, 0). The dollar sign here denotes the id of an equivalence class, whereas a name without a dollar sign just represents a variable. We also know that $x is the equivalence class id of the input variable x in the spec. This gives the following graph, depicting two equivalence classes, where dotted lines represent nodes in the same equivalence class and solid edges connect parent nodes in the type expression tree to their children:
Now when we look at the child of Lt(x, 0), namely x, we can unify these two equivalence classes to produce a graph like so:
Now the variable x is equivalent to the node Input(x) with the e-class id of $x, so we can represent this as so:
Continuing this process for the spec and implementation, we reach the following e-graph where the spec is depicted on the left and the implementation on the right and class ids are denoted with a “$” whereas variables are denoted with a “%”.
At this point, we know that the variable %_out.2 must be equivalent to the spec variable $out since %_out.2 is returned from the implementation and $out is the spec’s return value. So we can unify the two classes as shown:
But once we do that, we know that the Return type expression of the implementation is equivalent to the Return type expression in the specification (depicted by the existence of a path between these two nodes consisting solely of dotted edges). So, we can unify the corresponding children of these nodes, in this case the Select nodes:
We can keep recursively applying this transformation until we get the following graph:
Now, if we were to look up the class id of the implementation variable %d, we see that it is in an equivalence class with the id $d. Thus, the quotient for %d must be node(val.d). Unification can fail if we attempt to unify two nodes that do not match, for example, if we try to unify Input(x) with Mul(Input(x), 2). If this happens, we give the user an error; otherwise, we continue and use the e-graph to fill in annotations in our program. This makes the tags easily accessible for the final lowering pass into Caiman Assembly, which doesn’t happen while the program is in SSA form.
Now consider a specification like this:
val foo() -> i64 {
    res1, res2 :- bar()
    returns res1
}
and imagine an implementation which together with the above spec generates the following e-graph:
After unifying the return values, we might get something that looks like this:
Notice how the implementation variable %b doesn’t match up with anything in the specification. However, we can also observe that the specification contains only one subtree of the form Extract(1, bar()), which is in the class with id $res2. So, if we come across a matching subtree in the implementation, then since there is a unique, isomorphic tree in the specification, we know that those subtrees have to match. This automatically gives us that %b is equivalent to $res2, by the logic that there is nothing else it could possibly be equivalent to.
This unification strategy is a more bottom-up approach as it starts with nodes that must be equivalent and uses that equivalence to build up equivalences between larger trees. The earlier approach I described is a more top-down unification style as it starts with two trees that must be equivalent and uses that to infer equivalences of subtrees.
The Caiman Frontend does both. It starts with a top-down unification pass, after which it continually unifies subtrees that have unique matches in a bottom-up style until every implementation variable either belongs to an e-class or has multiple potential matches. This allows an implementation of the previous value spec to be written without any annotations.
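The bottom-up step can be sketched as a uniqueness check: if an implementation subtree has exactly one structurally matching subtree anywhere in the spec, the two must be equivalent. Trees here are (label, children...) tuples, and the class table and helper names are hypothetical.

```python
def subtrees(t):
    yield t
    if isinstance(t, tuple):
        for child in t[1:]:
            yield from subtrees(child)

def unique_match(spec_exprs, spec_classes, impl_subtree):
    matches = [s for e in spec_exprs for s in subtrees(e)
               if s == impl_subtree]
    if len(matches) == 1:
        return spec_classes[matches[0]]
    return None  # zero or several candidates: no safe inference

# res1, res2 :- bar() gives the spec Extract(0, ...) and Extract(1, ...) nodes
spec_exprs = [("Extract", 0, ("Call", "bar")),
              ("Extract", 1, ("Call", "bar"))]
spec_classes = {spec_exprs[0]: "$res1", spec_exprs[1]: "$res2"}
# the implementation's %b only matches Extract(1, bar())
print(unique_match(spec_exprs, spec_classes, ("Extract", 1, ("Call", "bar"))))
```

When several spec subtrees match (as with the three 1 constants in the example further below), the helper returns nothing and the user must annotate, matching the Frontend’s refusal to make arbitrary choices.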
A similar process happens for type-checking base types and flags, except they don’t need e-class ids. Moreover, since the base types and flags of a variable cannot change, this form of type inference is performed at the AST level, before the program is lowered into the HIR. Since passes like quotient deduction and load insertion require knowing every variable’s base type, the data type inference needs to happen sooner. While it could be done on the HIR, doing it at the AST level allows stripping away data type annotations when lowering the AST to the HIR, simplifying the HIR.
There are a few notable situations where a user must supply type annotations; they boil down to cases where the compiler would otherwise have to guess, such as a function call with no matching call in the spec, or isomorphic unconstrained subtrees in the spec. The examples below walk through these.
These restrictions prevent type inference from becoming a backtracking algorithm that must search through an exponentially large search space. Further, they also enable a type inference algorithm that won’t make arbitrary choices that a user might not desire. Other than these situations, a user can get away with letting the compiler figure out the types for everything else that’s currently implemented in the language.
Regarding a function call that doesn’t have a matching call in the spec, here’s an example of implementing one spec across two implementation functions:
// ...
val foo(x: i64) -> i64 {
    g :- x + 10
    h :- g * 2
    k :- h + g
    returns k / 2
}

fn foo1(z: i64) -> i64 impls foo, time, space {
    // annotation of result of function call required here,
    // note there is no call in the spec of `foo`
    let g @ node(val.g), h @ node(val.h) = foo2(z);
    (h + g) / 2
}

fn foo2(a: i64 @ node(val)) ->
    (i64 @ node(val.g), i64 @ node(val.h))
    impls foo, time, space
{
    let g = a + 10;
    (g, g * 2)
}

pipeline main { foo1 }
Another form of this is how one could implement a function that uses output parameters:
val main() -> i64 {
    returns one()
}

val one() -> i64 {
    returns 10
}

fn main_func() -> i64 impls main, time, space
{
    var x;
    *one_func(&x)
}

// the spec `one` doesn't have an argument that `a` corresponds to,
// thus its quotient is `none`. Its flow is `dead` because there is no
// value contained within the variable when it's passed to the function
fn one_func(a: &i64 @ none(val)-dead) -> &i64
    impls one, time, space
{
    *a = 10;
    a
}
Notice how the call to one_func
from main_func
doesn’t need any annotations,
despite the fact that we are trying to unify one_func(%x)
with
one()
. The classic unification algorithm would fail here, but we support this.
If a call in the schedule is unifying with a call in the spec that has `n` fewer arguments, we assume that the first `n` arguments that are unconstrained during the top-down pass are `none`, and drop them during unification. This may seem strange, but note that in the top-down pass, the only way for the arguments to be unified before the function call is if they have user-supplied annotations or are used in something that was given a type annotation by the user. So this system allows a user to precisely control this behavior by providing type annotations.
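As a sketch of this rule, here is a minimal Python model (not the real compiler, which operates on the HIR) of dropping the first `n` unconstrained arguments during call unification. The `constraints` map, the string-valued quotients, and `unify_call` itself are hypothetical names for illustration.

```python
def unify_call(sched_args, spec_args, constraints):
    """Unify a schedule call's arguments with a spec call that has
    fewer arguments. `constraints` maps schedule argument names to
    their quotient; None means unconstrained after the top-down pass.
    Toy model for illustration only."""
    n_extra = len(sched_args) - len(spec_args)
    kept, dropped = [], []
    for arg in sched_args:
        # The first n_extra still-unconstrained arguments are assumed
        # to have quotient `none` and are dropped from unification.
        if len(dropped) < n_extra and constraints.get(arg) is None:
            constraints[arg] = "none"
            dropped.append(arg)
        else:
            kept.append(arg)
    if len(kept) != len(spec_args):
        raise TypeError("cannot reconcile call arities")
    # Pairwise unification of the remaining arguments.
    for arg, spec in zip(kept, spec_args):
        if constraints.get(arg) not in (None, spec):
            raise TypeError(f"conflict on {arg}")
        constraints[arg] = spec
    return constraints

# `one_func(&x)` unifying with `one()`: the lone argument is dropped.
c = unify_call(["a"], [], {})
# c == {"a": "none"}
```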
Here’s an example of a function that has isomorphic unconstrained subtrees:
val foo(x: i64) -> i64 {
a :- 1
b :- 1
c :- 1
returns x
}
fn foo_impl(z: i64) -> i64 impls foo, time, space {
// no way to tell which `1` we mean, so we need annotations
// type inference will not make arbitrary choices
// (except for assuming that inputs and outputs match up with the spec
// unless otherwise specified)
let x @ node(val.a) = 1;
let y = 1 @ node(val.b);
let h = 1 @ node(val.c);
z
}
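To make the ambiguity concrete, here is a toy Python model of why inference refuses to choose: several structurally identical spec nodes match the same schedule literal, so there is no principled way to pick one without an annotation. The `ambiguous_matches` helper and the tuple encoding of spec nodes are invented for illustration; the real implementation works over e-classes.

```python
def ambiguous_matches(spec_nodes, literal):
    """Return the names of spec nodes whose structure matches a
    schedule literal. More than one match means the schedule needs
    an annotation: inference will not pick arbitrarily."""
    return [name for name, node in spec_nodes.items() if node == literal]

# the spec of `foo` above: three structurally identical nodes
spec = {"a": ("const", 1), "b": ("const", 1), "c": ("const", 1)}
matches = ambiguous_matches(spec, ("const", 1))
# three isomorphic candidates, so `let x = 1;` alone is ambiguous
```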
Let’s look at a simple program that uses the timeline spec:
extern(gpu) simple(x : i32) -> out: i32
{
dimensions: 3,
// ..
}
extern(gpu) simple2(x : i32) -> out: i32
{
dimensions: 3,
// ..
}
// name the output parameter so
// we can refer to it in a type annotation
// v
val foo(x: i32, b: bool) -> out: i32
{
y :- simple'<1, 1, 1>(x)
y2 :- simple2'<1, 1, 1>(x)
ret :- y if b else y2
returns ret
}
tmln foo_time(e: Event) -> out: Event {
loc1, rem1 :- encode_event(e)
sub :- submit_event(rem1)
snc :- sync_event(loc1, sub)
returns snc
}
sptl space(bs: BufferSpace) -> out: BufferSpace {
returns bs
}
fn foo_impl(x: &i32, b: bool) -> i32 impls space, foo_time, foo {
let e = encode-begin gpu;
encode e.copy[x_gpu <- x];
let f = if b {
encode e.call[y_gpu <- simple'<1, 1, 1>(x_gpu)];
submit e
} else {
encode e.call[y_gpu <- simple2'<1, 1, 1>(x_gpu)];
submit e
};
(await f).y_gpu
}
You might be wondering what the type of `f` is, or how we know that when we synchronize on `f`, we can read from the variable `y_gpu`. Well, the Caiman Frontend has record types! Although, admittedly, the only way to construct a record is by encoding commands. The base type of `e` is actually `Encoder'{y_gpu: i32'<storage, map_read>, x_gpu: i32'<storage, copy_dst>}`, the base type of `f` is `Future'{y_gpu: i32'<storage, map_read>, x_gpu: i32'<storage, copy_dst>}`, and the type of the await expression is `{y_gpu: i32}`. Recall that the syntax `Future'R` in Caiman is essentially like calling a type “function” with the type argument `R` in a higher-kinded type system, or, in C++ and Java, something like `Future<R>`. Further note that `i32'<storage, map_read>` is syntax denoting an `i32` with the flags `storage` and `map_read`. These flags control where the memory allocation is placed. For example, the flag `map_read` indicates the buffer is readable by the local device.
Let’s just look at these three statements for now:
let e = encode-begin gpu;
encode e.copy[x_gpu <- x];
// ..
encode e.call[y_gpu <- simple'<1, 1, 1>(x_gpu)];
// ..
First observe that records, along with their regular subtype relation ($\preceq$) define a complete lattice. Thus, given two records, $A$ and $B$, we can compute their meet, $C = A \sqcap B$, and we know that $C \preceq A$ and $C \preceq B$. The meet of $A$ and $B$ will be a new record type that has the fields of both $A$ and $B$, provided that any shared field has the same type in both records. By the regular subsumption rule, if we can show that an expression, $e$ has type $\tau$, and that $\tau \preceq \tau’$, then $e$ has type $\tau’$.
So looking at `let e = encode-begin gpu;`, we start by generating the constraint that `e` has type `Encoder'{}` (an encoder of the empty record). Then looking at `encode e.copy[x_gpu <- x];`, we also get the constraint (ignoring flags) that `e` has type `Encoder'{x_gpu: i32}`. Let’s call $\tau_1 = \{\}$ and $\tau_2 = \{x\_gpu: i32\}$. Now, the standard unification algorithm would fail when unifying $\tau_1$ with $\tau_2$, but our implementation will instead meet the two constraints to produce the type `Encoder'{x_gpu: i32}`. By subsumption, if `e` has type `Encoder'{x_gpu: i32}`, then it also has type `Encoder'{}`, satisfying all constraints. We can do this again for `encode e.call[y_gpu <- simple'<1, 1, 1>(x_gpu)];` to get that the inferred type of `e` is `Encoder'{x_gpu: i32, y_gpu: i32}`, ignoring flags.
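The meet operation described above can be sketched in a few lines of Python, assuming records are modeled as plain dicts from field names to types (a simplification of the real e-graph-based implementation):

```python
def meet(a, b):
    """Meet of two record types under width subtyping, where a record
    with more fields is the subtype. Shared fields must agree on
    their type; the result has the union of all fields."""
    out = dict(a)
    for field, ty in b.items():
        if field in out and out[field] != ty:
            raise TypeError(f"field {field}: {out[field]} vs {ty}")
        out[field] = ty
    return out

# the constraints generated for `e` in the example, ignoring flags:
t1 = {}                 # from `let e = encode-begin gpu;`
t2 = {"x_gpu": "i32"}   # from `encode e.copy[x_gpu <- x];`
t3 = {"y_gpu": "i32"}   # from `encode e.call[y_gpu <- ...];`
inferred = meet(meet(t1, t2), t3)
# inferred == {"x_gpu": "i32", "y_gpu": "i32"}
```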
Now this game of meeting constraints doesn’t work in all situations. Consider this function:
type R = { v: i32'<storage, map_read> };
// records in type signatures must be annotated, i.e., we can't just
// say `Encoder`. Furthermore, record fields are passed to and returned
// from a function in the order they are defined in the type definition.
fn bar_impl(e: Encoder'R) -> i32 impls bar_time, bar, space
{
let f = submit e;
f.v
}
Consider that instead of accessing `f.v`, we access `f.g`. This would generate the constraint that `f` has type `Future'{g: i32}`. Since `f` is the future returned by submitting `e`, we also get the constraint that `e` has type `Encoder'{g: i32}`. Now, based on the function signature, we already have the constraint that `e` has type `Encoder'{v: i32}`, so if we do the meet trick again, we’d get that `e` has type `Encoder'{v: i32, g: i32}`. But since the function signature accepts an `Encoder'{v: i32}`, a user might not give us an `Encoder'{v: i32, g: i32}`! The problem here is that function arguments are contravariant. In other words, the lowest type in the lattice that `e` could be is `Encoder'{v: i32}`. Our implementation uses a special flag to denote contravariant constraints that cannot be unified to a lower type in the lattice.
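A minimal Python sketch of this contravariant flag, again modeling records as dicts: a constraint originating from a function parameter is treated as rigid, and meeting it with a constraint that demands extra fields is an error rather than a lowering. The name `meet_with_rigid` and the dict representation are assumptions for illustration.

```python
def meet_with_rigid(rigid, other):
    """Meet where `rigid` comes from a contravariant position (a
    function parameter) and may not be lowered in the lattice. Every
    field `other` demands must already be present in `rigid` with the
    same type; otherwise a caller might pass a value lacking it."""
    for field, ty in other.items():
        if rigid.get(field) != ty:
            raise TypeError(f"parameter type lacks field {field}: {ty}")
    return rigid

sig = {"v": "i32"}  # `e: Encoder'{v: i32}` from the function signature
meet_with_rigid(sig, {"v": "i32"})   # accessing f.v is fine
# meet_with_rigid(sig, {"g": "i32"}) would raise: f.g is unjustified
```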
This system allows us to pass subtypes to functions and return subtypes from functions. For example:
type R = { foo: i32'<storage, map_read>, baz: i32'<storage, map_read> };
type T = { foo: i32 };
fn foo_impl(x: &i32) -> T impls foo_time, foo, space {
let e = encode-begin gpu;
encode e.copy[x_gpu <- x];
encode e.call[foo <- simple'<1, 1, 1>(x_gpu)];
encode e.call[baz <- simple'<1, 1, 1>(x_gpu)];
// ignoring flags, e has type Encoder'{x_gpu: i32, foo: i32, baz: i32}
// and we can pass it to `bar_impl` which accepts types of
// Encoder'{foo: i32, baz: i32}
combine_impl(await bar_impl(e))
}
fn bar_impl(e: Encoder'R) -> Future'R impls bar_time, bar, space
{
submit e
}
fn combine_impl(r: R) -> T impls combine_time, space, combine {
// we get a {foo: i32, baz: i32}
// and we can use it to return a {foo: i32}
r
}
One (perhaps questionable) design decision is that the flags are only relevant when the record type is part of an `Encoder` or `Future`. So in `combine_impl`, there’s no difference between `R` and `{foo: i32, baz: i32}`.
Speaking of flags, what I described is a slightly simplified model of the real implementation. In truth, each `Encoder` and `Future` has separate record types for the defined variables and for each flag. So `e` would actually have a type represented by something like:
Encoder'{
defined: {foo: i32, baz: i32, x_gpu: i32},
storage: {foo: i32, baz: i32, x_gpu: i32},
map_read: {foo: i32, baz: i32},
copy_dst: {x_gpu: i32},
map_write: {},
// ...
}
The algorithm keeps track of a list of side conditions which enforces that
the defined
record of every encoder and future is a subtype of the records
for all the flags. These side conditions are checked when
a new constraint is added. This catches errors like reading from a variable
before it was defined.
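These side conditions can be sketched as follows, modeling an encoder’s type as a dict from `defined` and each flag to a record, as in the example above. The helper `check_side_conditions` is a hypothetical name; the real compiler re-checks its side conditions whenever a constraint is added.

```python
def check_side_conditions(enc):
    """Check that the `defined` record of an encoder/future is a
    subtype of each flag record: every field a flag mentions must
    already be defined with the same type. Toy model of the side
    conditions described above."""
    defined = enc["defined"]
    for flag, record in enc.items():
        if flag == "defined":
            continue
        for field, ty in record.items():
            if defined.get(field) != ty:
                raise TypeError(
                    f"{field} used with flag {flag} before it is defined")

e = {
    "defined":  {"foo": "i32", "baz": "i32", "x_gpu": "i32"},
    "storage":  {"foo": "i32", "baz": "i32", "x_gpu": "i32"},
    "map_read": {"foo": "i32", "baz": "i32"},
    "copy_dst": {"x_gpu": "i32"},
}
check_side_conditions(e)  # passes: every flagged field is defined
```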
Taking a step back, internally, this system represents the type of a record as a set of constraints, and we unify the constraints to produce the most general type that satisfies all of them. Without putting too much thought into it, it seems that a similar system could potentially work for a language that has the look and feel of an untyped language like Python, but provides static typing guarantees by inferring the set of constraints on each variable. For example, some constraint might be that $\tau$ has the operator $+ : \tau \times \tau \rightarrow \tau’$ defined on it. Such information could probably be useful for a JIT compiler as well.
Overall, this project has made it possible for a user to actually write Caiman programs. This has allowed greater potential for future use and improved test coverage of all parts of the compiler, exposing multiple bugs.
Although the assembly was not designed to be regularly written by hand, the following chart compares the lines of code required for a frontend and assembly implementation of the same exact test program:
While the frontend has come a long way, there are still things that can be done:
Even more far-off ideas: