Ensuring good test coverage of a compiler is hard. That’s why I developed Bear, a compiler fuzzer that generates random test cases for compiler optimization passes. (Named as such because bears are fuzzy?) Bear uses genetic programming to evolve novel programs in the search space. Bear generates programs in a custom DSL (domain-specific language) known as Bare C before being lowered into the target IR.
The IR Bear currently targets is BRIL (Big Red Intermediate Language), which is the IR used in Cornell’s PhD level compilers course. As part of that class, students write various compiler passes optimizing BRIL programs including local value numbering (LVN), dead code elimination (DCE), loop-invariant code motion (LICM), SSA conversion, and induction variable elimination. As such, these are the optimizations that Bare C was designed to be expressive enough to stress.
Programs are generated in the Bear DSL, Bare C, which are then lowered to the BRIL intermediate
representation. The lowered program can then be run directly by the BRIL reference
interpreter and the results can be compared with the result of running the program by the
compiler being tested. Differences in the printed stdout
between the two programs
would indicate an error in the compiler under test.
Since we desire Bear to work with most, if not all, BRIL compiler pipelines, it will invoke the compiler under test via the command line. Therefore, Bear always generates terminating, error-free programs that can run in a relatively short amount of time so that every program we take the time to run will provide us with a valid test.
Before we continue, let’s quickly discuss PCFGs or probabilistic context-free grammars. A PCFG is basically a context-free grammar where every production of a nonterminal is assigned a probability. The sum of all the probabilities of a nonterminal sum to one.
Consider the following example:
expr ::= expr + expr .1
| expr - expr .1
| expr * expr .05
| var .25
| num .5
If we were to generate a random expression from the expr
nonterminal,
then there would be a 50% chance that we generate a number, a 25% chance
we generate a variable, and so on.
Now notice that if we were to generate an addition operator, the left and right child expressions would have the same chances of producing a given production. This might not make sense in all circumstances. For example, we might want our PCFG to encode only left associativity, and therefore the probability of the right child of an addition generating another addition should be 0.
To achieve this, we can convert our PCFG to a bigram (as opposed to the unigram shown earlier) by making each child nonterminal a new nonterminal that has its own set of probabilities.
expr ::= expr1 + expr2 .1
| expr3 - expr4 .1
| expr5 * expr6 .05
| var .25
| num .5
expr1 ::= expr1 + expr2 .1
| expr3 - expr4 .1
| expr5 * expr6 .05
| var .25
| num .5
expr2 ::= expr1 + expr2 0
| expr3 - expr4 0
| expr5 * expr6 0
| var .33
| num .66
...
While the grammar itself is context-free, we are essentially making the probabilities context-dependent. For even more context, we could repeat this process for the child nonterminals and construct a trigram that would have information about the last three parent productions, not just the last two as in a bigram or the last one as in a unigram.
With a PCFG, we can easily generate random programs by random probing. Given the above bigram, suppose we want to generate an expression. We start by randomly generating one of the top-level productions, weighted by their probability.
So we start by choosing a production for expr
. With a 10% probability,
suppose +
is chosen. Now we must choose a production for expr1
and expr2
.
For, expr1
, suppose that var
is chosen and for expr2
suppose that num
is chosen. The total probability of generating such an expression is
0.1 * 0.25 * 0.66 = 0.0165
. Observe that this method inherently biases towards
shorter programs because, with each choice, there is a chance that a terminal
production is chosen, preventing the expression from growing deeper.
Genetic programming is a method of generating programs by a genetic algorithm, which is a biological-inspired stochastic optimization method. Genetic algorithms come in handy when the fitness that you are trying to optimize is not differentiable.
In genetic programming, a population of programs is evolved via mutation, crossover, or reproduction according to their fitness, which is a score of how good (or fit) the program is. For example, when trying to create a program that controls a bipedal robot, a possible fitness score for a program could be how far the robot moved when the program ran.
During mutation, a program in the old generation is copied to the new generation with some small changes such as changing a node in the AST. During crossover, two programs in the previous generation create two new programs in the next generation by mixing their genes. For example, swapping subtrees in the parent program’s ASTs. During reproduction, a parent program in the original generation is simply cloned, as is, into the new generation. (Why this isn’t just called cloning or mitosis is beyond me).
Not every program in a generation has equal chances of sending their genes to the next generation. Typically, a subset of the population is selected to spawn the next generation based on their fitness. A common method, known as roulette selection is to select a subset at random, weighting the probability of an individual being chosen based on their fitness.
In pseudocode, the basic structure of genetic programming has the following form:
population = [random_program() for _ in range(population_size)]
while True:
fitnesses = [fitness(p) for p in population]
population = select(population, fitnesses, population_size)
population = [reproduce(population) for _ in range(population_size)]
def reproduce(population):
with probability p_mutate:
return mutate(random.choice(population))
with probability p_crossover:
return crossover(random.choice(population), random.choice(population))
The Bear DSL (referred to as Bare C), is a language designed to stress complex control flow and data dependencies in BRIL. It consists of basic arithmetic and boolean operators, conditionals, for, while and do-while loops, switches, break, continue, intra-procedural (local) try-catch, throw, print, and return. Values in the DSL can be one of two types: 64-bit signed integers or booleans. To always generate correctly typed programs, the grammar of the language enforces that only correctly typed programs can be formed into valid DSL abstract syntax trees.
Bare C also includes a special syntactic construct for Duff’s Device to generate programs with irreducible control flow. To ensure we can generate programs that trigger the optimizations we want to test, Bare C has non-terminals for redundant expressions, loop-invariant expressions (expressions whose values are constant from one loop iteration to the next), and induction variables.
The DSL also has a dead block which contains only dead code. Dead code does not have any side effects (such as prints) nor does it modify any variables that may be necessary for a computation with side effects.
Below is an excerpt from the Bare C grammar, using regular expressions *
, +
, and ?
for brevity.
aexpr ::= aexpr + aexpr
| aexpr - aexpr
| aexpr / aexpr
| aexpr * aexpr
| n | redundant(aexpr) | loop-invariant(aexpr) | induction-var(aexpr * aexpr + aexpr)
| var
;;
bexpr ::= aexpr cmp aexpr
| ...
;;
expr ::= aexpr | bexpr
;;
statement ::= var := expr
| return expr?
| throw(n) expr?
| print expr+
;;
loop-statement ::= statement
| break(n)
| continue(n)
| step
;;
block ::= statement
| if bexpr1 { block* } else { block* }
| do { loop-block* } while var cmp aexpr1
| while var cmp aexpr1 { loop-block* }
| for var in aexpr1 .. aexpr1 by aexpr2 { loop-block* }
| match aexpr3 { (n => block*)+ }
| try { block* } catch var? { block* }
| switch aexpr3 { do { (case n: loop-block*)+ }
for var in aexpr1 .. aexpr1 by aexpr2 } # Duff's Device
| dead { block* }
;;
program ::= block+
One thing to note is that the grammar is neither a unigram nor a bigram
model. In the above grammar, you can see some repeated productions such as
loop-statement
, loop-block
, aexpr1
, aexpr2
, and bexpr1
. However, many things
are not repeated to avoid an explosion of probabilities. The nonterminals that
are repeated are those that I thought would benefit the most from having more
context information to invoke the optimizations I intended to test.
To generate programs in the Bear DSL, we randomly probe a PCFG, generating programs top-down. While the grammar of Bare C enforces things like break and continue only occur in the body of loops, nothing enforces that a throw statement must occur in the body of a try block. For constraints like this, we simply regenerate the invalid statement or expression until a valid one is yielded. This filtering and retry approach has seen success in fuzzers like CSmith.
The Bare C language has loop-invariant and redundant nonterminals. When the generator selects to add one of
these to the program being generated, it needs to know what expressions are redundant and loop invariant. To
do this we perform an incremental dataflow analysis to keep track of defined variables, available expressions
(which can be reused in a redundant expression), and loop-invariant variables.
Every time a new statement is generated, we perform a dataflow analysis on that statement. For example, when
the assignment x := 10
is generated, we must kill all previous available expressions that use x. We keep track of
the analysis facts in a hierarchical context, where each level of the context corresponds to the current level of
scope nesting the generator is currently at. When we exit a child scope (such as the body of the if), we merge
the corresponding child contexts with all of its siblings (in this example, the else block of the if-else) and then
merge this final output with the parent context.
For example, when generating an if, we perform the following:
/// Generates an if statement
/// # Arguments
/// * `pcfg` - The pcfg containing probabilities for generating stmts in a block
/// * `tp` - The top level pcfg containing probabilities for generating top
/// level blocks
/// * `distribs` - probability distributions
/// * `funcs` - List of available functions
/// * `fuel` - Limit to depth of subtree
/// * `complexity` - Limits to number of statements in the program and subtree
pub(super) fn gen_if<T: Pretty + StatementTy, P: StatementPCFG>(
pcfg: &BlockPCFG<P>,
tp: &TopPCFG,
ctx: &mut Context,
distribs: &mut Distribs,
funcs: &mut FuncList,
fuel: usize,
complexity: &mut Complexity,
) -> Block<T> {
// generates random condition for the if guard
let (cond, _) =
gen_bexpr(&pcfg.if_pcfg.guard, ctx, distribs, funcs, EXPR_FUEL);
// Get a context of analysis facts for the child
let mut true_frame = ctx.child_frame();
// generate blocks for the true branch
let then_block = gen_blocks(
pcfg,
tp,
&mut true_frame,
distribs,
funcs,
fuel - 1,
complexity,
);
// get another context storing analysis facts for the right child
let mut false_frame = ctx.child_frame();
// generate blocks for the right child
let else_block = gen_blocks(
pcfg,
tp,
&mut false_frame,
distribs,
funcs,
fuel - 1,
complexity,
);
// meet the analysis facts of the two children
let sf = Context::meet(true_frame, false_frame);
// update the parent context with any new analysis facts.
ctx.update(sf);
Block::If {
guard: cond,
then: then_block,
otherwise: else_block,
}
}
The incremental analysis also prevents Bear from generating statements after
all paths have hit a return
, break
, continue
, or throw
. The goal is to
generate correct analysis information from an incomplete program as we are generating
it. This means that we cannot perform a traditional iterate until convergence
and instead perform a one-shot analysis pass through the CFG in topological
order as if it were a DAG (Directed Acyclic Graph).
For loop invariance, we avoid the need to perform a traditional iterate until convergence algorithm by randomly selecting some variables to be immutable throughout the body of the loop before we generate any blocks in the loop body. This technique is also used to avoid generating programs that modify the loop iteration variable in ways that create nonterminating programs.
Speaking of nontermination, we use an interval analysis to compute conservative bounds on the values any particular variable may take on. In my last post, I discussed interval analysis and abstract interpretation in some detail, so go check out that post for more information. This information determined by the analysis can be used to ensure that loops terminate in a reasonable amount of time by limiting loops to 10,000 iterations. It also prevents programs from exhibiting divide-by-zero exceptions.
Consider the following expression that the generator has
chosen to produce: x := 1000 / a
, where a
has the range [−100, 100]
.
We want to enforce that $a \not = 0$. The simple approach
currently used is to add to the divisor the absolute value of the divisor’s lower bound,
plus one. Therefore, we get
the statement x := 1000 / (a + 101)
where (a + 101)
has the range [1, 201]
.
A similar procedure takes place for generating the loop step, initializer, and limit to ensure that a loop doesn’t exceed 10,000 iterations.
The way we observe program behavior is through the print
statement. Therefore, during lowering we artificially
insert extra print statements to ensure that we can observe differences in the unoptimized and optimized program
behavior. Once lowered, we run the program directly with the BRIL reference interpreter. This serves as our
ground truth program output. We then compare this output with the output produced by the reference interpreter
after performing the passes and optimizations that we seek to test. This approach is known as differential testing,
and it is how we can determine whether we’ve identified a bug (mismatch in outputs).
While running the program for differential testing, we also want to be able to observe the runtime behaviors of each program to plot the test program in behavior space, which is the space of all runtime behaviors of the programs we are generating. A runtime behavior is a property of the execution of the program such as how many loop iterations it ran through or how many add instructions were executed.
To achieve this, we first insert extra NOP
instructions during lowering that
have attached information on the high-level construct being run. For example, at the start of a loop, we insert
an NOP instruction that contains an argument that specifies the current nested loop depth. During differential
testing, we run tests on a modified interpreter that will dump out a trace of all the instructions run (including
these NOPs). The fuzzer can then examine the trace, and use this information to build a behavior vector.
A behavior vector is essentially a way to encode quantities that describe a program’s runtime behavior as a vector of floating point values. For example, one element in the vector represents the maximum loop depth that was reached during program execution. Another element in the vector is a numeric representation of how the program failed. For example, success is encoded as a 0, failing by causing an exception or crash in any of the compiler stages being tested is encoded as a 6, producing less output than expected is encoded as a 2, producing the same amount, but a different trace of output is encoded as a 1, optimizations which cause a divide by zero is encoded as a 7, etc. This information is vital for novelty search, as we’ll soon discuss.
We use a genetic algorithm known as novelty search, which rewards being different rather than optimizing a specific metric. We use such an approach because we don’t have a great metric to optimize. We could optimize the number of bugs found, but there can be vast areas in the space of programs that are bug-free and the gradient for this could be near 0. Furthermore, this metric could lead to the fuzzer getting stuck in a local optimum where it continually generates similar programs that expose the same bug. Therefore, the idea is to ensure the fuzzer generates many different tests that are adequately spread across the entire space of programs and exercise many different aspects of BRIL.
The workflow of novelty search for Bear is as follows:
Now one question you might have is how to compute this “novelty”? We define novelty as the average distance of a behavior vector to its $k$-nearest neighbors. For Bear, we just use a standard Euclidean distance but scale each element slightly differently so elements that are inherently larger numbers, such as the instruction count, don’t dominate the novelty. Specifically, we classify measurements that scale with the number of dynamic instructions on a log scale before using them as elements in a behavior vector. This is because we care more about the difference between 50 divisions and 0 divisions than the difference between 1200 and 8000 divisions. The former is a difference of 2 orders of magnitude while the latter is a difference of 0 orders of magnitude. So, for a program’s behavior vector $v$, nearest neighbor vectors ${x_1, …, x_k}$, and scaling function $s$, we compute: $$\rho(v) = \frac1k\sum_{i = 1}^k \sqrt{(s(v) - s(x_i)) \cdot (s(v) - s(x_i))}$$ where, for a program with $n$ dynamic instructions and behavior vector $a = \langle a_1, a_2, …, a_j \rangle$ with element $a_i \ge 0$ $$s(a_i) = \begin{cases} \log(a_i + 1) & \text{ if } a_i \propto n \ \enspace a_i & \text{ otherwise} \end{cases}$$ Another view of novelty ($\rho(v)$) is the sparseness of the area around a point in behavior space. The more sparse the area around a behavior vector is, the more novel the behaviors of the program are.
The algorithm keeps track of an archive of individuals that, when they were generated, had a novelty exceeding a particular threshold. The nearest neighbors used to compute the novelty of a behavior vector are taken from individuals in this archive and the current generation.
As an example, consider the following example transition from generation $x$ to generation $x + 1$.
Gen $x$: Gen $x + 1$:
Notice that programs with a lower novelty are not saved to the archive but programs with a high novelty are.
In the Bear implementation the overall structure of the novelty search is as follows:
// Runs the current population and returns the next population.
let pop_idx = params.pop_idx;
let mut bvecs = vec![];
for (i, pcfg) in pop.iter().enumerate() {
for _ in 0..TRIALS_PER_PCFG {
// Generate random cmd line args to run the program with
let prog_args = gen_main_args(args);
// Generates a test and runs it
let (brc, prog, result) =
gen_good_test(pcfg, args, test_pipeline, &prog_args, pop_idx);
bvecs.push((
// Keep track of which pcfg a behavior vector came from
i,
// Construct a behavior vector from a trace file saved to disk
BehaviorVec::new(
&format!("out/rt_{pop_idx}.trace"),
FailureType::from(&result),
),
));
}
}
// Increment number of generations
local_stats.gen_count += 1;
// Select the subset of the population that can reproduce
let r: Vec<_> = select(bvecs, archive, stats, local_stats, params)
.iter()
.map(|(pop_id, _)| pop[*pop_id].clone())
.collect();
// Spawn the next generation
reproduce(&r, params)
/// Selects `SELECT_SIZE` elements from `gen` probabilistically, weighted by their
/// novelty
/// # Arguments
/// * `gen` - The population to select from. The first element of each tuple is
/// the index of the behavior vec's pcfg in the population.
/// # Returns
/// A vector of the selected behavior vectors. The first element of each tuple
/// is the index of the behavior vec's pcfg in the population.
#[allow(clippy::cast_precision_loss)]
fn select(
mut gen: Vec<(usize, BehaviorVec)>,
archive: &Arc<RwLock<Archive>>,
params: &PopParams,
) -> Vec<(usize, BehaviorVec)> {
let old_len = gen.len();
assert!(gen.len() % TRIALS_PER_PCFG == 0);
// Gets the behavior vector with the median novelty out of all the trials
// ran for a given PCFG
gen = extract_medians(gen, archive);
assert!(old_len / TRIALS_PER_PCFG == gen.len());
// ith element of distance is the average distance of the ith element of gen
let mut distances = vec![];
let mut to_save = vec![];
{
// Gets the arvhive point cloud and adds the current generation
// points tp it
let mut cur_gen = archive.read().unwrap().clone_point_cloud();
for (_, individual) in &gen {
cur_gen.add_point(individual);
}
// Computes novelty
for (pcfg_idx, b) in &gen {
let dist = cur_gen
.get_nearest_k(b, K)
.iter()
.map(|(b, _)| b)
.sum::<f64>()
/ K as f64;
distances.push(HeapElem {
dist,
index: *pcfg_idx,
});
// Saves novel data points to the archive
if dist > max_dist * 0.8 || dist > NOVELTY_THRESH {
to_save.push(b.clone());
}
}
}
for e in to_save {
archive.write().unwrap().add_point(e);
}
gen_select_set(&distances, gen, params)
}
/// Selects `params.select_size` elements from `gen` probabilistically, weighted by their
/// novelty (`distances`).
fn gen_select_set(
distances: &[HeapElem],
gen: Vec<(usize, BehaviorVec)>,
params: &PopParams,
) -> Vec<(usize, BehaviorVec)> {
assert_eq!(distances.len(), gen.len());
let sum: f64 = distances.iter().map(|e| e.dist).sum();
let sum = sum.max(f64::EPSILON);
// Mathces each individual with its normalized novelty
// The normalized novelty will be the probability that pcfg
// is chosen to reproduce
let res: Vec<_> = gen
.into_iter()
.zip(distances.iter().map(|e| e.dist / sum))
.collect();
let r: Vec<_> = res
.choose_multiple_weighted(
&mut rnd::get_rng(),
params.global.select_size,
|e| e.1,
)
.unwrap()
.collect();
r.into_iter()
.map(|((pcfg_idx, bv), _)| (*pcfg_idx, bv.clone()))
.collect()
}
Behavior vectors are encoded like so:
/// Classifies the number of instructions into a given class on a log scale.
const fn icount_class(icount: i64) -> i64 {
if icount < 1 {
0
} else if icount < 10 {
1
} else if icount < 100 {
2
} else if icount < 1_000 {
3
} else if icount < 10_000 {
4
} else if icount < 100_000 {
5
} else {
6
}
}
impl BehaviorVec {
pub const fn vectorize(&self) -> [i64; 13] {
// note, these are all dynamic properties of the execution
// not static properties of the source code
[
icount_class(self.icount),
// put the most weight on programs which fail differently
self.failure_type as i64 * 10,
icount_class(self.arithmetic_ops()),
icount_class(self.control_flow_ops()),
icount_class(self.bool_ops() + self.cmp_ops()),
self.max_loop_nest,
icount_class(self.catches),
icount_class(self.duffs_count),
icount_class(self.switch_cases),
icount_class(self.switch_default),
icount_class(self.break_count),
icount_class(self.continue_count),
icount_class(self.max_repeat_jmps),
]
}
}
To speed up the search, and run more test programs, we also employ a multi-population search. Each population evolves, as described above, mostly independently and in separate threads. Every 8 generations, all of the populations will push the genotype of an individual from their most recent generation into a collective “gene pool.” Each population will then use the genes from the gene pool to generate $t$ individuals via crossover, mutation, and reproduction that will belong to each population’s next generation. The remaining individuals will be generated as previously described.
The idea behind multi-population search is inspired by a situation in biology where a species might become isolated on separate islands and evolve independently. At some point, an individual might travel from one island to another, bringing their genes to a different population, however, for the most part, each island population evolves independently and diverges from another.
In the genetic programming paradigm, the goal is to generate more programs to explore more parts of the search space in parallel.
This approach is especially useful to Bear, where much of the time spent by each thread will be on running test programs, reading traces, and dumping logs, since invoking a compiler pass to test involves spawning a subprocess and performing file I/O. Therefore, by using many different threads, we can fill the CPU with active work to be done whenever a thread needs to stall to perform one of these activities.
I ran Bear to test my LVN, SSA, LICM, and DCE implementations. Within seconds, it was able to find at least two bugs in LICM and DCE. I say at least two because I’ve been lazy and haven’t taken the time to investigate the issues, so among the many test programs Bear found to expose bugs, I can only be certain that there are two unique bugs without spending more time to debug the programs.
It took some more runtime, but within 20 minutes, Bear found a bug in my implementation of SSA conversion. Within a few hours, Bear also found a bug in my implementation of LVN involving the handling of large integer constants.
Pass | Min Unique Bugs |
---|---|
SSA | 1 |
LVN | 1 |
DCE | 2 |
LICM | 2 |
In terms of runtime performance, it’s very dependent on the runtime of the compiler pipeline being tested but I’ve found that for a pipeline with two passes being tested (SSA in and SSA out) Bear can generate, run, and evaluate 9.85 tests per second.
This was measured by running Bear for about a minute, dividing the number of tests run by the total running time, and averaging the measured tests per second over 3 trials. These tests were run on an Ubuntu 22.04 machine with Linux kernel version 6.2.0-37. The machine has an Intel i7-13620H with 16 GB RAM.
Find more information and references here