
I’ve always been interested in being able to make guarantees about the correctness of a program without the high cost of doing something like a full-on verification in a proof assistant such as Coq. Moreover, avoiding extra runtime costs would be extremely nice as well, especially in systems programming domains.

I recently began working on an LLVM pass which aims to guarantee that a program never loads from or stores to a null pointer. The analysis is conservative: if it cannot prove that a pointer is non-null, it produces a compiler error.

This post aims to quickly discuss a few standard compiler concepts, talk about some design and implementation points, and highlight some interesting problems I’ve encountered so far.

We’re going to frame the null analysis as a dataflow analysis, which can be run in a generalized dataflow framework.

A partial order is a binary relation on a set $\Gamma$ that is reflexive, transitive, and anti-symmetric. A binary relation is a set $R$ such that if $x \in \Gamma$ relates to $y \in \Gamma$, then $(x, y) \in R$.

  • A binary relation is reflexive if every element relates to itself. Formally, if $x \in \Gamma$, then $(x, x) \in R$.
  • A relation is transitive if whenever $x$ relates to $y$ and $y$ relates to $z$, then $x$ relates to $z$. Formally, $(x, y), (y, z) \in R \Rightarrow (x, z) \in R$. Note that this definition means that the relation $R = \{(1, 2)\}$ is (vacuously) transitive.
  • Finally, a relation is anti-symmetric if whenever $x$ relates to $y$ and $y$ relates to $x$, then $x = y$. The canonical examples of anti-symmetric relations are $\le$ and $\ge$.

The key thing about a partial order (especially in comparison to a total order) is that some elements may be incomparable. A partial order you may be familiar with is the subset relation ($\subseteq$). For example: {$x, y$} $\not \subseteq$ { $a, b, c$ } and { $a, b, c$ } $\not \subseteq$ { $x, y$ }.

A set with a partial order is commonly referred to as a poset. From this point on, we’ll be working with the poset $(\Gamma, \sqsubseteq)$ which is the set $\Gamma$ with the ordering $\sqsubseteq$.

For some posets, there may exist an element $\bot$ (“bottom”) such that $\forall x \in \Gamma, \bot \sqsubseteq x$. That is, $\bot$ is “lower” or less than all elements in the ordering. Likewise, there may exist an element $\top$ (“top”) such that $\forall x \in \Gamma, x \sqsubseteq \top$. That is, $\top$ is “higher” or greater than all elements in the ordering. A poset in which any two elements have a least upper bound is known as an upper semi-lattice, and one in which any two elements have a greatest lower bound is known as a lower semi-lattice. If a poset has both, it is called a lattice (and if every subset, not just every pair, has a least upper bound and a greatest lower bound, it is a complete lattice). The name lattice comes from visualizing posets with Hasse diagrams.

Elements $x$ and $y$ in a lattice or lower semi-lattice have a greatest lower bound, or “meet”, which is denoted as $x \sqcap y$. The greatest lower bound is the largest element that is less than or equal to both $x$ and $y$. For example, if $\Gamma$ is the set of all sets of lower-case alphabetic characters and $\sqsubseteq$ is the subset relation, then {$a, b$} $\sqcap$ {$a, c$} is {$a$}. Furthermore, in this example, $\bot = \emptyset$.

Elements $x$ and $y$ in a lattice or upper semi-lattice have a least upper bound, or “join”, which is denoted as $x \sqcup y$. The least upper bound is the smallest element that is greater than or equal to both $x$ and $y$. Building off the previous example, {$a$, $b$} $\sqcup$ {$a$, $c$} is {$a, b, c$}. Likewise, $\top$ is the set of all 26 lowercase letters.

Observe that in this example, where $\bot = \emptyset$ and $\sqsubseteq$ is the subset relation, $\sqcap$ is set intersection and $\sqcup$ is set union. If we instead flip the lattice so that $\top$ is the empty set, then $\sqcap$ is set union, $\sqcup$ is set intersection, and $\sqsubseteq$ is the superset relation, $\supseteq$.
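To make that concrete, here is a small sketch (my own illustration, not code from the pass) of the meet and join for this powerset lattice using std::set:

#include <algorithm>
#include <iterator>
#include <set>

using Fact = std::set<char>;  // a set of lower-case letters

// Greatest lower bound under the subset ordering: set intersection.
Fact meet(const Fact& A, const Fact& B)
{
    Fact Res;
    std::set_intersection(A.begin(), A.end(), B.begin(), B.end(),
                          std::inserter(Res, Res.end()));
    return Res;
}

// Least upper bound under the subset ordering: set union.
Fact join(const Fact& A, const Fact& B)
{
    Fact Res;
    std::set_union(A.begin(), A.end(), B.begin(), B.end(),
                   std::inserter(Res, Res.end()));
    return Res;
}

// meet({'a','b'}, {'a','c'}) == {'a'}; join({'a','b'}, {'a','c'}) == {'a','b','c'}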

We can view dataflow analysis through the lens of order theory. That is, a dataflow analysis consists of a set of facts, $\Gamma$, an ordering $\sqsubseteq$, a meet operator $\sqcap$, a direction $D$, and a transfer function $F : \Gamma \rightarrow \Gamma$ which determines how an input fact is going to be transformed for a CFG node $n$. An analysis is forward if it propagates facts from the CFG start to the end node, or backward if it propagates facts from the exit to the start. Typically, I find it easier to think of $F$ as a set of transfer functions $F_n$ where $F_n$ is the specific transfer function for a node $n$. Nodes in a CFG are typically basic blocks, however, I like to think of $F_n$ as operating on the level of single instructions because it leads to simpler transfer functions and allows factoring out the logic for traversing a basic block into the generalized data flow framework.

[Diagram: a basic block whose instructions each apply their transfer function $F_n$ to the incoming fact]

In this diagram, the outer box represents a basic block with a sequence of instructions. Input facts come from the predecessor blocks, are converged with the meet operator, and then threaded through each instruction of the basic block. Each $F_n$ represents the transfer function of a particular instruction. At the terminator of the basic block, facts are propagated to the basic block’s successor nodes.
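In code, a generic framework might thread facts through a block roughly like the following sketch. The `Fact`, `Meet`, and `Transfer` parameters are assumptions standing in for whatever a client analysis supplies; this is not the pass’s actual interface.

#include <map>

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/CFG.h"
#include "llvm/IR/Instruction.h"

// Sketch: compute out[BB] given the out-facts of BB's predecessors.
template <typename Fact, typename MeetFn, typename TransferFn>
Fact analyzeBlock(const llvm::BasicBlock& BB,
                  const std::map<const llvm::BasicBlock*, Fact>& OutFacts,
                  Fact Top, MeetFn Meet, TransferFn Transfer)
{
    // Converge the facts flowing in from every predecessor.
    Fact In = Top;
    for (const llvm::BasicBlock* Pred : llvm::predecessors(&BB)) {
        if (const auto It = OutFacts.find(Pred); It != OutFacts.end()) {
            In = Meet(In, It->second);
        }
    }
    // Thread the fact through each instruction's transfer function F_n.
    for (const llvm::Instruction& Inst : BB) {
        In = Transfer(Inst, In);
    }
    return In;  // becomes out[BB], which is propagated to the successors
}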

A dataflow analysis can be solved by an iterative algorithm, which in pseudocode is as follows

for n in nodes:
    out[n] = T

while changing:
    for n in nodes:
        in[n] = Meet([out[n_] for n_ in predecessors(n)])
        out[n] = Fn(in[n])

In practice, I tend to implement the worklist algorithm which in pseudocode is

W = []
for n in nodes:
    out[n] = T
    W.push(n)

while W not empty:
    n = W.pop()
    in[n] = Meet([out[n_] for n_ in predecessors(n)])
    out_n = Fn(in[n])
    if out_n != out[n]:
        out[n] = out_n
        for s in successors(n):
            W.push(s)

If each transfer function is monotonic (informally, it either adds stuff or removes stuff but not both) and the lattice has no infinitely descending chains, then the algorithm will terminate with a solution. Formally, monotonicity requires that $$x \sqsubseteq y \Rightarrow F_n(x) \sqsubseteq F_n(y)$$

I personally like using the lens of order theory as it helps sanity-check the correctness of my analyses. For example, if your meet operator is bringing elements toward your top element, something’s not right.

Oftentimes I like to have my lattice top be the empty set. This is particularly useful when other options for $\top$ aren’t easily computable, such as the set of all copies in an available-copies analysis. But this means that my meet operator must be a set union. However, for some analyses, such as available copies, meeting loses information, which calls for a set intersection. Therefore, what I end up doing is keeping track of the elements that are not in the set.
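For example (a sketch of the idea, not code from any particular pass), a “flipped” available-copies fact could store the copies known not to be available; $\top$ stays the cheap-to-compute empty set, and a union of the complements acts like an intersection of the underlying available sets:

#include <set>
#include <string>

// Complement representation: the copies known NOT to be available (keyed here
// by a hypothetical string name). ⊤ is the empty set, which is trivial to build.
using NotAvailable = std::set<std::string>;

// Union of complements == intersection of the underlying "available" sets.
NotAvailable meet(const NotAvailable& A, const NotAvailable& B)
{
    NotAvailable Res = A;
    Res.insert(B.begin(), B.end());
    return Res;
}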

I think this kind of playing with the ordering and flipping lattices around is quite useful, and I think it becomes more intuitive with an understanding of the underlying theory.

Sometimes, it’s helpful to think of a dataflow analysis as an execution of the program over an abstracted domain of a set of program states. Program states we might be interested in could be the values of variables or the data stored in the heap. A set of concrete program states can be abstracted by an abstraction function, which forms a mapping between concrete program states in a concrete lattice and abstract states in an abstract lattice.

Suppose our program contains a single pointer, $x$, so its concrete state would be the value of this pointer. Let the abstract domain of the single pointer be three values: $\top$, NonNull, and $\bot$/MaybeNull. In this case, $\top$ represents an uninitialized value, NonNull is a nonzero value, and $\bot$ is 0 or an unknown value. Here $x$, which in the concrete domain may be one of $2^{64} + 1$ possible values (every 64-bit pattern, plus “uninitialized”), abstracts to just 3. We can define a representation function $\beta$ as

$$ \beta(x) = \begin{cases} \text{NonNull} & \text{if } x \neq 0 \\ \top & \text{if } x = \top \\ \bot & \text{otherwise} \end{cases} $$

Then, for a lattice over a set of program states (such as all pointers in the program), the abstraction function $\alpha$ could be defined as

$$\alpha(\gamma) = \{\, \beta(x_i) \mid x_i \in \gamma \,\}$$

That is to say, we can abstract a set of program states as the set of abstracted program states mapped by the representation function.

So in our example

$$ \alpha(\{\, x: 0,\ y: \text{0xFFEF},\ z: \text{0x678F},\ w: 0 \,\}) = \{\, \beta(x), \beta(y), \beta(z), \beta(w) \,\} = \{\, \bot, \text{NonNull}, \text{NonNull}, \bot \,\} $$

Under abstract interpretation, we are kind of dealing with a nested lattice. In this example, we have a lattice of sets of abstract pointers, and each abstract pointer is a member of the $\top$ - NonNull - $\bot$ lattice.

With this in mind, we can finally discuss the design of the analysis. A fact in our lattice will be a tuple of all the pointers in the program. Therefore, $\top$ is the tuple of all pointers having the top value in the MaybeNull/NonNull “inner” lattice, and $\bot$ is the tuple of all pointers having the abstract MaybeNull value. Meet will be the element-wise meet of the tuples. That is, we will take the meet of all corresponding abstract pointers in the MaybeNull/NonNull lattice. While it doesn’t manifest itself in the implementation, our ordering will be the element-wise ordering in the MaybeNull/NonNull lattice. So $(\bot, \bot) \sqsubseteq (\text{NonNull}, \bot)$, while $(\bot, \text{NonNull})$ cannot be compared to $(\text{NonNull}, \bot)$.

In the actual implementation, instead of doing an extra pass to identify all pointers first, I consider any pointer not added to the abstract value mapping yet as implicitly $\top$.
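Here’s a condensed sketch of what that fact might look like, assuming a simplified NullState enum in place of the real shared-pointer abstract values (illustrative only, not the pass’s actual code):

#include <map>

#include "llvm/IR/Value.h"

enum class NullState { Top, NonNull, MaybeNull };  // the "inner" lattice

// Meet in the inner lattice: Top is the identity, and anything met with a
// disagreeing or MaybeNull value falls to MaybeNull.
NullState meet(NullState A, NullState B)
{
    if (A == NullState::Top) { return B; }
    if (B == NullState::Top) { return A; }
    return A == B ? A : NullState::MaybeNull;
}

// Pointers absent from the map are implicitly Top.
using Fact = std::map<const llvm::Value*, NullState>;

NullState lookup(const Fact& F, const llvm::Value* V)
{
    const auto It = F.find(V);
    return It == F.end() ? NullState::Top : It->second;
}

// Element-wise meet: only keys present in either map need to be visited.
Fact meet(const Fact& A, const Fact& B)
{
    Fact Res = A;
    for (const auto& [V, S] : B) {
        Res[V] = meet(lookup(A, V), S);
    }
    return Res;
}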

Our transfer function is going to depend on the instruction type. As in conditional constant propagation, the output fact can differ across different outgoing edges. Consider the following:

%5 = icmp ne ptr %4, null
br i1 %5, label %6, label %8

We’re only going to branch to label 6 if %4 is non-null. Therefore, on the edge going to label %6, we can set %4 as NonNull. On the other hand, when we branch to label %8, we know that %4 is null, so we can set its abstract value to MaybeNull.

A similar process can be done for the eq opcode of the icmp instruction as well.
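Roughly, this refinement might look like the following sketch, reusing the simplified NullState/Fact types from above (the real pass works on its richer abstract values, so treat this as an approximation):

#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"

// Sketch: refine the facts on the two outgoing edges of a conditional branch
// whose condition is `icmp ne/eq <ptr>, null`.
void refineOnBranch(const llvm::BranchInst& Br, Fact& TrueFact, Fact& FalseFact)
{
    if (!Br.isConditional()) { return; }
    const auto* Cmp = llvm::dyn_cast<llvm::ICmpInst>(Br.getCondition());
    if (Cmp == nullptr) { return; }
    // Find which operand is the pointer being compared against null.
    const llvm::Value* Ptr = nullptr;
    if (llvm::isa<llvm::ConstantPointerNull>(Cmp->getOperand(1))) {
        Ptr = Cmp->getOperand(0);
    } else if (llvm::isa<llvm::ConstantPointerNull>(Cmp->getOperand(0))) {
        Ptr = Cmp->getOperand(1);
    } else {
        return;
    }
    if (Cmp->getPredicate() == llvm::CmpInst::ICMP_NE) {
        TrueFact[Ptr] = NullState::NonNull;     // taken only when Ptr != null
        FalseFact[Ptr] = NullState::MaybeNull;  // Ptr is known null here
    } else if (Cmp->getPredicate() == llvm::CmpInst::ICMP_EQ) {
        TrueFact[Ptr] = NullState::MaybeNull;
        FalseFact[Ptr] = NullState::NonNull;
    }
}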

Here’s a summary of some of the other currently implemented transfer functions, where State is a map from llvm::Value* to that value’s respective abstract value. Abstract values are stored in shared pointers to allow the analysis to work with nested pointers and aliases. The way to read this table is: given an instruction $x$, pick the transition $(u, S)$ where $u$ is the most specific instruction pattern matching $x$, and $S$ is $u$’s corresponding state change.

Instruction                     | State Change
%1 = alloca ...                 | State[%1] = NonNull
%1 = load ptr, ptr %2, ...      | State[%1] = State[%2]
store ptr %2, ptr %1, ...       | *State[%1] = *State[%2]
%1 = call nonnull ... ptr ...   | State[%1] = NonNull
%1 = call ptr ...               | State[%1] = MaybeNull
%1 = phi [%2 ...] [%3 ...] ...  | State[%1] = State[%2] $\sqcap$ State[%3]
%1 = getelementptr ...          | ?
%1 = ...                        | State[%1] = MaybeNull

There’s one more instruction of interest: getelementptr. Suppose we are accessing an array; in LLVM, this might look like the following:

%110 = load i32, ptr %12, align 4
%111 = sext i32 %110 to i64
%112 = getelementptr inbounds [513 x i8], ptr %9, i64 0, i64 %111
%113 = load i8, ptr %112, align 1

How can we know whether the array access results in a potentially null pointer? For that, we’re going to need an interval analysis.

But before we move on, one last thing to note is that we also need to take care of function arguments: a pointer argument is NonNull if it has the LLVM attribute of the same name, otherwise it must be MaybeNull.
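As a sketch of that rule (again using the simplified types from above; the real pass may handle attributes differently):

#include "llvm/IR/Function.h"

// Sketch: seed the entry fact for a function's pointer arguments. An argument
// is NonNull only if it carries LLVM's `nonnull` attribute.
void initArguments(const llvm::Function& F, Fact& Entry)
{
    for (const llvm::Argument& Arg : F.args()) {
        if (!Arg.getType()->isPointerTy()) { continue; }
        Entry[&Arg] = Arg.hasNonNullAttr() ? NullState::NonNull
                                           : NullState::MaybeNull;
    }
}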

Interval analysis is similar in spirit to conditional constant propagation, but instead of keeping track of single constants, it keeps track of the range of values that each variable may take on.

Suppose we know the size of a char array to be 512. Then if we can be sure that the index into the array is between 0 and 511 (inclusive), the access is safe!

Since we’re concerned with indexing into an array, we only care about integral types. So our fact is going to be a mapping from integer values to bounds. Bounds form a lattice where each element consists of two integers, Lower and Upper, that represent an inclusive range. $\top$ in this lattice is going to be an unknown bound, and $\bot$ will be the bound $(-\infty, \infty)$. Therefore, in our lattice of program states, $\top$ will be the mapping of every value to an unknown bound and $\bot$ will be the mapping of every value to the unbounded range $(-\infty, \infty)$.

$\sqcap$ on our lattice of program states will be $\sqcap$ on the bounds of corresponding elements. The meet of two bounds is rather tricky, so we’re going to come back to this later. We’ll also define arithmetic and comparison operators on bounds. These mostly take the form of:

$$ Op(a, b) = \begin{cases} b & \text{if } a = \top \\ a & \text{if } b = \top \\ \bot & \text{if } a = \bot \lor b = \bot \\ OpInt(V(a), V(b)) & \text{otherwise} \end{cases} $$

Where $V(a)$ means the value of $a$. The operation on an interval ($OpInt$) involves constructing the smallest interval that contains the applications of the integer operation on all four pairs of upper and lower bounds of the two intervals. For example:

IntRange operator*(const IntRange& A, const IntRange& B)
{
    IntRange Res;
    const auto LL = A.Lower * B.Lower;
    const auto LU = A.Lower * B.Upper;
    const auto UL = A.Upper * B.Lower;
    const auto UU = A.Upper * B.Upper;
    Res.Lower = min({LL, LU, UL, UU});
    Res.Upper = max({LL, LU, UL, UU});
    return Res;
}

We need to take all four combinations into account, instead of just applying the integer operation to the two lower bounds and the two upper bounds, because of cases such as $[-5, 2] * [-1, 3]$. If we construct the new interval as [a.lower * b.lower, a.upper * b.upper], then we would be left with [5, 6], which doesn’t contain the value 0, a possible result if either a or b is 0.
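For contrast, here’s a sketch (not an operator shown in the post, but reusing the same IntRange type) of interval addition, which is monotonic in both operands, so the endpoints can simply be added pairwise:

// Addition is monotonic in both operands, so the extremes of the result come
// from adding corresponding endpoints; no need for all four combinations.
IntRange operator+(const IntRange& A, const IntRange& B)
{
    IntRange Res;
    Res.Lower = A.Lower + B.Lower;
    Res.Upper = A.Upper + B.Upper;
    return Res;
}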

Like the null analysis, what makes this useful is if we propagate different facts along different edges of a branch.

Consider the following:

[Figure: a CFG branch on the condition i < len]

What can we be sure of being true? Well, on the true edge, we know that $i < len$ and $len > i$. On the false edge, we know that $i \ge len$ and $len \le i$.

[Figure: the branch annotated with the facts known on the true and false edges]

To understand how we need to meet bounds, let’s walk through an example that will roughly correspond to the following code:

assert(len < 100);
for (int i = 0; i < len; ++i) {
    // ..
}
for (int j = 0; j < len; ++j) {
    // ..
}

We can imagine that in a real program, these loops might contain accesses to an array that has 100 elements. The following is a simplified instruction-level CFG of this program.

[Figure: simplified instruction-level CFG of the two loops, with meet nodes shown between blocks]

We will consider unlabelled variables on an edge as being $\top$. We’ll start by assuming that we know that $len \in (-\infty, 99]$. The meet nodes don’t exist in the CFG and are just for visualization purposes. They represent the meeting of a node’s input facts before they are passed to the node’s transfer function. You can think of them as $\phi$ nodes in this CFG, however, my implementation takes into account bounds stored in memory, so that isn’t a perfect analogy.

Let’s go through this step-by-step. We first come to i = 0. The natural thing to do would be to set the bound of i as $[0, 0]$. At the meet point, we’d be meeting this bound with $\top$, so this bound is passed to the branch.

Now we reach a branch. On the true branch, we know that i < len and len > i, and on the false branch, we know that i >= len and len <= i. It’s a bit tricky to see how to update the bounds correctly, but the somewhat intuitive solution I came up with is, for a less-than comparison a < b, to set the upper bound of a to a.upper = min(a.upper, b.upper - 1). Analogous updates take place for >, <=, and >=:

Comparison | Update
a < b      | a.upper = min(a.upper, b.upper - 1)
a <= b     | a.upper = min(a.upper, b.upper)
a > b      | a.lower = max(a.lower, b.lower + 1)
a >= b     | a.lower = max(a.lower, b.lower)

This means on the true branch, we get to assume that i: [0, 0], len: [1, 99], and on the false branch we have i: [0, 0] and len: [0, 0].
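These refinements are easy to express over the IntRange type from before; here’s a sketch of the two rules applied on the true edge of i < len (the other comparisons are analogous):

#include <algorithm>

// Sketch: refine A under the assumption that "A < B" was just observed true.
void assumeLessThan(IntRange& A, const IntRange& B)
{
    A.Upper = std::min(A.Upper, B.Upper - 1);
}

// Sketch: refine B under the symmetric fact "B > A".
void assumeGreaterThan(IntRange& B, const IntRange& A)
{
    B.Lower = std::max(B.Lower, A.Lower + 1);
}

// On the true edge of `if (i < len)`:
//   assumeLessThan(I, Len);     // i.upper = min(i.upper, len.upper - 1)
//   assumeGreaterThan(Len, I);  // len.lower = max(len.lower, i.lower + 1)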

Now we reach i++. The intuitive thing would be to increment the bound of i, so let’s do that.

This brings us to the question I put off earlier: how do we meet bounds? The most intuitive thing (at least for me) would be to take the smaller lower bound and the greater upper bound. This yields the result of the meet as i: [0, 1] and len: [-inf, 99].

Let’s see what happens if we do this.

Hmm, that can’t be right. Observe what happens to i when we analyze the loop again. We increment the bound i: [0, 1] due to i++ to get i: [1, 2]. Then, when we meet i: [1, 2] with i: [0, 0] we get i: [0, 2]. Rinse and repeat and you have yourself a bona-fide infinite loop where the upper limit of i will just keep increasing.

Ok, ok. What if, when we have two non-overlapping bounds, we meet them to $\bot$? This is what conditional constant propagation does, after all. This idea is safe, but it’s very conservative. Consider a meet that is not part of a loop, say a: [0, 10] flowing in from an if block and a: [20, 30] flowing in from the corresponding else block. Are we really going to say that after meeting, a is $\bot$? Come on, let’s say that a: [0, 30].

Alright, so if we want to support both cases, we need something more sophisticated.

The key culprit is mutation, an all too common troublemaker in computer science. The mutation I’m talking about can come from either memory stores or $\phi$ nodes, although one might not typically think of $\phi$ nodes as mutating. We’ll need to augment our analysis to not just keep track of ranges, but also mutations. A problematic mutation occurs when an incoming value to a $\phi$ node uses the same $\phi$ node or when a new range overwrites a preexisting one in a store instruction.

When we meet i: [0, 0] with i: [1, 1] in our previous example, what should happen? Well, we should get i: [0, inf]. So suppose value A overwrites value B. We’ll first flag the new value A as performing a mutation. Now let’s say A later meets with C, which was not flagged as mutating. If A.lower > C.lower and A.upper > C.upper, then the result’s upper bound should be $\infty$; otherwise, it is max(A.upper, C.upper). Similarly, if A.lower < C.lower and A.upper < C.upper, then the result’s lower bound should be $-\infty$; otherwise, its lower bound is min(A.lower, C.lower).
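In code, that rule might look something like this sketch (assuming the IntRange type from before with integer bounds, a mutation flag, and sentinel constants standing in for $\pm\infty$; the real implementation may differ):

#include <algorithm>
#include <limits>

// Hypothetical sentinels standing in for -inf/+inf.
constexpr long long NEG_INF = std::numeric_limits<long long>::min();
constexpr long long INF = std::numeric_limits<long long>::max();

// Sketch of the mutation-aware meet: A is the new value (possibly flagged as a
// mutation), C is the value it meets with.
IntRange meetWithMutation(const IntRange& A, bool AMutated, const IntRange& C)
{
    IntRange Res;
    if (AMutated && A.Lower > C.Lower && A.Upper > C.Upper) {
        Res.Upper = INF;  // the value is strictly growing: widen upward
    } else {
        Res.Upper = std::max(A.Upper, C.Upper);
    }
    if (AMutated && A.Lower < C.Lower && A.Upper < C.Upper) {
        Res.Lower = NEG_INF;  // the value is strictly shrinking: widen downward
    } else {
        Res.Lower = std::min(A.Lower, C.Lower);
    }
    return Res;
}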

This yields the desired result for both cases! Before we fix the result of the meet for the first loop, let’s finish the first iteration for the second loop.

I’ve gone ahead and propagated the values through the assignment to j and the meet, as it is the same as the first loop. So now what happens on the true branch of if j < len? Well, by our previously discovered rule, we get j: [0, -1] and len: [1, 0]. Something is not right; these bounds are invalid. Let’s just roll with it for now, and we’ll come back and fix things so the bounds at least make sense.

Now, when we meet i: [0, 1] with i: [0, 0] by our previous rule, we get i: [0, 1] since both bounds of the mutated value are not strictly greater than the old value. But consider the case when we increment j only when a condition is true. That is to say, j++ doesn’t always execute. After we meet j++ with the path that did not increment j, we’re going to end up with the bound j: [-1, 1]. But now, no matter what we do, we’re always going to have the lower bound of j being -1. While this is safe, it’s way too conservative since there’s no way for j to become negative, and if j accesses some array inside the loop, we’d like for the analysis to be able to tell us that it’s safe.

Ok, so what went wrong? Well, we don’t want these invalid bounds affecting future iterations. So the solution I came up with is to set bounds to $\top$ if they ever become invalid. This allows the invalid bounds to be ignored once they reach the next meet.
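Concretely, this policy amounts to a check like the following sketch (assuming, as before, that values absent from the fact are implicitly $\top$; the real pass keys the map by canonical values):

#include <map>

#include "llvm/IR/Value.h"

// Sketch: if a range ever becomes empty (lower > upper), drop it from the
// fact; absent entries are treated as ⊤ and thus ignored at the next meet.
void normalizeBound(std::map<const llvm::Value*, IntRange>& Fact,
                    const llvm::Value* V)
{
    const auto It = Fact.find(V);
    if (It != Fact.end() && It->second.Lower > It->second.Upper) {
        Fact.erase(It);
    }
}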

Going back to our original example, if we suppose on the second iteration that len will be something more permissive than [0, 0] (as you’ll soon see when we return to the first loop), then this invalid-bounds policy gives the following results.

Now the meet produces j: [0, 0] and len: [-inf, 99], which as you may recall were the initial states of the first loop. This sets us up for the second loop to follow in the footsteps of the first loop.

So with these hiccups resolved, let’s continue to the second iteration of the iterative solving algorithm. For the first meet operation, as we discussed earlier, we get i: [0, inf] and len: [-inf, 99]. Propagating this to the first branch instruction yields i: [0, 98] and len: [1, 99] for the true branch. After i++, this becomes i: [1, 99] and len: [1, 99]. These are the fixed point or steady-state bounds for the first loop.

Turning our attention to the false branch, we will get i: [0, inf] and len: [-inf, 99]. Propagating this into the second loop gives the same results as the first iteration of the first loop.

And finally, one more iteration reaches the overall fixed point.

The key thing to notice is that, in the loop, we have guaranteed bounds on i and j. Consider the very common pattern of code that looks something like this:

// length of arr >= len
for (int i = 0; i < len; ++i) {
    /* tmp */ = arr[i];
    // ...
    arr[i] = // ...
}

We know that these memory loads and stores are safe at compile time, without any runtime checks!

With this analysis, we can finish the transfer function of getelementptr for our null analysis. In particular, if we additionally keep track of the known size of the data each pointer refers to, then the transfer function of getelementptr looks like the following:

%1 = getelementptr ..., ptr %2, i64 0, i64 %3, ...
        |
        v
%1 = NonNull   if Bound(%3).lower >= 0 && Bound(%3).upper < Size(%2)
%1 = MaybeNull otherwise
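As a sketch of that rule (with the index’s interval and the base pointer’s tracked element count assumed to have been looked up already, and NullState from the earlier sketch):

// Sketch: the getelementptr rule, given the interval of the index operand and
// the tracked size (in elements) of the data the base pointer refers to.
NullState transferGEP(const IntRange& IndexBound, long long BaseSize)
{
    if (IndexBound.Lower >= 0 && IndexBound.Upper < BaseSize) {
        return NullState::NonNull;  // every possible index is in bounds
    }
    return NullState::MaybeNull;
}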

So far I haven’t considered how to handle logical AND or logical OR. Due to short-circuiting, the Clang frontend implements logical AND as a branch and a $\phi$ node where the incoming value along the short-circuiting (false) edge is false. Similarly for logical OR, except the incoming value along the short-circuiting (true) edge is true. So && looks like the following:

5:                                                ; preds = %13, %0
  %6 = load ptr, ptr %3, align 8
  %7 = icmp ne ptr %6, null
  br i1 %7, label %8, label %11

8:                                                ; preds = %5
  %9 = load i8, ptr %2, align 1
  %10 = trunc i8 %9 to i1
  br label %11

11:                                               ; preds = %8, %5
  %12 = phi i1 [ false, %5 ], [ %10, %8 ]
  br i1 %12, label %13, label %15

To support this, I do something extra to handle branches whose condition is a $\phi$ node with two incoming values, one of which is a boolean literal. If the boolean literal is false and the true branch is taken, then the non-literal incoming value must have been true, so we recurse: we call the transfer function on the non-literal value and propagate the facts we can assume when it is true to the true edge of the original branch. Symmetrically, if the boolean literal is true and the false branch is taken, then the non-literal incoming value must have been false, so we recurse on it and propagate any new assumptions we can make when it is false to the false edge of the original branch.
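A rough sketch of that handling, reusing the simplified Fact type from earlier and a hypothetical refineEdge helper in place of the pass’s real recursive transfer call:

#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"

// Hypothetical stand-in for the pass's recursive transfer call: refine `F`
// under the assumption that `Cond` evaluated to `Taken`.
void refineEdge(const llvm::Value* Cond, bool Taken, Fact& F);

// Sketch of the short-circuit handling: a conditional branch whose condition
// is a phi of a boolean literal and one other value.
void handleShortCircuit(const llvm::BranchInst& Br, Fact& TrueFact, Fact& FalseFact)
{
    if (!Br.isConditional()) { return; }
    const auto* Phi = llvm::dyn_cast<llvm::PHINode>(Br.getCondition());
    if (Phi == nullptr || Phi->getNumIncomingValues() != 2) { return; }
    for (unsigned I = 0; I < 2; ++I) {
        const auto* Lit = llvm::dyn_cast<llvm::ConstantInt>(Phi->getIncomingValue(I));
        if (Lit == nullptr) { continue; }
        const llvm::Value* Other = Phi->getIncomingValue(1 - I);
        if (Lit->isZero()) {
            // Literal is false: reaching the true edge means `Other` was true.
            refineEdge(Other, /*Taken=*/true, TrueFact);
        } else {
            // Literal is true: reaching the false edge means `Other` was false.
            refineEdge(Other, /*Taken=*/false, FalseFact);
        }
    }
}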

Another implementation hiccup: when we propagate facts through a transfer function, we want to make copies of the data. You can observe this by noting that in my interval analysis example, i takes on different bounds depending on where it is in the program. This seems obvious, but recall I mentioned that our null analysis stores abstract values in pointers to handle nested pointers and aliases. Therefore, care needs to be taken to ensure correct deep copies are performed in the copy constructor and copy assignment operator.

For the interval analysis, I didn’t go the pointer route. However, we can’t exactly make copies everywhere because this won’t support storing bounds in memory locations. Instead, I only keep track of canonical values. Intuitively, the canonical value is like the “home” location of a particular value (using “value” in the program sense, not the LLVM sense). This is very much like how one could implement local value numbering (LVN). One could define the canonical value recursively like so:

/**
 * @brief Get the value that is the canonical representative of `V`. The
 * canonical representative is the value that is used to represent `V` in the
 * mapping from values to ranges. For example, if `V` is a load instruction, the
 * canonical representative is the pointer operand of the load instruction.
 */
const llvm::Value* getCanonicalValue(const llvm::Value* V)
{
    if (const auto Cast = dyn_cast<CastInst>(V); Cast != nullptr) {
        return getCanonicalValue(Cast->getOperand(0));
    } else if (const auto Load = dyn_cast<LoadInst>(V); Load != nullptr) {
        return getCanonicalValue(Load->getPointerOperand());
    } else {
        return V;
    }
}

The last implementation detail to note is that my abstract values are wrapped inside a LatticeElem<T>. This type is essentially a variant type that contains either $\top$, $\bot$, or a value of type T. This allows me to factor out basic handling of $\top$ and $\bot$ which single abstract values typically need during abstract interpretation.

For example, LatticeElem<T>::meet allows users to make a specific meet function that assumes that the lattice elements being met are not $\top$ nor $\bot$.

/**
 * @brief Meets two individual lattice elements (greatest lower bound).
 * If either is bottom, returns bottom. If either is top, returns the other.
 * Otherwise, returns the function applied to the two facts.
 *
 * @tparam Meet callable object to perform the meet of two inhabited facts
 * @param A The first fact
 * @param B The second fact
 * @param M The meet function, which must be callable with two facts and
 * return a new fact or a lattice element.
 * @return a new lattice element that represents the meet of the two lattice
 * elements
 */
template <typename Meet>
requires std::is_invocable_v<Meet, const T&, const T&>
static LatticeElem meet(const LatticeElem& A, const LatticeElem& B,
                        Meet&& M)
{
    if (A.isBottom() || B.isBottom()) {
        return makeBottom();
    }
    if (A.isTop()) {
        return B;
    } else if (B.isTop()) {
        return A;
    }
    if constexpr (std::is_convertible_v<decltype(M(A.value(), B.value())),
                                        LatticeElem>) {
        return M(A.value(), B.value());
    } else {
        return LatticeElem(M(A.value(), B.value()));
    }
}
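As a usage sketch (my own example, assuming IntRange is an aggregate with Lower and Upper fields), meeting two inhabited interval facts only requires supplying the interesting case:

#include <algorithm>

// The ⊤/⊥ cases are handled by LatticeElem::meet; the lambda only needs to
// handle two inhabited ranges. This naive meet just takes the smallest range
// containing both (ignoring the mutation logic discussed earlier).
const auto A = LatticeElem<IntRange>(IntRange{0, 10});
const auto B = LatticeElem<IntRange>(IntRange{5, 20});
const auto M = LatticeElem<IntRange>::meet(
    A, B, [](const IntRange& X, const IntRange& Y) {
        return IntRange{std::min(X.Lower, Y.Lower), std::max(X.Upper, Y.Upper)};
    });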

Here’s a small C++ program that concatenates the first letter of every command line argument.

#include <cassert>
#include <cstdio>
#include <cstdlib>
int main(int argc, const char** argv)
{
    if (argc > 256) {
        printf("Too many arguments\n");
        return 1;
    }
    // skip the filename
    int arg_num = argc - 1;
    char args[257];
    for (int i = 0; i < arg_num; ++i) {
        const char** arg = argv + i + 1;
        // analysis can't assume anything about `argv`, so we must
        // check for nulls
        if (arg) {
            const char* word = *arg;
            if (word) {
                args[i] = *word;
            }
        }
    }
    args[arg_num] = '\0';

    printf("%s\n", args);
}

You do have to help the analysis a little bit. In this case, I added the extra checks on argc and argv. Let’s compile this with our analysis and see whether it can tell us the program is safe.

Hm. It’s not liking the line: args[arg_num] = '\0';. Ah, that’s because arg_num can be negative. Nothing is stopping someone from calling main and passing -10 to argc! This oversight is an honest mistake I made when testing the analysis pass: I intended to write a passing test but soon discovered I overlooked argc being negative.

Here’s a program that sorts the characters of its input argument and honestly does pass:

#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cstring>
int main(int argc, const char** argv)
{
    assert(argc == 2);
    const char** arg = argv + 1;
    assert(arg != nullptr);
    const auto str = *arg;
    assert(str != nullptr);
    const auto len = strlen(str);
    assert(len <= 512);
    char sorted[513];
    for (int i = 0; i < len; ++i) {
        const char* c = (str + i);
        assert(c != nullptr);
        sorted[i] = *c;
    }

    for (int i = 0; i < len; ++i) {
        char min = sorted[i];
        int min_idx = i;
        for (int j = i + 1; j < len; ++j) {
            if (sorted[j] < min) {
                min = sorted[j];
                min_idx = j;
            }
        }
        const auto tmp = sorted[i];
        sorted[i] = min;
        assert(min_idx < len);
        sorted[min_idx] = tmp;
    }
    sorted[len] = '\0';

    printf("%s\n", sorted);
    return 0;
}

Again, the analysis requires writing things in a way that helps it out, but overall it can figure out a decent amount of stuff on its own!

But this does reveal a large number of things that aren’t currently supported:

  1. Dynamic intervals. Currently, all intervals are constant, but even if we have two variables, a and b, that could take on any value, knowing that a < b can still be quite useful, for example if a is an iteration bound and b is the size of a dynamic buffer. Having relations between values would also allow the analysis to remove the assert(min_idx < len) in the second example.
  2. Support intrinsics. Right now, I also don’t invalidate pointers when they’re passed to a function. This isn’t good because the callee might set the pointer to null! The reason I haven’t done this yet is that doing it without being overly conservative would require supporting intrinsics. If we also modify the code to set a pointer to null after every free/delete, then we get protection from double-free errors as well.
  3. Use interprocedural analysis. We know all call sites to functions with internal linkage, so we can analyze internal functions with the specific sets of arguments that we pass to them. Likewise, we can analyze the actual return values from functions instead of conservatively assuming all return values are $\bot$.
  4. Handle dynamic arrays. The infrastructure is already in place, but currently malloc doesn’t set the size of the data a pointer refers to.
  5. Handle structs.
  6. For a really big bonus challenge: do something with globals.

I plan to come back to these improvements in the future. But that’s all for now; be sure to check out the source if you’re interested.


Part 2


Source