Accurate collision detection is a vital part of video games and simulation, and fast collision detection is paramount for real-time applications such as games. In high school, building such a fast and accurate collision engine eluded me. This post describes, conceptually, the algorithm I settled upon when developing Project Oort. I will assume a basic familiarity with the problem of collision detection, including bounding spheres and axis-aligned bounding boxes, simple data structures like trees, and the basics of concrete linear algebra including vectors, matrices, and linear transformations. I will try and keep the required mathematical background to just what’s covered in the linked article. My goal is that this should make sense without knowing much more than high school math.
The first big upgrade we can make over a simple AABB collision detection system is to use Oriented Bounded Boxes (OBB). Unlike AABBs, where the sides of the bounding box must be parallel to the global coordinate axes, an OBB can have an arbitrary orientation.
In this 2D example, you can see that for odd-sized shapes, an OBB can often give us a tighter bounding box and therefore more accurate collision detection.
Now, there are multiple ways of representing an OBB, but one convenient representation parametrizes the bounding box by its center and local x, y, and z coordinate axes. In other words, we have:
where the star indicates the center and the arrows indicate each local coordinate axes. An equivalent representation, known as the axes-extents representation, is to represent the OBB with its center, 3 directional (unit length) vectors for each axis, and then the extents, or magnitudes, of those vectors stored separately.
/// A fully defined OBB with arbitrary basis vectors
#[derive(Clone)]
pub struct Obb {
    pub center: Point3<f64>,
    /// The length of the x, y, and z basis vectors
    pub extents: Vector3<f64>,
    /// The x basis vector
    pub x: Vector3<f64>,
    /// The y basis vector
    pub y: Vector3<f64>,
    /// The z basis vector
    pub z: Vector3<f64>,
}
I should also mention that an OBB can be represented entirely as a 4x4 matrix. In row-major, block form, where $c$ is the center as a column vector:
$$ O = \begin{bmatrix} x & y & z & c \\ 0 & 0 & 0 & 1 \end{bmatrix} $$
The advantage of this is that to transform the Obb by a model matrix $M$, we can simply multiply the above Obb representation, $O$, by $M$ such that $O’ = M \times O$. So when we transform the visual representation of an object by multiplying matrices, we can move the bounding volume with it by doing the same thing!
For the interested reader, $O$ is essentially a change-of-basis matrix that converts world space points into the coordinate space of the OBB. In this new OBB-relative coordinate space, $\langle 1, 0, 0 \rangle$ is the OBB x-axis, $\langle 0, 1, 0 \rangle$ is the y-axis, and $\langle 0, 0, 1 \rangle$ is the z-axis.
Now given an arbitrary set of points (ie. from a 3D model), computing an OBB of best fit is not exactly a trivial task. I won’t go into the details too much here, but conceptually, we want our OBB’s $x$, $y$, and $z$ axes to be aligned with the 3 directions that have the greatest variance (spread) of the points. To see why, imagine how an axis-aligned bounding box in 2D would have to be scaled to encompass a long, diagonal line. One approach to computing this would be to use Principle Component Analysis (PCA). However, for greater robustness, the data points we most care about are those that lie on the “outside” of an object. In other words, we don’t want to be thrown off by “interior” outliers. So, what we want to do, is perform PCA on the exterior points of the model which can be computed by finding the Convex Hull. This is an NP-Hard problem in general, however, polynomial-time algorithms exist for the 2D and 3D cases.
This is not the approach I took. We can observe that, most artists create models that are axis-aligned in the local space the model is defined in. So we can define an AABB for our model in local space, and then convert it to an OBB. Moreover, using this approach, we can save memory by storing the OBB as an AABB in local space, and converting to an OBB when we do collision detection. Here’s one such implementation:
/// Creates an OBB by applying a world transformation matrix to an AABB
fn from_local_aligned(aabb: &Aabb, model: Matrix4<f64>) -> Self {
    let center = model.transform_point(aabb.center);
    let x = model.x.truncate();
    // x.magnitude() is the scale of the model in the x direction
    let ex = model.x.magnitude() * aabb.extents.x;
    let x = x.normalize();
    let y = model.y.truncate();
    let ey = model.y.magnitude() * aabb.extents.y;
    let y = y.normalize();
    let z = model.z.truncate();
    let ez = model.z.magnitude() * aabb.extents.z;
    let z = z.normalize();
    Self {
        center,
        extents: vec3(ex, ey, ez),
        x,
        y,
        z,
    }
}
Note this code doesn’t explicitly use the matrix construction we talked about earlier, but instead breaks up the matrix multiplication manually. In hindsight, I think I would prefer just storing an OBB as a matrix and creating accessor methods to get at each data component (center, axes, etc.) instead of a data structure with each component as a variable.
Now for the interesting part: how can we detect collisions between OBBs? We can apply the Separating Axis Theorem which states that two convex shapes are not intersecting if there is a separating hyperplane that can be drawn between them. So in 2D, two OBBs are not intersecting if we can draw a line between them. More concretely, a separating hyperplane will have an axis perpendicular to it such that the projection of all the points onto the axis doesn’t overlap. In 2D:
Since a point on a 1D line can be represented by a single number (intuitively this is the distance the point lies along said line), then we can determine whether the projected points overlap by considering the left-most and right-most points of each OBB. Recall that to project a point $p$ on a line, $v$ we can compute $r = p \cdot v / (v \cdot v)$ where $\cdot$ indicates the dot product and the result $r$ can be thought of as the distance the projected point lies along the line. In the case where $v$ is unit length, $v \cdot v = 1$ and this formula simplifies to $p \cdot v$.
So, projecting an OBB on a line amounts to the following code:
/// Requires `axis` is not 0 and is normalized
/// Gets the projected min and max projection coefficients of the OBB onto the axis
fn project_onto(&self, axis: &Vector3<f64>) -> (f64, f64) {
    // 8 corner points of the OBB
    let pts = [
        self.center
            + self.extents.x * self.x
            + self.extents.y * self.y
            + self.extents.z * self.z,
        self.center
            + self.extents.x * self.x
            + self.extents.y * self.y
            + self.extents.z * -self.z,
        self.center
            + self.extents.x * self.x
            + self.extents.y * -self.y
            + self.extents.z * self.z,
        self.center
            + self.extents.x * self.x
            + self.extents.y * -self.y
            + self.extents.z * -self.z,
        self.center
            + self.extents.x * -self.x
            + self.extents.y * self.y
            + self.extents.z * self.z,
        self.center
            + self.extents.x * -self.x
            + self.extents.y * self.y
            + self.extents.z * -self.z,
        self.center
            + self.extents.x * -self.x
            + self.extents.y * -self.y
            + self.extents.z * self.z,
        self.center
            + self.extents.x * -self.x
            + self.extents.y * -self.y
            + self.extents.z * -self.z,
    ];
    // let a_dot = axis.dot(*axis);
    // precondition that axis is normalized so we don't need to divide by a_dot
    let mut min = f64::MAX;
    let mut max = f64::MIN;
    for pt in pts {
        let r = axis.dot(pt.to_vec()); // / a_dot;
        min = min.min(r);
        max = max.max(r);
    }
    (min, max)
}
Then determining if an axis defines a separating hyperplane is straightforward:
    let (min1, max1) = self.project_onto(&axis);
    let (min2, max2) = other.project_onto(&axis);
    min1 > max2 || min2 > max1
There’s one final hiccup: How do we know how many, and what axes to test? It turns out, in 3D, we need to test at most 15 axes. We need to test an axis that is perpendicular to each face of both OBBs we are testing for collision (the $x$, $y$, and $z$ vectors of each OBB), and we need to test axes that are perpendicular to each pair of the aforementioned vectors. Recall we can use the cross product to compute a vector that’s orthogonal to two vectors.
So all together we have:
/// Determines if an axis is perpendicular to a plane that separates the two OBBs
/// (is a separating axis)
///
/// requires axis normalized
/// returns true if tests passes (no collision)
///
/// `axis` - either a tuple of an axis to test and `None`, or a tuple of an axis 
///     from this OBB, and an axis
/// from `other`'s OBB whose cross product is the axis to test
fn is_separating_axis(
    &self,
    other: &Self,
    axis: (Vector3<f64>, Option<Vector3<f64>>),
) -> bool {
    let (axis_a, axis_b) = axis;
    let axis = match axis_b {
        None => axis_a,
        Some(axis_b) => {
            let a = axis_a.cross(axis_b);
            if a.magnitude() < f64::EPSILON {
                // // axes are parallel, and lie in some plane P
                // // choose a new axis perpendicular to that plane
                // let n = axis_a
                //     .cross(self.center + axis_a - (other.center + axis_b));
                // if n.magnitude() < f64::EPSILON {
                //     return false;
                // }
                // n.normalize()
                // for greater robustness, we should test again with the
                // above commented out code, but I found returning false here 
                // works well enough
                return false;
            }
            a.normalize()
        }
    };
    let (min1, max1) = self.project_onto(&axis);
    let (min2, max2) = other.project_onto(&axis);
    min1 > max2 || min2 > max1
}
/// Returns true if there is a collision between this obb and `other`
fn collision(&self, other: &Self) -> bool {
    let axes = [
        // Test each face
        (self.x, None),
        (self.y, None),
        (self.z, None),
        (other.x, None),
        (other.y, None),
        (other.z, None),
        // Test combinations of axes
        (self.x, Some(other.x)),
        (self.x, Some(other.y)),
        (self.x, Some(other.z)),
        (self.y, Some(other.x)),
        (self.y, Some(other.y)),
        (self.y, Some(other.z)),
        (self.z, Some(other.x)),
        (self.z, Some(other.y)),
        (self.z, Some(other.z)),
    ];
    for a in axes {
        if self.is_separating_axis(other, a) {
            // exit early if we find a separating axis
            // if there exists a separating axis, then
            // the OBBs are not colliding
            return false;
        }
    }
    true
}
OBBs are relatively fast, and a huge improvement over bounding spheres or AABBs. However, unless you’re making a game where every object is rectangular, there are still plenty of cases where two OBBs might intersect despite there being plenty of space between the models.
It would be great if when we determine that two OBBs intersect, we investigate further and test each pair of triangles that make up the meshes of the objects in question. This is a foolproof method with utmost accuracy! However, there’s a reason things like OBBs and other bounding volumes exist: a mesh can have millions of primitives and testing each pair (which is $O(n^2)$) can be prohibitively expensive. We discuss more optimizations later in this post, but a common solution is to use a collision mesh. A collision mesh is a low-poly model that simplifies the visual model used for graphics such that it has significantly fewer primitives. This makes it more feasible to be used for collision detection. In Project Oort, all of the models I used were simple enough that I didn’t need collision meshes (also because it is probably faster for me to optimize code than try and make a collision mesh).
Our triangle-triangle intersection test is going to be the one outlined by Moller 1, so let’s dive in!
Let $a$ and $b$ be two triangles with vertices $a_0, a_1, a_2$ and $b_0, b_1, b_2$, respectively. Let $A$ and $B$ be the respective planes that the triangles lie on. Let $N_A$ and $N_B$ be the normal vectors of the respective planes $A$ and $B$. Observe that if all vertices of $b$ are on the same side of $A$, and none of the vertices lie on $A$, then we know that there is no overlap between $a$ and $b$.
So how can we test this? Recall (or take my word), that the equation of a plane defined by its normal vector $N$ is $N \cdot X + d = 0$ where $d$ is the signed distance of the point $X$ off of the plane. I believe a more familiar form introduced in a high school math class would be $ax + by + cz + d = 0$. This is exactly the same thing, in this case: $N = \langle a, b, c\rangle$ and $X = \langle x, y, z \rangle$.
Thus, our test begins by indicating no intersection for all triangles where $$ sign(N_B * a_0) = sign(N_B * a_1) = sign(N_B * a_2) \not = 0 \\ \text{OR} \\ sign(N_A * b_0) = sign(N_A * b_1) = sign(N_A * b_2) \not = 0 $$
In code, we have:
/// Tests if `b_verts` are all on the same side of the plane
/// defined by `pt_on_a` and `norm_a`
///
/// Returns a tuple of whether they are all on the same side
/// and the signed distance of each `b_vert` to the plane
fn plane_test(
    pt_on_a: &Point3<f64>,
    b_verts: &[Point3<f64>],
    norm_a: &Vector3<f64>,
) -> (bool, Vector3<f64>) {
    // N * x + d = 0
    // d = -N * x
    let d = dot(-1. * norm_a, pt_on_a.to_vec());
    let signed_dists = vec3(d, d, d)
        // project each of `b_verts` onto the norm of `a`
        + vec3(
            norm_a.dot(b_verts[0].to_vec()),
            norm_a.dot(b_verts[1].to_vec()),
            norm_a.dot(b_verts[2].to_vec()),
        );
    let all_same_side =
        signed_dists.x < 0. && signed_dists.y < 0. && signed_dists.z < 0.
            || signed_dists.x > 0.
                && signed_dists.y > 0.
                && signed_dists.z > 0.;
    (all_same_side, signed_dists)
}
fn moller_test(a_verts: &[Point3<f64>], b_verts: &[Point3<f64>]) -> bool {
    // compute normal defining plane that a lies on
    // compute this by taking cross product of two of the edges of the triangle
    let a_norm = (a_verts[2] - a_verts[0])
        .cross(a_verts[1] - a_verts[0])
        .normalize();
    // compute normal defining plane that b lies on
    let b_norm = (b_verts[2] - b_verts[0])
        .cross(b_verts[1] - b_verts[0])
        .normalize();
    let (b_same_side, b_dist_to_a) =
        Self::plane_test(&a_verts[0], b_verts, &a_norm);
    let (a_same_side, a_dist_to_b) =
        Self::plane_test(&b_verts[0], a_verts, &b_norm);
    if !b_same_side && !a_same_side {
        // perform more tests ...
    } else {
        false
    }
}
Now suppose that this first part of the test fails. Further suppose (for now), that the triangles are not coplanar. Similar to OBBs, we will be projecting points onto a line and testing the overlap of 1-dimensional intervals. It turns out, that for non-coplanar triangles, the only line, $L$, we need to test is aligned in the direction of the cross product of the normals of $A$ and $B$. So the direction of the line, $D$, is $N_A \times N_B$, and therefore, $L = P + tD$ for some point $P$ on the line and an arbitrary real $t$. The below figure from the paper 1 depicts the two possible situations:

Without loss of generality, suppose $a_0$ lies on one side of $B$ and $a_1$ and $a_2$ lie on the other side. Observe it will always be the case that one vertex lies on one side, and two vertices lie on the other side of the other triangle’s plane. In order to compare intervals on $L$, we want to find the $t$ values for the intersection points of edges $\overline{a_0a_1}$ and $\overline{a_0a_2}$ with $L$. To do this we first project each point onto the line. Let $q_i$ be the projection of point $a_i$ for $i$ either 0, 1, or 2. Then the $t$ value for the intersection of $\overline{a_0a_1}$ and $L$ can be computed as $$t_0 = q_1 + \frac{d_1}{d_1 - d_0}(q_0 - q_1)$$ where $d_i$ is the signed distance of the original point $a_i$ from $B$. Note that we cannot just project the vertices of each triangle onto the line as we did with OBBs. This is because, we can have a situation like the right side in the above figure, but the projected corner vertices cause the intervals to overlap.
The situation is depicted below:
$q_0 - q_1$ is the vector from $q_1$ to $q_0$. Keep in mind, this is a one-dimensional quantity, nevertheless, we can view it as a vector where its absolute value is its magnitude and its sign is its direction. Since triangle $a_1r_1s$ and $a_0r_0s$ are similar, then to find $s$ we add to $q_1$ a distance of $\frac{d_1}{d_1 - d_0}$ along $\overline{q_0q_1}$. This yields the earlier formula.
Before we look at the code, one can show that the result of the overlap test does not change if we project $L$ onto the coordinate axes with which it is most closely aligned. To find which axis that is, just take the directional component of $D$ with the greatest magnitude. So if the x-component of $D$ has the greatest absolute value, then we just need to check for overlap in the x coordinates of the vertices of $a$ and $b$.
So we have, as part of moller_test():
    let line = a_norm.cross(b_norm).normalize();
    // index of component with the greatest absolute value
    let idx = Self::abs_max_dim(&line);
    // coordinates of a "projected" onto L
    let a_onto_line =
        vec3(a_verts[0][idx], a_verts[1][idx], a_verts[2][idx]);
    // coordinates of b "projected" onto L
    let b_onto_line =
        vec3(b_verts[0][idx], b_verts[1][idx], b_verts[2][idx]);
    // get t values for intersection of edges of a with L
    let a_int = Self::get_interval(
        &a_onto_line,
        &a_dist_to_b,
        Self::opp_vert(&a_dist_to_b),
    );
    // get t values for intersection of edges of b with L
    let b_int = Self::get_interval(
        &b_onto_line,
        &b_dist_to_a,
        Self::opp_vert(&b_dist_to_a),
    );
    Self::interval_overlap(a_int, b_int)
/// Returns tuple of indices of the projected points such that
/// the first index is the index for the opposite vertex
/// and the remaining indices are the indices of the vertices
/// on the same side of the plane
fn opp_vert(v: &Vector3<f64>) -> (usize, usize, usize) {
    if v[0] * v[1] > 0. {
        (2, 0, 1)
    } else if v[0] * v[2] > 0. {
        (1, 0, 2)
    } else {
        (0, 1, 2)
    }
}
fn get_t(
    verts_on_l: &Vector3<f64>,
    dist_to_plane: &Vector3<f64>,
    opposite_idx: usize,
    vert_idx: usize,
) -> f64 {
    // t = q_1 + d_1 / (d_1 - d_0) * (q_0 - q_1)
    verts_on_l[vert_idx]
        + (verts_on_l[opposite_idx] - verts_on_l[vert_idx])
            * dist_to_plane[vert_idx]
            / (dist_to_plane[vert_idx] - dist_to_plane[opposite_idx])
}
fn get_interval(
    project_on_l: &Vector3<f64>,
    signed_dists: &Vector3<f64>,
    vert_indices: (usize, usize, usize),
) -> (f64, f64) {
    // vert_indices.0 is the index of the opposite vertex
    (
        Self::get_t(
            project_on_l,
            signed_dists,
            vert_indices.0,
            vert_indices.1,
        ),
        Self::get_t(
            project_on_l,
            signed_dists,
            vert_indices.0,
            vert_indices.2,
        ),
    )
}
/// Returns `true` if intervals `a_t` and `b_t` overlap
fn interval_overlap(a_t: (f64, f64), b_t: (f64, f64)) -> bool {
    let a_t = Self::order_interval(a_t);
    let b_t = Self::order_interval(b_t);
    a_t.0 - f64::EPSILON <= b_t.0 && a_t.1 + f64::EPSILON >= b_t.0
        || a_t.0 - f64::EPSILON <= b_t.1 && a_t.1 + f64::EPSILON >= b_t.1
        || b_t.0 - f64::EPSILON <= a_t.0 && b_t.1 + f64::EPSILON >= a_t.0
        || b_t.0 - f64::EPSILON <= a_t.1 && b_t.1 + f64::EPSILON >= a_t.1
}
Now we finally return to the case when the two triangles are coplanar (all signed distances are 0). In this case, we are left with a 2D triangle intersection problem. We can compute this by performing line intersection tests for each pair of edges. This boils down to a line-segment intersection test
fn line_intersection_2d(
    start_a: &Point2<f64>,
    end_a: &Point2<f64>,
    start_b: &Point2<f64>,
    end_b: &Point2<f64>,
) -> bool {
    let a = end_a - start_a;
    let b = end_b - start_b;
    // a 2D "cross-product"
    let cross_2d =
        |a: &Vector2<f64>, b: &Vector2<f64>| a.x * b.y - a.y * b.x;
    let rs = cross_2d(&a, &b);
    let qpr = cross_2d(&(start_b - start_a), &a);
    if rs.abs() < f64::EPSILON && qpr.abs() < f64::EPSILON {
        // all of the points lie on the same line
        // project onto `a` and test overlap
        let l = a.normalize();
        let t_a = (dot(start_a.to_vec(), l), dot(end_a.to_vec(), l));
        let t_b = (dot(start_b.to_vec(), l), dot(end_b.to_vec(), l));
        return Self::interval_overlap(t_a, t_b);
    } else if rs.abs() < f64::EPSILON {
        // the two edges are parallel
        return false;
    }
    let t = cross_2d(&(start_b - start_a), &b) / rs;
    let u = qpr / rs;
    // t and u between -e and 1 + e (roughly 0 and 1)
    (-f64::EPSILON..=1. + f64::EPSILON).contains(&t)
        && (-f64::EPSILON..=1. + f64::EPSILON).contains(&u)
}
fn triangle_intersection(
        a_verts: &[Point2<f64>],
        b_verts: &[Point2<f64>],
    ) -> bool {
    Self::line_intersection_2d(
        &a_verts[0],
        &a_verts[1],
        &b_verts[0],
        &b_verts[1],
    ) || Self::line_intersection_2d(
        &a_verts[0],
        &a_verts[1],
        &b_verts[0],
        &b_verts[2],
    ) // ... continue for all pairs (9 tests total)
}
So finally, we have:
fn moller_test(a_verts: &[Point3<f64>], b_verts: &[Point3<f64>]) -> bool {
    let a_norm = (a_verts[2] - a_verts[0])
        .cross(a_verts[1] - a_verts[0])
        .normalize();
    let b_norm = (b_verts[2] - b_verts[0])
        .cross(b_verts[1] - b_verts[0])
        .normalize();
    let (b_same_side, b_dist_to_a) =
        Self::plane_test(&a_verts[0], b_verts, &a_norm);
    let (a_same_side, a_dist_to_b) =
        Self::plane_test(&b_verts[0], a_verts, &b_norm);
    if !b_same_side && !a_same_side {
        if Self::is_coplanar(&b_dist_to_a) {
            return Self::coplanar_test(a_norm, a_verts, b_verts);
        }
        let line = a_norm.cross(b_norm).normalize();
        let idx = Self::abs_max_dim(&line);
        let a_onto_line =
            vec3(a_verts[0][idx], a_verts[1][idx], a_verts[2][idx]);
        let b_onto_line =
            vec3(b_verts[0][idx], b_verts[1][idx], b_verts[2][idx]);
        let a_int = Self::get_interval(
            &a_onto_line,
            &a_dist_to_b,
            Self::opp_vert(&a_dist_to_b),
        );
        let b_int = Self::get_interval(
            &b_onto_line,
            &b_dist_to_a,
            Self::opp_vert(&b_dist_to_a),
        );
        Self::interval_overlap(a_int, b_int)
    } else {
        false
    }
}
Once a collision is detected, our game engine needs to resolve the collision by doing something meaningful. For some applications, it suffices to just move all the dynamic objects in the scene, test collisions, and then roll back the motion of any objects that have collided. Some objects, such as an arrow shot from a bow, can be made static once a collision is detected.
In the case of Project Oort, it was not sufficient to just “undo” the movement of a colliding object. For example, an object’s velocity might have an x and y component. The object might collide with something along the x-axis, however, it should still keep moving along the y-axis.
The (relatively simple) physics engine of Project Oort is complex enough to warrant another post, so I won’t go too much into the details here. The gist is, that when two objects collide, we fetch all of the normal vectors of any triangle where an intersection was detected. We then compute the average of these normals for each colliding body. So if body $A$ collides with body $B$, $N_A$ is the averaged normal vector of all intersecting triangles of $A$, and $N_B$ is the averaged normal vector of all intersecting triangles of $B$ then we can move $A$ along the direction of $N_B$ and vice versa.
Very briefly, what I did with this information was apply the Impulse-Momentum theorem to compute a force acting along each normal vector. So in the above example, $B$ is acted on by a force in the direction $A$. Since there can be multiple collisions from multiple directions along with other forces such as the force of thrust for the starfighters, all forces are added together (as vectors) to get the total force, which then can be used to compute the linear and rotational acceleration.
The core idea is when testing collisions, we return HitData which
encapsulates the estimated point of impact and normal vector for a collision.
pub struct HitData {
    /// The point of collision and the impact normal on the first collider's mesh
    pub pos_norm_a: (Point3<f64>, Vector3<f64>),
    /// The point of collision (and the normal computed from it) on the second collider's mesh
    pub pos_norm_b: (Point3<f64>, Vector3<f64>),
}
We compute this from a list of intersecting triangles from each object:
/// Hit data from triangles that a high-accuracy test determines are colliding
///
/// Requires that for all triangles in `colliders_a`, there exists a triangle that 
/// intersects it in `colliders_b` and vice versa
fn hit_from_colliders(
    colliders_a: &[&Triangle<f32>],
    colliders_b: &[&Triangle<f32>],
    model_a: &Matrix4<f64>,
    model_b: &Matrix4<f64>,
) -> HitData {
    let avg_pt_norm = |colliders: &[&Triangle<f32>], model: &Matrix4<f64>| {
        let mut avg_point = point3(0f64, 0., 0.);
        let mut avg_norm = vec3(0f64, 0., 0.);
        for t in colliders {
            avg_point += t.centroid().to_vec();
            let sum: Vector3<f32> = t.norms().into_iter().sum();
            avg_norm += sum.cast().unwrap() / 3f64;
        }
        avg_point /= colliders_a.len() as f64;
        avg_norm /= colliders_a.len() as f64;
        let norm_trans = Matrix3::new(
            model.x.x, model.x.y, model.x.z, model.y.x, model.y.y, model.y.z,
            model.z.x, model.z.y, model.z.z,
        );
        (
            model.transform_point(avg_point),
            (norm_trans * avg_norm).normalize(),
        )
    };
    HitData {
        pos_norm_a: avg_pt_norm(colliders_a, model_a),
        pos_norm_b: avg_pt_norm(colliders_b, model_b),
    }
}
Finally, we use this information to make updates to an object’s velocities in the physics engine:
    // ...
    self.vel -= impulse / body.mass * norm;
    self.rot +=
        body_inertia.invert().unwrap() * (impulse * norm).cross(body_lever)
Everything we talked about so far can give us a pretty decent collision detection algorithm. With the knowledge we developed up to this point, one might come up with the following approach:
A key idea in real-time collision detection is the concept of multiple passes. We use cheaper and less accurate methods of collision detection first to reject as many non-colliding objects as we can before moving on to more accurate methods for the cases we need to. The above proposal has 2 passes. Now OBBs aren’t exactly cheap, so let’s add another pass:
While this proposed algorithm is accurate, it’s not fast enough. The remainder of the post will discuss how we can improve upon it, and we’ll generate a 224x speedup.
Let’s further optimize step 1. What is the running time of checking the distance of an object to every other object? Well, that’s $O(N)$ for $N$ objects. Therefore, testing the distance of all objects with every other object must be $O(N^2)$. Is that the best we can do?
Well, consider the following 2D scenario where we are testing collisions with the red circle in the top left:
Intuitively, we can easily throw out all objects that don’t overlap with the quadrant that the red circle belongs to.
We can then split the quadrant we are left with into 4 sub-quadrants and repeat the process until some stopping criteria are met (such as less than $K$ objects left in the quadrant for some arbitrarily chosen $K$).
What is the time complexity of this approach? Assuming determining which objects overlap with the quadrant of the object we are testing takes constant time, we get $O(N \log N)$ since each step reduces the number of objects we need to consider by $\frac34$.
What we just described is a type of space partition tree, in particular, a quadtree in 2D and an octree in 3D. To achieve constant lookup, each node in the tree must keep track of the objects that belong in it. What if we have an object that intersects multiple quadrants? There are two general approaches to this: put the object in all quadrants that intersect it or allow objects to belong to non-leaf nodes as well as leaf nodes. The former yields faster lookup times in exchange for increased complexity when updating and inserting objects; in the example above, the big circle in the center could be eliminated in step 3. For Project Oort, I opted for the latter since every object is dynamic. This is what we’ll focus on for the rest of the discussion.
Let’s see another example, this time we’ll also display the tree structure.
In the above example, the right side of the diagram depicts the tree storing the objects. The numbers to the right side of a node correspond to the ids of each object that belongs to it. If every object has a reference to the tree node that owns it, then identifying which objects we must check simply requires collecting all objects as we walk up the tree. In this example, we must further examine objects 5, 7, and 1 while we can immediately reject objects 2, 4, and 3.
Now consider we are testing object 7. Must we also check objects 6 and 5? While it is true that they can collide, we need not check them because we already check for a 6-7 and 5-7 intersection when we get possible colliders of 6 and 5, respectively.
In 3D, the only difference is a node will have 8 children instead of 4. Each node will be defined by a center and half-width.
/// A node in an octree holding weak references to objects that fit in it and
/// (optionally) 8 child nodes
///
/// Each node is a `2 * h_width` x `2 * h_width` x `2 * h_width` box centered around `center`
pub(super) struct ONode {
    center: Point3<f64>,
    h_width: f64, // dist to center to any axis-aligned side of AABB
    objects: ObjectList,
    // in this implementation, a node either had all 8 children, or is a leaf
    children: Option<[Rc<RefCell<ONode>>; 8]>, // ith bit in children index is 1 if ith coordinate is > center
    // parent reference so we can walk up the tree
    parent: Weak<RefCell<ONode>>,
    self_ref: Weak<RefCell<ONode>>,
    // index this node is in its parent's children array
    self_index: u8,
}
We’ll cleverly index the children so that the $i^\text{th}$ bit of the index is 1 if
the $i^\text{th}$ coordinate of the child’s
center is greater than the $i^\text{th}$ coordinate of the parent’s center.
Consider child index 2. In binary, 2 is 010. Therefore, this child’s center has a smaller
x value, greater y value, and smaller z value.
| z is greater? | y is greater? | x is greater? | 
|---|---|---|
| 0 | 1 | 0 | 
For child index 4, we have that its z coordinate is greater, but its x and y
coordinates are less than the centers’ since 4 == 0b100.
| z is greater? | y is greater? | x is greater? | 
|---|---|---|
| 1 | 0 | 0 | 
This also makes it easy to determine what octant an object should belong:
/// Gets octant index or `None` if object is in multiple octants
///
/// The ith bit of the index is 1 if the ith coordinate of the object is > the
/// ith coordinate of the center
fn get_octant_index(
    center: &Point3<f64>,
    h_width: f64,
    obj: &Rc<RefCell<Object>>,
) -> Option<u8> {
    let o = obj.borrow().center() - center;
    let mut index = 0u8;
    for i in 0..3 {
        if o[i].abs() < obj.borrow().radius()
            || o[i].abs() + obj.borrow().radius() > h_width
        {
            // if radius is greater than distance to center -> object is in multiple octants
            // if radius + distance to center is greater than half width
            // -> object does not fit in this node (goes beyond the boundaries)
            return None;
        } else if o[i] > 0. {
            // if ith coordinate is greater than the center's ith coordinate:
            // set the ith bit in the index to 1
            index |= 1 << i;
        }
    }
    Some(index)
}
To insert an object, into a node:
pub(super) fn insert(&mut self, obj: &Rc<RefCell<Object>>) {
    if self.children.is_none()
        && self.objects.len() + 1 < Self::MAX_OBJS_PER_LEAF
    {
        // step 1
        obj.borrow_mut().octree_cell = self.self_ref.clone();
        self.objects.push(Rc::downgrade(obj));
        return;
    } else if self.children.is_none() {
        // step 2
        self.children = Some(Self::create_children(
            &self.self_ref,
            &self.center,
            self.h_width,
        ));
        self.objects = Self::split_into_children(
            self.children.as_mut().unwrap(),
            &mut self.objects,
            &self.center,
            self.h_width,
        );
        // continue to step 3
    }
    // step 3
    match Self::get_octant_index(&self.center, self.h_width, obj) {
        Some(idx) => self.children.as_mut().unwrap()[idx as usize]
            .borrow_mut()
            .insert(obj),
        None => {
            obj.borrow_mut().octree_cell = self.self_ref.clone();
            self.objects.push(Rc::downgrade(obj));
        }
    }
}
/// Splits objects in `objects` into octree nodes of `children`
///
/// Returns the new object list for the current node
fn split_into_children(
    children: &mut [Rc<RefCell<Self>>; 8],
    objects: &mut [Weak<RefCell<Object>>],
    center: &Point3<f64>,
    h_width: f64,
) -> Vec<Weak<RefCell<Object>>> {
    let mut new_objs = Vec::new();
    for obj in objects
        .iter()
        .filter(|x| x.strong_count() > 0)
        .map(|x| x.upgrade().unwrap())
    {
        match Self::get_octant_index(center, h_width, &obj) {
            Some(idx) => {
                children[idx as usize].borrow_mut().insert(&obj);
            }
            None => new_objs.push(Rc::downgrade(&obj)),
        }
    }
    new_objs
}
/// Creates child nodes for the given parent
fn create_children(
    parent: &Weak<RefCell<Self>>,
    // parent center
    parent_c: &Point3<f64>,
    // parent half-width
    parent_h: f64,
) -> [Rc<RefCell<Self>>; 8] {
    let mut res = arr_macro::arr![Rc::new(RefCell::new(ONode::empty())); 8];
    // child half-width
    let step = parent_h / 2.0;
    for (child, idx) in res.iter_mut().zip(0u32..8) {
        let step_x = if idx & 1 == 1 { step } else { -step };
        let step_y = if idx & 2 > 0 { step } else { -step };
        let step_z = if idx & 4 > 0 { step } else { -step };
        child.borrow_mut().center = parent_c + vec3(step_x, step_y, step_z);
        child.borrow_mut().h_width = step;
        child.borrow_mut().parent = parent.clone();
        child.borrow_mut().self_ref = Rc::downgrade(child);
        child.borrow_mut().self_index = idx as u8;
    }
    res
}
To update an object in the tree, we first fetch the current node it belongs to. Then:
/// Indicates that `obj` has changed and should be re-evaluated for placement in the octree
///
/// If `obj` no longer fits in the octree, it remains in the root node
///
/// Requires that `obj` is already in this node
pub(super) fn update(&mut self, obj: &Rc<RefCell<Object>>) {
    if let Some(parent) = self.parent.upgrade() {
        if Self::get_octant_index(
            &parent.borrow().center,
            parent.borrow().h_width,
            obj,
        ) != Some(self.self_index)
        {
            // if the object no longer should be in this node
            self.objects.retain(|o| {
                o.strong_count() > 0
                    && !Rc::ptr_eq(&o.upgrade().unwrap(), obj)
            });
            parent.borrow_mut().insert(obj);
            // insert will not "bubble up" an object that moved out of it
            // Let's say the object belongs in the grandparent, insert will
            // leave the object in the parent
            //
            // so if the object is held by the parent (not moved to a sibling),
            // call update to make sure it should not bubble up to the grandparent
            let obj_holder = obj.borrow().octree_cell.upgrade().unwrap();
            if Rc::ptr_eq(&obj_holder, &parent) {
                parent.borrow_mut().update(obj);
            }
        }
    }
    if let Some(child_idx) =
        Self::get_octant_index(&self.center, self.h_width, obj)
    {
        // if the object should now be in a child node
        if let Some(children) = self.children.as_mut() {
            self.objects.retain(|o| {
                o.strong_count() > 0
                    && !Rc::ptr_eq(&o.upgrade().unwrap(), obj)
            });
            children[child_idx as usize].borrow_mut().insert(obj);
        }
    }
    // otherwise do nothing (leave the object in this node)
}
Finally, as discussed earlier, to get colliders we just walk up the tree. In this implementation, the Octree also performs the bounding sphere collisions.
/// Gets all objects that have overlapping bounding spheres as
/// `test` object that is a parent of `test_obj` or contained within
/// `node`
///
/// `node` - the containing octree cell of `test_obj`
fn get_parent_colliders(
    node: &Rc<RefCell<Self>>,
    test_obj: &Rc<RefCell<Object>>,
) -> Vec<Rc<RefCell<Object>>> {
    // gets all objects owned by `node` that collide with
    // `test_obj`
    let mut v = Self::get_self_colliders(node, test_obj);
    let mut n = node.borrow().parent.clone();
    // walk up the tree, iterative style
    while let Some(parent) = n.upgrade() {
        // remove dead references
        parent.borrow_mut().objects.retain(|x| x.strong_count() > 0);
        // check all objects in the parent that collide:
        for obj in
            parent.borrow().objects.iter().map(|x| x.upgrade().unwrap())
        {
            if obj.borrow().bounding_sphere_collide(&*test_obj.borrow()) {
                v.push(obj);
            }
        }
        // move to grandparent
        n = parent.borrow().parent.clone();
    }
    v
}
The real bottleneck of the proposed algorithm is step 3: checking all intersecting triangles. This level of granularity is key to our general, accurate, collision detection algorithm yet it’s also really slow. Even low-poly models can have a few thousand primitives. Checking every pair of triangles of two models can easily require upwards of 25,000,000 triangle-triangle checks. What we need is a way to quickly narrow down how many pairs of triangles we need to examine.
Consider the following 2D example of a car hitting a box:
The bounding boxes of the car and box are overlapping, but, intuitively, there’s no reason for us to check triangle-triangle collisions between the car’s rear wheel and the box. Perhaps we only check collisions with triangles that are near enough to collide? In other words: we could put a bounding sphere around each triangle. But that still doesn’t reduce the number of checks we need to perform, it just adds a slightly faster early-out condition. Perhaps we can use an Octree again? That’s moving in the right direction, but recall that when using an Octree an object can belong to potentially many nodes which means it has to be checked multiple times. Here’s another idea: what if we subdivide each bounding box into smaller bounding boxes? So after determining that the root bounding box collides, we now check the following bounding boxes:

We then recurse, throwing out large chunks of the car as we determine there is no collision in that area:

We repeat this process until some stopping condition is met, such as the amount of triangles contained within a leaf node is too low, or the tree depth is too great. What we just described is a type of bounding volume hierarchy. Specifically, in this case, an AABB hierarchy.
We’ll implement a 3D AABB hierarchy. When we perform the bounding box intersections, we’ll convert them to OBBs so we can transform them the same way the object is transformed, but for simplicity, we’ll construct the tree using AABBs.
Given a set of triangles, we will construct a BVH node by first computing an AABB that can encompass all of the triangles. This will be the largest bounding volume for the node. If we should continue splitting, then we will identify which axis has the greatest spread of data points and split the triangles in half along that direction to form two child nodes.
With the car, we get something like this after a few iterations:



In code, we have something like the following:
fn new(
    triangles: Vec<Triangle<T>>,
    recursion_depth: u32,
    stop: TreeStopCriteria,
) -> Self {
    let volume = BoundingVolume::Aabb(aabb_from_triangles(&triangles));
    if stop.should_stop(triangles.len(), recursion_depth) {
        // we should not continue splitting
        Self {
            left: None,
            right: None,
            volume,
            triangles: Some(triangles),
        }
    } else {
        let split = largest_extent_index(&volume);
        // split along axis `split`
        // `split = 0 => x-axis`
        // `split = 1 => y-axis`
        // `split = 2 => z-axis`
        Self::with_split(triangles, split, volume, recursion_depth, stop)
    }
}
/// Returns an AABB that fully contains `triangles`
fn aabb_from_triangles<T: BaseFloat>(triangles: &[Triangle<T>]) -> Aabb {
    let v: Vec<Point3<T>> = triangles
        .iter()
        .flat_map(|s| s.verts().into_iter())
        .collect();
    Aabb::from(&v)
}
impl Aabb {
    /// Computes an AABB from points
    /// `points` - local space point cloud
    /// `model` - world space transformation
    pub fn from<T: BaseNum>(points: &[Point3<T>]) -> Self {
        let mut mins = vec3(f64::MAX, f64::MAX, f64::MAX);
        let mut maxs = vec3(f64::MIN, f64::MIN, f64::MIN);
        for pt in points {
            let pt = pt.cast().unwrap();
            mins.x = mins.x.min(pt.x);
            mins.y = mins.y.min(pt.y);
            mins.z = mins.z.min(pt.z);
            maxs.x = maxs.x.max(pt.x);
            maxs.y = maxs.y.max(pt.y);
            maxs.z = maxs.z.max(pt.z);
        }
        let center = (mins + maxs) / 2.0;
        let extents = maxs - center;
        Self {
            center: point3(center.x, center.y, center.z),
            extents,
        }
    }
    // ...
}
We take the dimension to split on as the dimension with the greatest extent in the node’s bounding box. So if the bounding box’s width is greater than its height and depth, we split it on the x-axis.
/// Gets the index of the largest extent of `aobb`
fn largest_extent_index(aobb: &BoundingVolume) -> usize {
    let mut idx = 0;
    let mut max_extents = f64::MIN;
    for i in 0..3 {
        if aobb.extents()[i] > max_extents {
            max_extents = aobb.extents()[i];
            idx = i;
        }
    }
    idx
}
Then we put each triangle into a child based on whether the triangle’s centroid is greater or less than the node’s bounding box in the dimension we are splitting on.
/// Creates a new internal `BVHNode` with bounding volume `volume`
///
/// It's children will be given triangles divided based on being less than or greater 
/// than `volume`'s center along the `split` axis. `split` of `0` indicates the `x`
/// coordinates are being divided, whereas `split` of `2` are the `z` coordinates.
fn with_split(
    triangles: Vec<Triangle<T>>,
    split: usize,
    volume: BoundingVolume,
    rec_depth: u32,
    stop: TreeStopCriteria,
) -> Self {
    let mut left = Vec::<Triangle<T>>::new();
    let mut right = Vec::<Triangle<T>>::new();
    // divide each triangle among the node's soon to be children
    for tri in triangles {
        if tri.centroid()[split] < volume.center()[split] {
            left.push(tri);
        } else {
            right.push(tri);
        }
    }
    if left.is_empty() || right.is_empty() {
        // if all triangles are on the same side, stop growing
        // this can happen if the node contains a really large triangle
        left.append(&mut right);
        // left now contains all the node's triangles
        Self {
            left: None,
            right: None,
            volume,
            triangles: Some(left),
        }
    } else {
        // recurse
        Self {
            left: Some(Box::new(Self::new(left, rec_depth + 1, stop))),
            right: Some(Box::new(Self::new(right, rec_depth + 1, stop))),
            volume,
            triangles: None,
        }
    }
}
In contrast to our Octree implementation, the triangles are not going to move relative to each other. In other words, we don’t have any animated models. If we did, some more work would be required to handle animations.
So the last thing we have to worry about is determining which triangles to check.
If two bounding boxes collide, we then test collisions between the smaller bounding box and the two sub-volumes of the larger bounding box. Of course, this isn’t the only way to descend the hierarchy, however in practice it’s a pretty good heuristic.2 If one bounding box has no children, then we subdivide the non-leaf bounding volume. We keep repeating these steps until we are left with two leaf nodes. The triangles we have to check are the triangles that belong to both leaf nodes.
/// Descends the collision hierarchy, descending into the largest
/// nodes first.
///
/// `on_both_leaf` - function called when two colliding nodes are leaf nodes
/// The first parameter is always a descendant of `self` and the second parameter
/// is the colliding descendant of `other`
fn descend_heirarchy<F>(
    &self,
    self_transform: &Matrix4<f64>,
    other: &Self,
    other_transform: &Matrix4<f64>,
    mut on_both_leaf: F,
) where
    F: FnMut(&Self, &Self),
{
    // stack of tuples of 
    // (bounding box belonging to the receiver node,
    //      bounding box belong to the other node)
    let mut stack = VecDeque::<(&Self, &Self)>::new();
    // push the top level bounding boxes
    stack.push_front((self, other));
    while !stack.is_empty() {
        let (a, b) = stack.pop_front().unwrap();
        if !a.volume.is_colliding(
            Some(self_transform),
            &b.volume,
            Some(other_transform),
        ) {
            continue;
        }
        if a.is_leaf() && b.is_leaf() {
            // execute call back when both nodes are leaves.
            // There can be multiple such colliding leaves
            // so don't quit early
            on_both_leaf(a, b);
        } else if a.should_descend(b) {
            // descend a
            // so add (a.right, b) and (a.left, b) to the
            // stack
            a.right
                .as_ref()
                .map(|x| stack.push_front((&*x, b)))
                .unwrap();
            a.left.as_ref().map(|x| stack.push_front((&*x, b))).unwrap();
        } else {
            // descend b
            // so add (a, b.right) and (a, b.left)
            b.right
                .as_ref()
                .map(|x| stack.push_front((a, &*x)))
                .unwrap();
            b.left.as_ref().map(|x| stack.push_front((a, &*x))).unwrap();
        }
    }
}
/// Returns `true` if we should descend this BVH hierarchy, 
/// otherwise `false` to indicate we should descend 
/// `other` during a collision query
#[inline]
fn should_descend<F: BaseFloat>(&self, other: &BVHNode<F>) -> bool {
    !self.is_leaf()
        && (self.volume.vol() > other.volume.vol() || other.is_leaf())
}
Together, these two data structures provide a massive speedup over the naive approach. But we can do better!
Let’s consider the following pseudocode for checking triangle collisions:
let mut a_colliding_triangles = Vec::new();
let mut b_colliding_triangles = Vec::new();
for a in a_triangles {
    for b in b_triangles {
        if a.collide(b) {
            a_colliding_triangles.push(a);
            b_colliding_triangles.push(b);
        }
    }
}
Notice that each iteration does not depend on any other iterations. This is what we call an embarrassingly parallel algorithm! We can even rewrite this code to use a single loop instead of two:
for i in 0 .. a_triangles.len() + b_triangles.len() {
    let a_index = i / a_triangles.len();
    let b_index = i % a_triangles.len();
    let a = &a_triangles[a_index];
    let b = &b_triangles[b_index];
    if a.collide(b) {
        a_colliding_triangles.push(a);
        b_colliding_triangles.push(b);
    }
}
If we had the compute to run all a_triangles.len() + b_triangles.len()
checks at once, we could bring down the running time of checking triangles
tremendously! While we can’t perform all the checks in parallel, we can perform
a huge number of them at once by employing our GPU!
There are many frameworks for performing parallel computation on specialized hardware, but I used the compute shaders in OpenGL since I already had the infrastructure written for them from implementing Forward+ rendering.
In OpenGL, when we execute a compute shader, we must provide the size of a
work group and how many to execute. A work group has an x, y, and z dimension
that we can specify, however, we will only use x and y dimensions since our data
is two-dimensional (a_triangles index and b_triangles index). Within a work group,
a compute shader defines a local size which is the number of shader invocations
processed by the work group. So a local size of (x = 8, y = 8, z = 1) and
a work group size of (x = 10, y = 10, z = 1) yields 6400 total shader invocations.
Each invocation within a workgroup will be executed concurrently and they
can communicate via shared variables. Different work groups can be executed
in any order, and can’t easily communicate. Synchronization within a work group
can be achieved via atomic loads or stores or barriers. Each invocation within a work group
is given a unique id. This can be accessed within the shader using the variable
gl_LocalInvocationID. Furthermore, each work group has a unique id readable from
gl_WorkGroupID. Together, this gives a unique id for every invocation among all
work groups, accessible from gl_GlobalInvocationID.
The general structure is that we use a shader storage buffer to pass a_triangles,
and b_triangles to the GPU. In the shader, we use the global invocation id to
determine which two triangles to test.
struct Triangle {
    vec4 v[4];
};
layout(std430, binding = 5) readonly buffer ATriangles {
    Triangle a_triangles[];
};
layout(std430, binding = 6) readonly buffer BTriangles {
    Triangle b_triangles[];
};
layout(std430, binding = 7) writeonly buffer Collisions {
    // array of a_traingles followed by b_triangles
    // Assumes that the buffer is sent to the GPU zeroed out
    // Each element is not 0 if its corresponding triangle has a collision
    uvec4 out_buffer[];
};
layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in;
A triangle only needs 3 vec3s, however, we must take into account the
alignment requirements specified in
Section 7.6.2.2 of the OpenGL Spec.
In it, it specifies how a vec3 is treated as a vec4 and an array of vectors
or scalars follows the same rule (ie. int[3] is treated as int[4] and
vec4[3] is treated as vec4[4]).
I chose to define Triangle such that it requires no padding so the shader Triangle definition matches the definition in Rust.
#[derive(Clone, Copy, Debug)]
#[repr(C)]
#[repr(align(64))]
struct ShaderTriangle {
    _a: [f32; 4],
    _b: [f32; 4],
    _c: [f32; 4],
    _d: [f32; 4],
}
Each shader invocation will examine the two triangles specified by the
invocation’s unique id, and if there’s a collision, write a 1 to a buffer
in the locations representing each triangle.
void main() {
    uvec2 location = gl_GlobalInvocationID.xy;
    // check if ID is valid since we cannot execute less invocations
    // than specified by a single work group
    if (location.x < a_triangles.length() && location.y < b_triangles.length()) {
        mollerTriangleTest(location);
    }
}
// Moller's Triangle-Triangle interval overlap method
void mollerTriangleTest(uvec2 location) {
    Triangle a = a_triangles[location.x];
    Triangle b = b_triangles[location.y];
    // if collision
    {
        // indicate triangle a has a collision
        out_buffer[location.x] = uvec4(1);
        // indicate triangle b has a collision
        out_buffer[a_triangles.length() + location.y] = uvec4(1);
    }
}
Note that writing to the output buffer is technically a data race. However, this is OK because all invocations write the same thing, the buffer is zeroed before dispatched, and we only care if an element is nonzero and not what’s stored in it.
Now the CPU code for a collision check just needs to format the data correctly, compute the number of work groups, dispatch the compute shader, and identify colliding triangles from the output buffer:
fn collide(
        &self,
        a_triangles: &[Triangle<f32>],
        a_mat: &Matrix4<f64>,
        b_triangles: &[Triangle<f32>],
        b_mat: &Matrix4<f64>,
    ) -> Option<Hit> {
    let a_in_triangles: Vec<ShaderTriangle> = a_triangles
        .iter()
        .map(Self::get_triangle_to_mat_func(*a_mat))
        .collect();
    let b_in_triangles: Vec<ShaderTriangle> = b_triangles
        .iter()
        .map(Self::get_triangle_to_mat_func(*b_mat))
        .collect();
    let a = ComputeInputBuffer {
        buf: ssbo::Ssbo::create_static(&a_in_triangles),
        len: a_triangles.len() as u32,
    };
    let b = ComputeInputBuffer {
        buf: ssbo::Ssbo::create_static(&b_in_triangles),
        len: b_triangles.len() as u32,
    };
    // map index into a_triangles onto x dimension
    // number of work groups is then a.len() / WORK_GROUP_SIZE,
    // rounded up
    let work_groups_x = ((a.len
        + a.len % TriangleTriangleGPU::WORK_GROUP_SIZE)
        / TriangleTriangleGPU::WORK_GROUP_SIZE)
        .max(1);
    let work_groups_y = ((b.len
        + b.len % TriangleTriangleGPU::WORK_GROUP_SIZE)
        / TriangleTriangleGPU::WORK_GROUP_SIZE)
        .max(1);
    let output = self.execute_collision_compute(
        &a,
        &b,
        work_groups_x,
        work_groups_y,
    );
    let mut a_colliders = Vec::new();
    let mut b_colliders = Vec::new();
    for (e, idx) in
        output.map_read().as_slice().iter().zip(0..a.len + b.len)
    {
        if e[0] > 0 {
            if idx < a.len {
                a_colliders.push(&a_triangles[idx as usize]);
            } else {
                b_colliders.push(&b_triangles[(idx - a.len) as usize]);
            }
        }
    }
    Self::colliders_from_output(a_mat, b_mat, &a_colliders, &b_colliders)
}
For completeness, I’ll quickly discuss the execution model of CUDA, a popular parallel computing framework3. CUDA has an organizational hierarchy of grids,
thread blocks, and threads. A thread block is similar to OpenGL’s work group,
it’s a set of concurrent threads that can cooperate via barrier synchronization
and shared memory. A grid is a set of independent and parallel thread blocks.
The user specifies the number of threads per block and the number of blocks in
a grid. Like OpenGL, each block has a unique id within the grid, and each thread
has a unique id within its block. CUDA also distinguishes between shared and
device memory. Shared memory is akin to cache for the accelerator, and device
memory is akin to onboard DRAM. On video cards, this is typically called VRAM.
In OpenGL, shared variables would be similar to CUDA’s shared memory, and
buffers would probably map more closely to CUDA’s device memory.
So it’s time to discuss the fruit of our labors, how much faster can we perform
collision detection with these optimizations? To test this, I computed a 90-second
rolling average of the time it took to perform collision detection. Specifically,
I timed the duration it took to run the get_resolving_forces method, which
runs through every dynamic object in the scene, finds all colliding objects,
computes resolving forces, and adds them up for each object. I measured the times
in microseconds using the system clock. I ran the tests on an Ubuntu 22.04 LTS
machine with an Intel i7-4790, 16GB DDR3 memory, and an NVIDIA GTX 1070.
This was not a very scientific test as
I did not fix the seed used to generate the asteroids nor did I pre-program what
would happen for those 90s. I simply flew around bumping into things, playing
the game in very roughly the same way. Despite that, I think the measurements
are good enough to get a sense of how effective optimizations are and identify a
few key takeaways.
There is a tremendous amount of room to further optimize. Each algorithm has a plethora of policy choices and knobs to tune that I didn’t even try adjusting such as work group and local size, other tree stop criteria for the BVH or Octree, using different bounding volumes in the BVH, how to descend the BVH, handling objects overlapping multiple octants differently, computing the split dimension in the BVH, and adding or removing passes.
Furthermore, I designed the collision detection algorithm almost entirely based on how fun it would be to implement. I didn’t even consider measuring performance until writing this post, which is not the approach you’d want to take for delivering a real product.
With that being said, our baseline algorithm proposed at the end of section 3 clocked in at a whopping 2,118,699$\mu$s. That’s an unacceptable 2.118s. This is well into seconds per frame territory instead of FPS. The algorithm using an Octree with max 32 objects per leaf (denoted Octree(32)), the BVH with 1024 primitives per leaf (denoted BVH(1024)), and the compute shader, clocks in at a speedy 9,448$\mu$s or 9.4ms. This is a speedup of 224.25x!
Here’s all the data I collected:
This reveals a few things right off the bat:
I’d also like to point out that for sequential intersection tests, I used a BVH
with 32 primitives per leaf, but for parallel intersections, I used 1024 primitives.
This is because for the CPU, we want to narrow down the number of triangles
we need to check as much as we can, but for the GPU, we want to use enough primitives
so that data movement costs don’t dominate.
I was a bit surprised that BVH(4), Octree(32) did so poorly, but perhaps
something like BVH(16) is the sweet spot for sequential triangle-triangle
intersections.
Here are just the main sequential variants of the algorithm:
We can really see that the Octree helps, but barely. Speaking of which, I looked into different numbers for the child creation threshold. 32 seems to be a happy medium, at least for the sequential algorithm:
But again, the differences are minuscule compared to the speedup due to parallelization.
Keep in mind that all the tests were done on the 3-pass algorithm using bounding spheres, OBBs, and triangle intersections.
We walked through the development of the 3-pass real-time collision detection algorithm used in Project Oort. There is a lot of work that can be done to this. The book Real-time Collision Detection by Ericson2 is a great resource that covers everything we discussed here and much more.
There is a key issue with the current algorithm: in every frame, we test for a collision solely based on the present location of all objects. It’s possible for an object to go fast enough, or collide with another object small enough, such that in one frame the object is right in front of a collider, and in the next frame it’s right behind the collider. In both cases, there is no collision despite the object traveling through the collider.
The laser would be most susceptible to this bug, but to solve that we could test the collision of a swept ellipsoid with a mesh4.
Hopefully, this post was helpful, thanks for reading!
Möller, T. (1997). A Fast Triangle-Triangle Intersection Test. Journal of Graphics Tools, 2(2), 25–30. doi:10.1080/10867651.1997.10487472 ↩︎ ↩︎
Christer Ericson. (2006). Real-time collision detection. Elsevier, [20]10. ↩︎ ↩︎
M. Garland et al., “Parallel Computing Experiences with CUDA,” in IEEE Micro, vol. 28, no. 4, pp. 13-27, July-Aug. 2008, doi: 10.1109/MM.2008.57. ↩︎
Fauerby, K. (2003). Improved Collision detection and Response. NP ↩︎