Annoying details of a Z-buffer rasterizer

January 4th, 2024

I wrote a software rasterizer for occlusion culling and hit many small speed bumps along the way. Here I reveal what I’ve learned in the hope of you writing your own with more ease than I did.

This is what the system was for:

This page talks about a depth-only rasterizer for small resolutions, like 160x120 at most. It’s not that different from doing tiled rendering at a higher resolution though.

This stuff will probably only make sense if you’re working on something similar yourself right now. Marcus Svensson’s 2016 thesis is a good place to start if you wish to familiarize yourself with the problem. It has a good overview of this particular technique, of conservative rasterization, and discusses the issue of silhouette extraction.

On to the implementation details.

Take care of your deltas

During triangle setup you need to compute per-pixel differences for Z in X and Y axes. These are used to increment an interpolated depth value when moving from one pixel to the next. I call them the deltas below.

You need to compute the deltas in snapped integer coordinates. If you do it in floating point, the interpolated Z may overflow because you are rendering a different triangle than you set up.

The formula for computing the deltas can be derived in two ways:

The first point of view is commonly taken because it’s simpler. In the end the code ends up being something like this:

// Floating point coordinates but snapped to subpixels already.
vec3f v01 = vec3f(v1.x, v1.y, z1) - vec3f(v0.x, v0.y, z0);
vec3f v02 = vec3f(v2.x, v2.y, z2) - vec3f(v0.x, v0.y, z0);

// Compute the triangle normal (not a unit vector)
vec3f N = cross3d(v01, v02);

// Now "the deltas" for X and Y increments
float dZdX = -N.x / N.z;
float dZdY = -N.y / N.z;

In which vec3f is a floating point 3-element vector.

The deltas above can become very large in small triangles. As a workaround I made the triangle setup code use a constant Z if the triangle area was small enough. For rendering occluders it uses the maximum of the vertex depths, and for testing occludees against the Z-buffer I used the minimum respectively. This way the depth tests stay conservative, i.e. never hide objects that should be visible.

The divison in the setup code above is hard to do in fixed point. I just did it with floats and converted to fixed point afterwards. I’m sure there’s some clever way to do it with integers though, for example using a piece-wise linear approximation is suggested in “Small Triangle Barycentrics – Fixed Point” section in Nicolas Guillemot’s 2016 blog post.

Clip near and far, maybe

If your Z buffer resolution is tiny like mine and you do frustum culling then the guard-band is effectively huge so you can get away with just clipping to near and far planes. See “Guard-band clipping” in this post if you don’t know what I’m talking about.

It’s tempting to skimp on clipping and just discard any occluder that touches the near plane. This is not a profitable strategy because you’ll discard perfectly valid occluders that just happen to pass through the near plane outside the camera view, like this:

The same applies to large objects: you can’t declare them visible just because their bounding box touched the near plane.

Triangles share edges with each other. You should order the edge vertices consistently before intersection with a clipping plane, otherwise you get cracks due to inaccuracies. In practice it’s enough to always flip the edge so that the vertex inside the viewport comes first. See section 8.3.5 “More on Polygon Splitting Robustness” in the Realtime Collision Detection book.

Conservative rasterization woes

This means either inner-conservative or outer-conservative rasterization where you fill pixels fully inside or all pixels touching the triangle, respectively. See the 2016 thesis mentioned in the beginning.

For occluders shrinking silhouttes is needed for reliability because otherwise they might cover objects that are slightly peeking behind them in the actual render. If this is done with conservative rasterization and not with shrinking occluder geometry, finding silhouette edges is a necessity. See the comparison below.

If all occluder triangles are rendered inner-conservatively quads will show cracks on their diagonal (right). This wall consists of three convex quad occluders.

This feature was an endless source of tiny problems.

In order to shrink or inflate silhouettes, some kind of logic is needed to detect them. It’s straightforward for convex objects if you know how – find all edges that connect a front-facing triangle to a back-facing one – but of course you need a data structure to store this info in. More work for the graphics coder!

Also, clipping often produces new edges, so the triangle neighbor info computed at startup is not valid anymore. In other words, even if an edge was part of the silhouette before clipping, it may not be afterwards and you need to deal with it somehow. I dealt with it by simply not doing conservative rasterization of any clipped triangles :)

Also, outer-conservative rasterization (inflation) will cause interpolants to “overshoot” if you will. Since it by definition fills pixels with their centroids outside the exact triangle bounds, you will end up computing values that are outside of value range of your interpolant. In practice this means a near clipped triangle with a vertex at Z=0 will produce values with negative depths. Good luck storing that in your UINT16 Z-buffer!

The third one is a small one that took me a while to track down. Two-sided triangles can be implemented by flipping vertex order of back-facing triangles but edge identities also change when you do that. If the code earlier had already marked silhouette edges, you need to rotate those flags accordingly. Otherwise back-facing triangles will show cracks because the wrong edges get drawn conservatively.

For reference, this is how I compute the conservative edge biases.
// Returns a normal vector pointing to the left of line segment ab. In screen space where y axis grows downwards.
vec2 get_edge_normal(vec2 a, vec2 b)
{
    return (vec2){ b.y - a.y, -(b.x - a.x) };
}

// See Tomas Akenine-Möller and Timo Aila, "A Simple Algorithm for Conservative and Tiled Rasterization", 2005.
int compute_conservative_edge_bias(vec2 a, vec2 b, bool shrink)
{
    // normal points inside the triangle, or the left of line segment 'ab'
    vec2 n = get_edge_normal(a, b);
    // The division by 2 is here because the results looked better.
    int edge_bias = (abs(n.x) + abs(n.y)) / 2; 

    if (shrink) {
        return -edge_bias;
    } else {
        return edge_bias;
    }
}

with

typedef struct vec2_s {
    int x, y;
} vec2;
Then add that bias to the per-edge values you already computed for fill rules.

Reverse-Z doesn’t save you

Finally, reverse-Z will not give you better precision you if you have an integer Z-buffer. It’s just a Zreverse = 1 − Z transform and helps to distribute floating point precision more evenly.

Good resources for the brave software rasterization warrior

Thanks to mankeli for reading the first draft of this essay.

Appendix: Plane equation vs finite differences

I refer to Fabian Giesen’s blog post listed above. The function I mean is the λ(p)k function for each triangle edge k that returns the interpolated Z for a screen coordinate p. Then the X delta for the first edge would be dZ/dX = λ(p+(1,0))0 − λ(p)0, that’s a finite difference between two pixels.

I wrote some code the verify that I get the same results as with the plane equation route and didn’t simplify the equation algebraically.

Code

Here’s an excerpt with the draw_tri and occ_draw_indexed_mesh_flags functions of my renderer. It’s in C, for the Nintendo 64, and a bit naive but I left it here to answer any extremely detailed questions that may have arisen in the critical Internet reader that is the today’s youth.

Constants and flags
#define SUBPIXEL_BITS (2)

#if SUBPIXEL_BITS == 0
#define SUBPIXEL_ROUND_BIAS (0)
#else
#define SUBPIXEL_ROUND_BIAS (1<<(SUBPIXEL_BITS-1))
#endif

#define SUBPIXEL_SCALE (1 << SUBPIXEL_BITS)

#define DELTA_BITS (24) // It seems 24 bits give about 1/8 mean per-pixel error of 16 bits deltas.
#define DELTA_SCALE (1<<DELTA_BITS)

#define FLOAT_TO_FIXED32(f) (int32_t)(f * 0x10000)
#define FLOAT_TO_FIXED32_ROUND(f) (int32_t)(f * 0x10000 + 0.5f)
#define FLOAT_TO_U16(f) (uint16_t)(f * 0x10000)
#define U16_TO_FLOAT(u) ((float)u * 0.0002442002f) // Approximately f/0xffff

#define OCC_MAX_Z (0xffff)

#define ZBUFFER_UINT_PTR_AT(zbuffer, x, y) ((u_uint16_t *)(zbuffer->buffer + (zbuffer->stride * y + x * sizeof(uint16_t))))

#define DUMP_WHEN_Z_OVERFLOWS 1
#define SCREENSPACE_BIAS (-0.50f) // an empirical bias for (x,y) screenspace coordinates to make them cover OpenGL drawn pixels

extern bool config_shrink_silhouettes; // detect edges with flipped viewspace Z signs in each neighbor and add inner conservative flags
extern bool config_discard_based_on_tr_code;
extern bool config_inflate_rough_bounds;
extern bool config_report_near_clip_as_visible; // if queried polygons clip the near plane, always report them as visible
extern int config_force_rough_test_only; // return screen space bounding box test result directly

enum {
    RASTER_FLAG_BACKFACE_CULL = 1,
    RASTER_FLAG_EARLY_OUT = 1 << 1,         // Return when the first depth pass is found
    RASTER_FLAG_NEG_SLOPE_BIAS = 1 << 2,    // Negative slope bias, nudge closer, minimum per-pixel depth
    RASTER_FLAG_POS_SLOPE_BIAS = 1 << 3,    // Positive slope bias, nudge farther, maximum per-pixel depth
    RASTER_FLAG_ROUND_DEPTH_UP = 1 << 4,    // Round depths to the next higher 16-bit integer. Default is rounding down.
    RASTER_FLAG_DISCARD_FAR = 1 << 5,       // Discard any triangle that touches the far plane.
    RASTER_FLAG_WRITE_DEPTH = 1 << 6,       // Actually write to depth buffer
    RASTER_FLAG_REPORT_VISIBILITY = 1 << 7, // Write out visible pixel, combined with EARLY_OUT for testing-only mode
    RASTER_FLAG_SHRINK_EDGE_01 = 1 << 8,    // Inner-conservative rasterization
    RASTER_FLAG_SHRINK_EDGE_12 = 1 << 9,
    RASTER_FLAG_SHRINK_EDGE_20 = 1 << 10,
    RASTER_FLAG_EXPAND_EDGE_01 = 1 << 11,   // Outer-conservative rasterization
    RASTER_FLAG_EXPAND_EDGE_12 = 1 << 12,
    RASTER_FLAG_EXPAND_EDGE_20 = 1 << 13,
    RASTER_FLAG_RESERVED14 = 1 << 14,
    RASTER_FLAG_RESERVED15 = 1 << 15,
};

typedef uint32_t occ_raster_flags_t;

#define RASTER_FLAG_MASK_SHRINK (RASTER_FLAG_SHRINK_EDGE_01 | RASTER_FLAG_SHRINK_EDGE_12 | RASTER_FLAG_SHRINK_EDGE_20)

enum {
    OCCLUDER_TWO_SIDED = 1,
    OCCLUDER_TEST_FIRST = 1 << 1,
};
typedef uint32_t occ_occluder_flags_t;

#define OCC_RASTER_FLAGS_DRAW  (RASTER_FLAG_BACKFACE_CULL | RASTER_FLAG_WRITE_DEPTH |RASTER_FLAG_ROUND_DEPTH_UP)
#define OCC_RASTER_FLAGS_QUERY (RASTER_FLAG_BACKFACE_CULL | RASTER_FLAG_EARLY_OUT | RASTER_FLAG_REPORT_VISIBILITY | RASTER_FLAG_EXPAND_EDGE_01 | RASTER_FLAG_EXPAND_EDGE_12 | RASTER_FLAG_EXPAND_EDGE_20)
Rasterizer implementation
static bool isTopLeftEdge(vec2 a, vec2 b)
{
    // "In a counter-clockwise triangle, a top edge is an edge that is exactly horizontal
    //  and goes towards the left, i.e. its end point is left of its start point."
    if (a.y == b.y && b.x < a.x) return true;

    // "In a counter-clockwise triangle, a left edge is an edge that goes down,
    //  i.e. its end point is strictly below its start point."

    // But note that on the screen Y axis grows downwards so we check if the end point
    // Y coordinate is greater than start point Y.
    if (b.y > a.y) return true;
    return false;
}

static float orient2df(float* a, float* b, float* c)
{
    return ((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]));
}

static int orient2d(vec2 a, vec2 b, vec2 c)
{
    return ((b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x));
}

static int orient2d_subpixel(vec2 a, vec2 b, vec2 c)
{
    // We multiply two I.F fixed point numbers resulting in (I-F).2F format,
    // so we shift by F to the right to get the the result in I.F format again.
    return orient2d(a, b, c) >> SUBPIXEL_BITS;
}

// Returns a normal vector pointing to the left of line segment ab. In screen space where y axis grows downwards.
static vec2 get_edge_normal(vec2 a, vec2 b)
{
    return (vec2){ b.y - a.y, -(b.x - a.x) };
}

static int compute_conservative_edge_bias(vec2 a, vec2 b, bool shrink)
{
    // See Tomas Akenine-Möller and Timo Aila, "A Simple Algorithm for Conservative and Tiled Rasterization", 2005.
    vec2 n = get_edge_normal(a, b);            // normal points inside the triangle, or the left of line segment 'ab'
    int edge_bias = (abs(n.x) + abs(n.y)) * 0.75; // factor chosen empirically to keep low-rez pixels inside high-rez RDP bounds
    if (shrink) {
        return -edge_bias;
    } else {
        return edge_bias;
    }
}

void draw_tri(
    vec2f v0f,
    vec2f v1f,
    vec2f v2f,
    float Z0f,
    float Z1f,
    float Z2f,
    occ_raster_flags_t flags,
    occ_raster_query_result_t* result,
    surface_t *zbuffer)
{
    const bool super_verbose = false;

    // The screenspace bias is added to empirically match SW depth map pixels to hi-rez RDP picture.
    vec2 v0 = {SUBPIXEL_SCALE * (v0f.x+SCREENSPACE_BIAS) + 0.5f, SUBPIXEL_SCALE * (v0f.y+SCREENSPACE_BIAS) + 0.5f};
    vec2 v1 = {SUBPIXEL_SCALE * (v1f.x+SCREENSPACE_BIAS) + 0.5f, SUBPIXEL_SCALE * (v1f.y+SCREENSPACE_BIAS) + 0.5f};
    vec2 v2 = {SUBPIXEL_SCALE * (v2f.x+SCREENSPACE_BIAS) + 0.5f, SUBPIXEL_SCALE * (v2f.y+SCREENSPACE_BIAS) + 0.5f};

    int area2x = -orient2d_subpixel(v0, v1, v2);

    if (false) {
        debugf("area2x int: %d\n", area2x);
        // debugf("minb: (%d, %d), maxb: (%d, %d), size: %dx%d\n", minb.x, minb.y, maxb.x, maxb.y, maxb.x-minb.x, maxb.y-minb.y);
        debugf("v0f: (%f, %f), v1f: (%f, %f), v2f: (%f, %f)\n",
               v0f.x, v0f.y, v1f.x, v1f.y, v2f.x, v2f.y);
        debugf("v0: (%d, %d), v1: (%d, %d), v2: (%d, %d)\n",
               v0.x, v0.y, v1.x, v1.y, v2.x, v2.y);

        // debugf("v01: (%f, %f, %f), v02: (%f, %f, %f)\n", v01.x, v01.y, v01.z, v02.x, v02.y, v02.z);
    }

    if (area2x <= 0) {
        if (flags & RASTER_FLAG_BACKFACE_CULL) {
            return;
        } else if (area2x != 0) {
            SWAP(v1, v2);
            SWAP(Z1f, Z2f);
            area2x = -area2x;
            // Swap shrink flags now that edge identities are different
            occ_raster_flags_t new = flags & ~RASTER_FLAG_MASK_SHRINK;
            new |= (flags & RASTER_FLAG_SHRINK_EDGE_01) << 2; // swap 01 -> 02=20
            new |= (flags & RASTER_FLAG_SHRINK_EDGE_20) >> 2; // swap 20 -> 10=01
            flags = new;
        } else {
            return;
        }
    }

    vec2 minb = {
        (min(v0.x, min(v1.x, v2.x)) >> SUBPIXEL_BITS),
        (min(v0.y, min(v1.y, v2.y)) >> SUBPIXEL_BITS)
        };

    vec2 maxb = {
        ((max(v0.x, max(v1.x, v2.x)) + SUBPIXEL_SCALE-1) >> SUBPIXEL_BITS),
        ((max(v0.y, max(v1.y, v2.y)) + SUBPIXEL_SCALE-1) >> SUBPIXEL_BITS)
        };

    if (minb.x < 0) minb.x = 0;
    if (minb.y < 0) minb.y = 0;
    if (maxb.x > zbuffer->width - 1) maxb.x = zbuffer->width - 1;
    if (maxb.y > zbuffer->height - 1) maxb.y = zbuffer->height - 1;

    if (minb.x >= maxb.x || minb.y >= maxb.y) return;

    vec2 p_start = { minb.x << SUBPIXEL_BITS, minb.y << SUBPIXEL_BITS };

    int A01 = -(v0.y - v1.y), B01 = -(v1.x - v0.x);
    int A12 = -(v1.y - v2.y), B12 = -(v2.x - v1.x);
    int A20 = -(v2.y - v0.y), B20 = -(v0.x - v2.x);

    int w0_row = -orient2d_subpixel(v1, v2, p_start);
    int w1_row = -orient2d_subpixel(v2, v0, p_start);
    int w2_row = -orient2d_subpixel(v0, v1, p_start);

    int bias0 = isTopLeftEdge(v1, v2) ? 0 : -1;
    int bias1 = isTopLeftEdge(v2, v0) ? 0 : -1;
    int bias2 = isTopLeftEdge(v0, v1) ? 0 : -1;

    // Adjust edge functions based on triangle edge normals.
    assert(!((flags & RASTER_FLAG_SHRINK_EDGE_12) && (flags & RASTER_FLAG_EXPAND_EDGE_12)));
    assert(!((flags & RASTER_FLAG_SHRINK_EDGE_20) && (flags & RASTER_FLAG_EXPAND_EDGE_20)));
    assert(!((flags & RASTER_FLAG_SHRINK_EDGE_01) && (flags & RASTER_FLAG_EXPAND_EDGE_01)));

    if (flags & (RASTER_FLAG_SHRINK_EDGE_12 | RASTER_FLAG_EXPAND_EDGE_12)) {
        bias0 += compute_conservative_edge_bias(v1, v2, flags & RASTER_FLAG_SHRINK_EDGE_12);
    }
    if (flags & (RASTER_FLAG_SHRINK_EDGE_20 | RASTER_FLAG_EXPAND_EDGE_20)) {
        bias1 += compute_conservative_edge_bias(v2, v0, flags & RASTER_FLAG_SHRINK_EDGE_20);
    }
    if (flags & (RASTER_FLAG_SHRINK_EDGE_01 | RASTER_FLAG_EXPAND_EDGE_01)) {
        bias2 += compute_conservative_edge_bias(v0, v1, flags & RASTER_FLAG_SHRINK_EDGE_01);
    }

    w0_row += bias0;
    w1_row += bias1;
    w2_row += bias2;

    // Prepare Z deltas
    // Prepare inputs to a formula solved via a 3D plane equation with subpixel XY coords and Z.
    // See https://tutorial.math.lamar.edu/classes/calciii/eqnsofplanes.aspx
    // and the "Interpolated values" section at
    // https://fgiesen.wordpress.com/2011/07/08/a-trip-through-the-graphics-pipeline-2011-part-7/

    vec3f v01 = vec3f_sub((vec3f){v1.x, v1.y, Z1f}, (vec3f){v0.x, v0.y, Z0f});
    vec3f v02 = vec3f_sub((vec3f){v2.x, v2.y, Z2f}, (vec3f){v0.x, v0.y, Z0f});

    vec3f N = cross3df(v01, v02);
    N.z *= inv_subpixel_scale; // Scale back the fixed point scale multiply inside cross3df

    // N is now in subpixel scale, divide again to bring it to per-pixel scale
    N.x *= inv_subpixel_scale;
    N.y *= inv_subpixel_scale;
    N.z *= inv_subpixel_scale;
    float dZdx = -N.x / N.z;
    float dZdy = -N.y / N.z;

    // Compute Z value at the starting pixel at (minb.x, minb.y). It's computed by extrapolating the Z value at vertex 0.
    float Zf_row = Z0f
        + ((minb.x * SUBPIXEL_SCALE - v0.x) / SUBPIXEL_SCALE) * dZdx
        + ((minb.y * SUBPIXEL_SCALE - v0.y) / SUBPIXEL_SCALE) * dZdy;


    // Write a constant depth for small triangles.
    // It's a a workaround for small triangle delta precision problems.
    // Small triangle area is less than one pixel^2.
    const int min_triangle_area = SUBPIXEL_SCALE * SUBPIXEL_SCALE * 2;
    const bool is_small = abs(area2x) < min_triangle_area;

    if (is_small) {
        dZdx = 0.f;
        dZdy = 0.f;
        if (flags & RASTER_FLAG_WRITE_DEPTH) {
            // Writing a depth? Make it conservative by writing farthest depth.
            Zf_row = max(Z0f, max(Z1f, Z2f));
        } else {
            // Testing the depth? Test only the closest vertex value.
            Zf_row = min(Z0f, min(Z1f, Z2f));
        }
    }

    if (false) {
        debugf("area2x: %d\n", area2x);
        debugf("Z0f: %f, Z1f: %f, Z2f: %f\n", Z0f, Z1f, Z2f);
        debugf("Zf_row: %f, dZdx: %f, dZdy: %f\n", Zf_row, dZdx, dZdy);
        debugf("minb: (%d, %d), maxb: (%d, %d), size: %dx%d\n", minb.x, minb.y, maxb.x, maxb.y, maxb.x - minb.x, maxb.y - minb.y);
        debugf("v0f: (%f, %f), v1f: (%f, %f), v2f: (%f, %f)\n", v0f.x, v0f.y, v1f.x, v1f.y, v2f.x, v2f.y);
        debugf("v0: (%d, %d), v1: (%d, %d), v2: (%d, %d)\n", v0.x, v0.y, v1.x, v1.y, v2.x, v2.y);
        debugf("v01: (%f, %f, %f), v02: (%f, %f, %f)\n", v01.x, v01.y, v01.z, v02.x, v02.y, v02.z);
    }

    // Fixed point deltas for the integer-only inner loop. We use DELTA_BITS of precision.
    int32_t Z_row_fixed = (int32_t)(DELTA_SCALE * Zf_row);
    int32_t dZdx_fixed = (int32_t)(DELTA_SCALE * dZdx);
    int32_t dZdy_fixed = (int32_t)(DELTA_SCALE * dZdy);
    int32_t max_Z_fixed = (int32_t)(DELTA_SCALE * 1.0f) - 1;

    // Problem: Negative biases make queries super conservative and you can see objects behind walls!
    if (flags & RASTER_FLAG_NEG_SLOPE_BIAS) {
        // We can make sure each interpolated depth value is less or equal to the actual triangle min depth with this bias.
        // Note: this will produce values one-pixel's worth under the triangle's original depth range.
        // See MaxDepthSlope in https://microsoft.github.io/DirectX-Specs/d3d/archive/D3D11_3_FunctionalSpec.htm#DepthBias
        Z_row_fixed -= max(abs(dZdx_fixed), abs(dZdy_fixed));
    } else if (flags & RASTER_FLAG_POS_SLOPE_BIAS) {
        Z_row_fixed += max(abs(dZdx_fixed), abs(dZdy_fixed));
    }

    assert(!((flags & RASTER_FLAG_NEG_SLOPE_BIAS) && (flags & RASTER_FLAG_POS_SLOPE_BIAS))); // Mutually exclusive flags

    if (super_verbose) {
        debugf("Z_row = %f, Z_row_fixed = %ld\n", Zf_row, Z_row_fixed);
    }

    // Only 'p', 'minb' and 'maxb' are in whole-pixel coordinates here. Others are all in sub-pixel coordinates.
    vec2 p = {-1, -1};

    for (p.y = minb.y; p.y <= maxb.y; p.y++) {
        int32_t w0 = w0_row;
        int32_t w1 = w1_row;
        int32_t w2 = w2_row;
        int32_t Z_incr_fixed = Z_row_fixed;

        for (p.x = minb.x; p.x <= maxb.x; p.x++) {
            if (super_verbose) {
                debugf("| %s (%-2d, %-2d) %-8f ", (w0 | w1 | w2) >= 0 ? "X" : ".", p.x, p.y, Z_incr_fixed/(float)DELTA_SCALE);
            }
            if ((w0 | w1 | w2) >= 0) {
                int bias = 0;

                if (flags & RASTER_FLAG_ROUND_DEPTH_UP) {
                    bias = (1 << DELTA_BITS) - 1;
                }

                uint16_t depth;

                if (Z_incr_fixed < max_Z_fixed) {
                    depth = (Z_incr_fixed + bias) >> (DELTA_BITS - 16);
                } else {
                    depth = 0xffff - 1;
                }

                u_uint16_t *buf = ZBUFFER_UINT_PTR_AT(zbuffer, p.x, p.y);

                if (depth < *buf) {
                    if (flags & RASTER_FLAG_WRITE_DEPTH) {
                        *buf = depth;
                    }
                    if (flags & RASTER_FLAG_REPORT_VISIBILITY) {
                        assert(result);
                        result->visible = true;
                        result->x = p.x;
                        result->y = p.y;
                        result->depth = depth;
                        if (g_verbose_early_out) {
                            debugf("\nvisible (%d < %d) at (%d, %d), v0=(%d,%d)\n", depth, *buf, p.x, p.y, v0.x>>SUBPIXEL_BITS, v0.y>>SUBPIXEL_BITS);
                        }
                        if (super_verbose) {
                            debugf("Z_incr_fixed: %ld\n", Z_incr_fixed);
                            debugf("relative: (%d, %d)\n", p.x - minb.x, p.y - minb.y);
                            debugf("minb: (%d, %d), maxb: (%d, %d), size: %dx%d\n", minb.x, minb.y, maxb.x, maxb.y, maxb.x-minb.x, maxb.y-minb.y);
                            debugf("z0f: %f, z1f: %f, z2f: %f\n", Z0f, Z1f, Z2f);
                            debugf("v0f: (%f, %f), v1f: (%f, %f), v2f: (%f, %f)\n",
                                   v0f.x, v0f.y, v1f.x, v1f.y, v2f.x, v2f.y);
                            debugf("v0: (%d, %d), v1: (%d, %d), v2: (%d, %d)\n",
                                   v0.x, v0.y, v1.x, v1.y, v2.x, v2.y);

                            debugf("v01: (%f, %f, %f), v02: (%f, %f, %f)\n",
                                   v01.x, v01.y, v01.z, v02.x, v02.y, v02.z);
                            debugf("N: (%f, %f, %f)\n",
                                   N.x, N.y, N.z);
                            debugf("dZdx, dZdy: (%f, %f)\n", dZdx, dZdy);
                            debugf("dZdx_fixed, dZdy_fixed: (%ld, %ld)\n", dZdx_fixed, dZdy_fixed);

                            debugf("zf_row2: %f\n", Zf_row);
                            float lambda0 = (float)(w0 - bias0) / area2x;
                            float lambda1 = (float)(w1 - bias1) / area2x;
                            float lambda2 = (float)(w2 - bias2) / area2x;
                            float Zf_bary = lambda0 * Z0f + lambda1 * Z1f + lambda2 * Z2f;
                            debugf("Zf_bary = %f; L0=%f; L1=%f; L2=%f;\n", Zf_bary, lambda0, lambda1, lambda2);
                            debugf("w0=%ld; w1=%ld; w2=%ld;\n", w0, w1, w2);

                            debugf("A01 = %d; B01 = %d\n", A01, B01);
                            debugf("A12 = %d; B12 = %d\n", A12, B12);
                            debugf("A20 = %d; B20 = %d\n", A20, B20);
                            debugf("area2x: %d\n", area2x);
                            debugf("\n");
                            debugf("set_grid(min=(%d, %d), max=(%d, %d))\n", minb.x, minb.y, maxb.x+1, maxb.y+1);
                            debugf("draw_grid()\n");
                            debugf("draw_tri([(%f, %f), (%f, %f), (%f, %f)])\n",
                                   v0f.x, v0f.y, v1f.x, v1f.y, v2f.x, v2f.y);

                            while (true) {}; // HACK
                        }
                    }

                    if (flags & RASTER_FLAG_EARLY_OUT) {
                        return;
                    }
                }

            }

            w0 += A12;
            w1 += A20;
            w2 += A01;
            Z_incr_fixed += dZdx_fixed;
        }

        w0_row += B12;
        w1_row += B20;
        w2_row += B01;
        Z_row_fixed += dZdy_fixed;

        if (super_verbose) { debugf("\n"); }
    }
}

#define OCC_MAX_MESH_VERTEX_COUNT (24) // enough for a cube with duplicated verts
#define OCC_MAX_MESH_INDEX_COUNT (30)

float matrix_mult_z_only(const matrix_t *m, const float *v)
{
    return m->m[0][2] * v[0] + m->m[1][2] * v[1] + m->m[2][2] * v[2] + m->m[3][2] * v[3];
}

void occ_draw_indexed_mesh_flags(occ_culler_t *occ, surface_t *zbuffer, const matrix_t *model_xform, const occ_mesh_t* mesh,
                                cpu_vtx_t* cached_cs_verts,
                                uint16_t* tri_neighbors, occ_target_t* target, const occ_raster_flags_t flags,
                                occ_raster_query_result_t* query_result)
{
    assert(zbuffer->width == occ->viewport.width);
    assert(zbuffer->height == occ->viewport.height);

    if (query_result) {
        query_result->visible = false;
    }

    // Transform all vertices first
    prof_begin(REGION_TRANSFORM);

    prof_begin(REGION_TRANSFORM_MVP);
    matrix_t* mvp = &occ->mvp;
    matrix_t* modelview = NULL;
    matrix_t mvp_new, modelview_new;

    if (!model_xform) {
        // No per-object global transform: use defaults
        //debugf("using default for mesh %p\n", mesh);
        mvp = &occ->mvp;
        modelview = &occ->view_matrix;
    } else {
        // Otherwise compute new ModelView and MVP matrices
        //debugf("computing for mesh %p\n", mesh);
        //print_matrix(model_xform);
        mvp = &mvp_new;
        modelview = &modelview_new;
        matrix_mult_full(mvp, &occ->mvp, model_xform);
        matrix_mult_full(modelview, &occ->view_matrix, model_xform);
    }
    prof_end(REGION_TRANSFORM_MVP);

    prof_begin(REGION_TRANSFORM_DRAW);

    cpu_vtx_t all_verts[OCC_MAX_MESH_VERTEX_COUNT];
    bool tri_faces_camera[OCC_MAX_MESH_INDEX_COUNT];
    bool tri_faces_camera_has_data = false;

    for (uint32_t i = 0; i < mesh->num_vertices; i++) {
        all_verts[i].obj_attributes.position[0] = mesh->vertices[i].position[0];
        all_verts[i].obj_attributes.position[1] = mesh->vertices[i].position[1];
        all_verts[i].obj_attributes.position[2] = mesh->vertices[i].position[2];
        all_verts[i].obj_attributes.position[3] = 1.0f; // Q: where does cpu pipeline set this?

        if (cached_cs_verts) {
            all_verts[i] = cached_cs_verts[i];
        } else {
            cpu_vertex_pre_tr(&all_verts[i], mvp);
        }
        cpu_vertex_calc_screenspace(&all_verts[i]);
    }

    int num_tris = mesh->num_indices/3;

    // Compute signed triangle areas only for occluders
    if (tri_neighbors && config_shrink_silhouettes) {
        for (int i = 0; i < num_tris; i++) {
            float *a = &all_verts[mesh->indices[i * 3 + 0]].screen_pos[0];
            float *b = &all_verts[mesh->indices[i * 3 + 1]].screen_pos[0];
            float *c = &all_verts[mesh->indices[i * 3 + 2]].screen_pos[0];
            float area = -orient2df(a, b, c);
            tri_faces_camera[i] = area > 0;
            // debugf("tri %d facing: %d with area: %f\n", i, tri_faces_camera[i], area);
        }

        tri_faces_camera_has_data = true;
    }

    prof_end(REGION_TRANSFORM_DRAW);
    prof_end(REGION_TRANSFORM);

    int num_tris_drawn = 0;
    int ofs = 0;
    if (target) {
        ofs = target->last_visible_idx; // start from where we last found a visible pixel
        target->last_visible_idx = 0;
    }

    for (int draw_idx = 0; draw_idx < mesh->num_indices; draw_idx += 3) {
        int idx = (draw_idx + ofs) % mesh->num_indices; // start from 'ofs' but render the whole mesh
        const uint16_t *inds = &mesh->indices[idx];
        cpu_vtx_t verts[3] = {all_verts[inds[0]], all_verts[inds[1]], all_verts[inds[2]]};
        cpu_vtx_t clipping_cache[CLIPPING_CACHE_SIZE];
        cpu_clipping_list_t clipping_list = {.count = 0};

        if (g_verbose_setup) {
            debugf("pos=(%f, %f, %f, %f), cs_pos=(%f, %f, %f, %f), tr_code=%d\n",
                   verts[0].obj_attributes.position[0],
                   verts[0].obj_attributes.position[1],
                   verts[0].obj_attributes.position[2],
                   verts[0].obj_attributes.position[3],
                   verts[0].cs_pos[0],
                   verts[0].cs_pos[1],
                   verts[0].cs_pos[2],
                   verts[0].cs_pos[3],
                   verts[0].tr_code);
            debugf("screen_pos: (%f, %f), depth=%f, inv_w=%f\n",
                   verts[0].screen_pos[0],
                   verts[0].screen_pos[1],
                   verts[0].depth,
                   verts[0].inv_w);
        }


        uint8_t tr_codes = 0xff;
        tr_codes &= verts[0].tr_code;
        tr_codes &= verts[1].tr_code;
        tr_codes &= verts[2].tr_code;

        // Trivial rejection
        if (tr_codes) {
            continue;
        }

        uint32_t edge_flag_mask = 0;

        int tri_idx = idx/3;

        if (false) {
            // Workaround: Disable early backface culling because it seems unreliable near the screen edges.
            if (flags & RASTER_FLAG_BACKFACE_CULL) {
                assert(tri_faces_camera_has_data);
                if (!tri_faces_camera[tri_idx]) {
                    continue;
                }
            }

            if (!(flags & RASTER_FLAG_BACKFACE_CULL)) {
                debugf("is=%d, tri_idx=%d, faces=%d\n", draw_idx, tri_idx, tri_faces_camera[tri_idx]);
            }
        }

        if (tri_neighbors && config_shrink_silhouettes) {
            assert(tri_faces_camera_has_data);

            // Silhouette edges join triangles that face different directions
            bool this_facing = tri_faces_camera[tri_idx];

            for (int j = 0; j < 3; j++) {
                uint16_t other = tri_neighbors[tri_idx * 3 + j];
                if (other != OCC_NO_EDGE_NEIGHBOR) {
                    if (this_facing != tri_faces_camera[other]) {
                        edge_flag_mask |= (RASTER_FLAG_SHRINK_EDGE_01 << j);
                        break;
                    }
                }
            }
        }

        const bool clips_near = (verts[0].tr_code | verts[1].tr_code | verts[2].tr_code) & (1 << NEAR_PLANE_INDEX);
        const bool clips_far = (verts[0].tr_code | verts[1].tr_code | verts[2].tr_code) & (1 << FAR_PLANE_INDEX);

        if (config_report_near_clip_as_visible) {
            // If we are drawing for a precise query and we hit a near clip: report as visible
            if (clips_near && (flags & RASTER_FLAG_REPORT_VISIBILITY) && !(flags & RASTER_FLAG_WRITE_DEPTH)) {
                assert(query_result && "must pass in a non-NULL query_result if asking for a visibility result");
                if (query_result) {
                    query_result->visible = true;
                    query_result->num_tris_drawn = num_tris_drawn;
                    return;
                }
            }
        }

        if (clips_near && g_near_clipping_action == CLIP_ACTION_REJECT) {
            continue;
        } 

        assert(g_near_clipping_action == CLIP_ACTION_DO_IT);

        if (clips_near || clips_far) {
            // tr_code   = clip against screen bounds, used for rejection
            // clip_code = clipped against guard bands, used for actual clipping
            //
            // We clip only against the near and far plane so they are the same.

            verts[0].clip_code = verts[0].tr_code;
            verts[1].clip_code = verts[1].tr_code;
            verts[2].clip_code = verts[2].tr_code;

            const uint8_t plane_mask = (1 << NEAR_PLANE_INDEX) | (1 << FAR_PLANE_INDEX);
            cpu_gl_clip_triangle(&verts[0], &verts[1], &verts[2], plane_mask, clipping_cache, &clipping_list);

            // Workaround: Don't shrink edges. Clipping invalidates silhoutte checks we made above so just rasterize normally
            //             to avoid cracks. This hack makes occluders cover too many pixels which may lead to culling errors.
            edge_flag_mask = 0;
        }

        if (clips_far && (flags & RASTER_FLAG_DISCARD_FAR)) {
            // Reject triangles that touch the far clip plane when rendering occluders. We can't store their farthest depth anyway.
            continue;
        }

        if (clipping_list.count == 0) {
            prof_begin(REGION_RASTERIZATION);
            draw_tri(
                (vec2f){verts[0].screen_pos[0], verts[0].screen_pos[1]},
                (vec2f){verts[1].screen_pos[0], verts[1].screen_pos[1]},
                (vec2f){verts[2].screen_pos[0], verts[2].screen_pos[1]},
                verts[0].depth, verts[1].depth, verts[2].depth,
                flags | edge_flag_mask, query_result,
                zbuffer);
            prof_end(REGION_RASTERIZATION);
            num_tris_drawn++;
        } else {
            prof_begin(REGION_RASTERIZATION);
            for (uint32_t i = 1; i < clipping_list.count; i++) {
                vec2f sv[3];
                sv[0].x = clipping_list.vertices[0]->screen_pos[0];
                sv[0].y = clipping_list.vertices[0]->screen_pos[1];
                sv[1].x = clipping_list.vertices[i - 1]->screen_pos[0];
                sv[1].y = clipping_list.vertices[i - 1]->screen_pos[1];
                sv[2].x = clipping_list.vertices[i]->screen_pos[0];
                sv[2].y = clipping_list.vertices[i]->screen_pos[1];

                draw_tri(
                    sv[0], sv[1], sv[2],
                    clipping_list.vertices[0]->depth, clipping_list.vertices[i - 1]->depth, clipping_list.vertices[i]->depth,
                    flags | edge_flag_mask, query_result,
                    zbuffer);
                num_tris_drawn++;

                if ((flags & RASTER_FLAG_EARLY_OUT) && query_result && query_result->visible) {
                    break;
                }
            }
            prof_end(REGION_RASTERIZATION);
        }

        if (query_result) {
            query_result->num_tris_drawn = num_tris_drawn;
            if (g_verbose_visibility_tracking) {
                debugf("visible: %d\n", query_result->visible);
            }
        }

        // Early out from all triangles even if only one of them was visible
        if ((flags & RASTER_FLAG_EARLY_OUT) && query_result && query_result->visible) {
            target->last_visible_idx = idx;
            if (target && g_verbose_visibility_tracking) {
                debugf("was visible at wrapped_is = %d = (%d+%d) %% %lu\n", idx, draw_idx, ofs, mesh->num_indices);
            }
            return;
        }
    }
}