zig Proposal: do not include guaranteed-zero bits in packed representation of pointers

Background

Zig allows pointer types to have an alignment specified; a power-of-two value which the address (if defined) is guaranteed to be a multiple of. A pointer's alignment defaults to the child type's "natural alignment", and can be overriden with the syntax *align(n) T.

For well-defined pointers whose type has an alignment of 1 << n, we know by definition that the least significant n bits of the value must always be 0. This makes those bits effectively redundant: we can reconstruct them without storing them in memory. In a sense, they are akin to padding in structs; necessary for the CPU to easily operate on the value (all ISAs I know of always work on "full" pointers), but logically redundant.

Proposal

When storing pointers in a packed struct or packed union, omit bits which are known to be zero based on the pointer's alignment. That is, for an alignment of 1 << n, do not store the least significant n bits.

The size of a pointer outside of these datastructures is unaffected: @sizeOf(P) for any pointer type P always compares equal to @sizeOf(usize) like today, and "plain" pointers are always ABI-compatible (the low bits are zeroed as you would expect). Only when stored in a packed aggregate do we eliminate these extra bits, akin to how we eliminate padding bits from types like bool.

The main motivation for this proposal is pointer tagging. It's not uncommon for certain datastructures etc to repurpose some bits of pointer values to store other state; for instance, if storing a pointer with alignment 2, the bottom bit can be repurposed to store a single boolean value (we just mask this bit out to get the actual pointer). Today, this kind of pattern would usually be achieved by using @intFromPtr and @ptrFromInt to store a usize rather than a pointer. However, this has solution has two issues. The first is one of usability: this can be a bit clunky to write correctly, requires manual bit twiddling, and adds redundant safety checks. The second, more severe, issue, is that this round trip usually loses all pointer provenance information. Pointer provenance can be highly important to good optimization, but casts between pointers and integers generally have to drop provenance information, since assigning integer values provenance prevents other important optimizations. For this reason, having to drop this information to use tag bits on pointers could be somewhat problematic. However, under this proposal, pointer tagging becomes trivial:

const S = struct {
    x: u16,
};

comptime {
    assert(@alignOf(S) == 2);
    assert(@bitSizeOf(*S) == @bitSizeOf(usize) - 1);
}

const PtrAndTag = packed struct(usize) {
    ptr: *S,
    tag: bool,
};

Because ptr is an actual pointer - just with a different in-memory representation - the optimizer can now forward provenance information just as before, potentially making code using this type optimize better. The actual guarantee this adds for the optimizer compared to the @intFromPtr method is that the address the pointer refers to is never modified - just some unrelated bits of memory which happen to live in the same bytes. (Note that provenance preservation here may be difficult to implement in our LLVM backend - I haven't looked into it - but even if this is true, the the option is still available for our own backends.)

Oct 10 '23 21:10 mlugg

How does this affect the ability to take a pointer to the pointer? Consider the following code:


const PtrAndTag = packed struct(usize) {
    ptr: *u16,
    tag: bool,
};

test "pointer to pointer" {
    var s: u16 = 69;
    var foo = PtrAndTag{.ptr = &s, .tag = true};
    var s_p = &foo.ptr;
    @compileLog(@TypeOf(s_p));
}

What is logged here?

Oct 11 '23 20:10 notcancername

How does this affect the ability to take a pointer to the pointer?

Hm, that's a fair question. A few options:

That example logs (assuming 64-bit pointers) *align(8:0:8) *u16 - this would already be the behavior today for a non-byte-aligned field like this. Here, the fact that the outer pointer is a bit-pointer implicitly changes the representation of the inner pointer. This kind of already happens today - a *align(1:0:1) u4 is mutated differently to a *u4 for instance, as the latter is (by my understanding) free to clobber the padding bits (I might be wrong) - but we're kind of doing more here, since the non-packed pointer representation requires zeroing those bits. Having thought about this further, I'm not sure whether I like this.
Amend the original proposal to only use this representation if the pointer has a new attribute (as in *compact_pointer_repr T). We have to be careful when adding a keyword here to not block a common name from being used in the language. We can't really use packed, since then *packed struct {} looks ambiguous. compact would be nice if it weren't for the fact that it's a common term which we probably don't want to reserve just for this.
Amend the original proposal to only use this representation if the pointer specifies some attribute within its align clause. I actually like this idea - the type *u16 would always be represented as today, but you could write *align(packed 2) u16 to use the compacted representation. This gets rid of the slightly tricky detail of this proposal where the representation of the pointer depends on where it's used, and it requires you to explicitly acknowledge the exact alignment (and hence difference in bit-size) of the pointer to opt-in to this behavior.

I quite like the third idea, and am strongly considering amending the original proposal to it. @SpexGuy, I'd be interested in hearing your thoughts.

Example under this amendment:

test {
    comptime assert(@bitSizeOf(*align(2) u8) == @bitSizeOf(usize));
    comptime assert(@bitSizeOf(*align(packed 2) u8) == @bitSizeOf(usize) - 1);
}

const PtrAndTag = packed struct(usize) {
    ptr: *align(packed 2) u16,
    tag: bool,
};
const TagAndPtr = packed struct(usize) {
    tag: bool,
    ptr: *align(packed 2) u16,
};
test {
    var pat: PtrAndTag = undefined;
    var tap: TagAndPtr = undefined;

    // assuming 64-bit pointers

    comptime assert(@TypeOf(&pat.ptr) == *align(8:0:8) *align(packed 2) u16);
    comptime assert(@TypeOf(&pat.tag) == *align(8:63:8) bool);

    comptime assert(@TypeOf(&tap.tag) == *align(8:0:8) bool);
    comptime assert(@TypeOf(&tap.ptr) == *align(8:1:8) *align(packed 2) u16);
}

Oct 21 '23 03:10 mlugg

It's worth considering a way to attach provenance to bits-from-a-pointer that's more general than this proposal. I'm thinking of both pointer compression and NaN tagging, to complete the set, which includes pointer tagging as well, of tricks VM writers like to do with pointers.

The goal statement here might be "given a pointer, a data structure storing that pointer, and a way to restore that pointer to usize, the compiler is aware of the provenance of the restored pointer".

May 21 '24 18:05 mnemnion