Proposal: add `@memCast` for a class of safe pointer casts

Open mlugg opened this issue 7 months ago • 3 comments

Background

After the merge of #22706 and #23919, you can now @ptrCast to a slice where the operand is any slice or single-item pointer. The idea, broadly speaking, is that if we know the number of bytes the operand points to, we can make our result point to the same number of bytes.

This is incredibly convenient, and simplifies a lot of things; for instance, when you want to get the plain byte slice ([]u8) underlying a slice or a single value, you can do that with a simple @ptrCast.

fn write(buf: []const u8) !void { ... }

const x: u32 = someStuff();
try write(@ptrCast(&x));

const vals: []const u32 = readMoreStuff();
try write(@ptrCast(vals));

However, there's a slight problem here. As the author of this code, we know these @ptrCast calls are safe, because the destination type is a []const u8, so the number of elements is computed (by the compiler in the first case, or at runtime in the second case). However, if the destination type were something else, this would be unsafe. For instance, if write took a *u64, the first call would be unsafe, because when write loaded from buf, it would read more memory than x owns. To a reader, it's not immediately clear that this @ptrCast is safe without checking the signature of write; and such a bug could actually be introduced in a refactor if the signature of write changed.

Let's look at another common case. A useful pattern in Zig is to provide type-safe wrappers around integers using enums. One nice addition here is that when using certain Data-Oriented Design patterns, these enum values can be directly packed into a big array of untyped "miscellaneous data" (we tend to call this "extra data", and it's a pattern used in std.zig.Ast, std.zig.Zir, and many more places throughout the compiler), which has a whole bunch of benefits. However, one of the annoying things here is that if you store a sequence of enum values in this array, it's not trivial to get at them: slicing the array gives you an untyped []u32, so you then need to @ptrCast to a []Air.Inst.Index to use them. This pointer cast, again, looks quite dangerous to the untrained eye -- but it's absolutely not! We're reinterpreting memory in a well-defined way. The @ptrCast is more of an alarm bell than we really need here -- but you can see why, because if the destination type were accidentally changed to something other than a slice (e.g. a *[100]TheEnum, or a [*]Something), the cast would suddenly be "unsafe" in the sense that the result might not be dereferenceable.

The common thread connecting the two cases above is that we don't need the "full power" of @ptrCast. It's true that we might change the pointer "size" (e.g. change from a single-item pointer to a slice) and the element type (e.g. u32 to u8), but what's important is that we are trying to safely reinterpret memory, in a way which is known to be well-defined from the type system alone. This turns out to be a particularly common operation in some cases (after all, we had std.mem.asBytes and std.mem.sliceAsBytes to do this before the new @ptrCast semantics!), so it would be nice if there was a safer way to express it.
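For reference, the older helpers mentioned above can be used roughly like this. This is a minimal sketch, assuming a standard library version that still ships std.mem.asBytes and std.mem.sliceAsBytes; the values are made up for illustration.

```zig
const std = @import("std");

test "byte views via the older std.mem helpers" {
    // Single value: the byte count comes from the pointee type.
    const x: u32 = 0x11223344;
    const x_bytes: []const u8 = std.mem.asBytes(&x);
    try std.testing.expectEqual(@as(usize, @sizeOf(u32)), x_bytes.len);

    // Slice: the byte count is computed from the runtime length.
    const array = [_]u32{ 1, 2, 3 };
    const vals: []const u32 = &array;
    const val_bytes: []const u8 = std.mem.sliceAsBytes(vals);
    try std.testing.expectEqual(vals.len * @sizeOf(u32), val_bytes.len);
}
```

In both cases the result refers to exactly the bytes the operand owned, which is the property the rest of this proposal tries to capture in a builtin.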

Proposal

Introduce another "pointer cast"-style builtin:

@memCast(ptr: anytype) anytype

Like other pointer cast builtins, it infers its return type from the context's Result Type, and can be chained directly with other pointer cast builtins (e.g. @alignCast) to combine effects. However, this builtin is most likely to be used standalone.

The builtin acts as a variant of @ptrCast with the added constraint that the returned pointer (or slice) refers to the exact same amount of memory as the operand pointer (or slice). This means the operand type and result type must both be single-item pointers or slices (they cannot be many-item pointers or C pointers; also disallowed are pointers to anyopaque). If both are slices, there may be a runtime safety check (depending on @sizeOf the respective elements) to ensure that the element count divides neatly.
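The slice-to-slice length rule can be approximated in user code today. The sketch below is only an illustration: `bytesAsU32s` is a hypothetical helper (not part of the proposal), and it returns null in the situation where the proposed builtin would instead hit its safety check.

```zig
const std = @import("std");

/// Hypothetical helper: reinterpret suitably aligned bytes as u32 elements.
fn bytesAsU32s(bytes: []align(@alignOf(u32)) const u8) ?[]const u32 {
    // The result must cover exactly the same bytes as the operand,
    // so the byte length has to divide evenly by the element size.
    if (bytes.len % @sizeOf(u32) != 0) return null;
    const many: [*]const u32 = @ptrCast(bytes.ptr);
    return many[0 .. bytes.len / @sizeOf(u32)];
}

test "element count must divide neatly" {
    const storage: [8]u8 align(@alignOf(u32)) = [_]u8{0} ** 8;

    // 8 bytes reinterpret cleanly as 2 u32 elements.
    const ok = bytesAsU32s(storage[0..8]).?;
    try std.testing.expectEqual(@as(usize, 2), ok.len);

    // 6 bytes would leave 2 bytes over, so the cast is rejected.
    try std.testing.expect(bytesAsU32s(storage[0..6]) == null);
}
```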

The builtin also requires that the result pointer type does not have an element type with an ill-defined layout. For instance, you cannot cast *align(@alignOf(S)) [@sizeOf(S)]u8 to *S with this builtin. The logic here is that such a cast is not "safe", in the sense that it would be Illegal Behavior to use the resulting pointer if the operand does not point to a valid S value. Given #2414, we could allow this cast and introduce a safety check when casting to a type with ill-defined layout, but it seems like the definition given here will be more useful in practice (since Illegal Behavior is kept to a minimum).

When combined, these constraints turn out to give quite nice guarantees! In particular, we have the following:

If the operand to @memCast is a dereferenceable pointer, and if @memCast does not itself hit Safety-Checked Illegal Behavior (due to an incompatible slice length), then it is guaranteed that the returned pointer is also dereferenceable. For slices, this applies to all in-bounds elements.

Okay, that's a bit wordy, because I was trying to be precise. Informally, the idea is: valid pointer in, valid pointer out. That pointer has just reinterpreted the existing memory in a definitely-legal way.

This proposal removes the ability for @ptrCast to ever return a slice; users who want that behavior should be using @memCast instead, because @ptrCast returning a slice always refers to the same number of bytes. So, @ptrCast must now return a non-slice pointer (single-item, many-item, or C). In other words, @ptrCast doesn't give you any safety guarantees in terms of the returned pointer being dereferenceable.

EDIT: this proposal also renames @ptrCast to @elemCast, to make its function clearer: it changes what a pointer "points to". The distinction between @memCast and @elemCast is then that the former returns a pointer which refers to the same region of memory (hence "mem").

Sentinels

One unresolved issue with this proposal is how to handle sentinels. How should an operand type of [:0]u8 be handled? Is the sentinel considered a part of the length of memory being reinterpreted, or no?

On the one hand, it would be consistent with pointer casting today to not include the sentinel in the bytes being reinterpreted. We could allow keeping a sentinel which matched an input one (e.g. allow [:0]u8 to [:0]i8), but nothing more. That seems like the obvious solution at first glance.

However, there's a problem here! Consider now the type *[5:0]u8. Should the sentinel be included in the bytes being reinterpreted? Well, there are arguments both ways:

  • On the one hand, this type is usually considered to be a "more comptime-known" version of [:0]u8; so, it should inherit the behavior of that type, and not include the 0 sentinel in the "pointee bytes".
  • On the other hand, the pointee [5:0]u8 clearly has identical layout to [1][5:0]u8, and so reinterpreting their memory should behave the same; but *[1][5:0]u8 is pretty clearly 6 bytes (the sentinel definitely isn't "special" when you nest it in an aggregate in this way). If you're not convinced by the nested array, extern struct { arr: [5:0]u8 } might be more convincing.

So, I think you could reasonably expect either behavior here -- and getting this wrong could cause subtle bugs. Given that fact, I personally believe the best behavior is to disallow the operand from having a sentinel: the caller must either absorb the sentinel into the slice itself (related: #23023 which adds std.mem.absorbSentinel), or coerce the sentinel away (e.g. coerce [:0]u8 to []u8). For the avoidance of doubt, the exact operand types I propose disallowing are:

  • A slice with a sentinel (like [:s]T)
  • A single-item pointer to an array with a sentinel (like *[n:s]T)

I'm open to discussion on this point.
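The two competing intuitions can be seen in sizes that are observable today. This is a small sketch; `Wrapper` is a name made up for the extern struct mentioned above.

```zig
const std = @import("std");

// Hypothetical name for the extern struct from the discussion above.
const Wrapper = extern struct { arr: [5:0]u8 };

test "where the sentinel counts toward the size" {
    // The array type stores its sentinel, so the pointee is 6 bytes...
    try std.testing.expectEqual(@as(usize, 6), @sizeOf([5:0]u8));
    try std.testing.expectEqual(@as(usize, 6), @sizeOf([1][5:0]u8));
    try std.testing.expectEqual(@as(usize, 6), @sizeOf(Wrapper));

    // ...while the length of a sentinel-terminated slice excludes it.
    const s: [:0]const u8 = "hello";
    try std.testing.expectEqual(@as(usize, 5), s.len);

    // Coercing the sentinel away leaves a plain 5-byte slice,
    // which is one of the two options left to the caller.
    const plain: []const u8 = s;
    try std.testing.expectEqual(@as(usize, 5), plain.len);
}
```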

Examples

var val: u32 = undefined;
const ptr: *i32 = @memCast(&val);

const values: []u32 = getSomeData();
write(@memCast(values)); // where `write` takes a `[]const u8`

var buf: [10]u32 = undefined;
read(@memCast(&buf)); // where `read` takes a `[]u8`

// given:
//   const TypedIndex = enum(u32) { _ };
// we do this:
const type_erased: []u32 = getSomeData();
const typed: []TypedIndex = @memCast(type_erased);
// or, more simply:
const typed2: []TypedIndex = @memCast(getSomeData());

mlugg, May 20 '25 04:05

It's a bit unclear from the proposal: will this example from the beginning work with @reinterpret?

fn write(buf: []const u8) !void { ... }

const x: u32 = someStuff();
try write(@reinterpret(&x));

AndrewKraevskii, May 21 '25 14:05

Yes; what's unclear there? It seems pretty obvious from the definition, plus that's pretty close to the penultimate example.

To be completely clear, @reinterpret from type A to B is permitted under the following conditions:

  • A is a single-item pointer or a slice
  • B is a single-item pointer or a slice
  • A is not a sentinel-terminated slice ([:s]T) or a single-item pointer to a sentinel-terminated array (*[n:s]T)
  • B is not a sentinel-terminated slice ([:s]T) or a single-item pointer to a sentinel-terminated array (*[n:s]T)
  • The element type of B has a well-defined layout
  • The number of bytes pointed to by the operand equals the number of bytes pointed to by the result (if necessary, the result slice length is chosen to fulfil this property)
    • If A is a single-item pointer (or it's a slice but the operand value is comptime-known), this is known at compile-time
    • If A is a slice and B is a single-item pointer, there is a runtime safety check for the exact length of the operand slice
    • If A is a slice and B is a slice, there may be a runtime safety check to ensure the lengths can match up (e.g. []u8 => []u32 needs to check that the operand slice length is divisible by 4)
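The slice to single-item pointer rule from the list above (an exact-length check) can be sketched in user code like this. `bytesAsU32` is a hypothetical helper written for illustration; it returns null where the builtin would hit its runtime safety check.

```zig
const std = @import("std");

/// Hypothetical helper: view a suitably aligned byte slice as one u32.
fn bytesAsU32(bytes: []align(@alignOf(u32)) const u8) ?*const u32 {
    // A single-item result points to exactly @sizeOf(u32) bytes,
    // so the operand slice must have exactly that length.
    if (bytes.len != @sizeOf(u32)) return null;
    const p: *const u32 = @ptrCast(bytes.ptr);
    return p;
}

test "operand length must match the pointee size exactly" {
    const storage: [8]u8 align(@alignOf(u32)) = [_]u8{0} ** 8;

    // Exactly 4 bytes: the cast succeeds and the value is readable.
    const word = bytesAsU32(storage[0..4]).?;
    try std.testing.expectEqual(@as(u32, 0), word.*);

    // 8 bytes is too many for a single u32, so the cast is rejected.
    try std.testing.expect(bytesAsU32(storage[0..8]) == null);
}
```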

mlugg, May 22 '25 07:05

Update: some discussion has shown @reinterpret to be a problematic name. The issue is that different people take this word to mean different things; to myself, the implication is that I am reinterpreting the same data (and so, critically, the same amount of it!), but it seems that many people would understand this in a more general sense as an "unsafe" operation which doesn't care about the amount of memory pointed to. It was also pointed out that there is potential for confusion with @bitCast, since "reinterpret" could feasibly be a name for that operation.

This ambiguity makes @reinterpret an inappropriate name. So, right now, the plan is this:

  • This new builtin is to be called @memCast, taking inspiration from the fact that builtins like @memcpy and @memset use the "mem" term to refer to a specific region of memory
  • The existing @ptrCast builtin is to be renamed to @elemCast

This is still subject to change, but is likely close to what will happen here. Note that the semantics of this builtin seem pretty solid, and are unlikely to change; it's just the name which is being iterated on.

mlugg, May 22 '25 21:05

Until I reached the "Sentinels" section, I was wondering how this is supposed to work with sentinels, since I couldn't think of an obvious way of making it work.

Because of that, disallowing sentinels may be a good idea: allowing them would cause enough confusion that many people would end up needing to consult the language spec, even more so since (hopefully, at least) this isn't something most people will do all the time.

KilianHanich, Aug 21 '25 17:08

Perhaps I am missing something, but the proposal sounds like @memCast is not reversible given the restriction of well-definedness to only the result type, which doesn't make sense to me. Wouldn't we also want the source pointer type to have an element type with a well-defined layout/not have undefined bits? My reading of the proposal is that @as([]u8, @memCast(@as([]u7, slice))) is allowed but @as([]u7, @memCast(@as([]u8, slice))) is not[^1], but in the former it's not clear what I can expect of the high bit of each u8.

[^1]: Maybe this is allowed and I misunderstand what well-defined layout means for integers?

dweiller, Oct 29 '25 04:10

> in the former it's not clear what I can expect of the high bit of each u8.

Responding to this first since I think it's an important misunderstanding. Not only is the high bit of the u8 implementation-defined, but the entire value is. When representing a u7 in memory, a Zig implementation may choose to use the high 7 bits for storage, or the low 7, or the low 3 and top 4 with an unspecified bit in the middle, or use some ridiculously convoluted mapping function where @as(u7, 0) is represented as @as(u8, 123) for no good reason. This is essentially what it means for a type to have ill-defined layout; it is incorrect to assume anything about its representation (other than that it is stored in @sizeOf(T) many bytes).
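As a small illustration of the "stored in @sizeOf(T) many bytes" point: a u7 occupies a whole byte while carrying only 7 value bits. The sizes asserted below are an assumption about current implementations and targets, not something the language guarantees, and they say nothing about which bit patterns of that byte are valid.

```zig
const std = @import("std");

test "u7 has fewer value bits than storage bits" {
    // Assumption: on current targets a u7 is stored in one byte.
    try std.testing.expectEqual(@as(usize, 1), @sizeOf(u7));
    // Only 7 of those 8 storage bits carry the value.
    try std.testing.expectEqual(@as(usize, 7), @bitSizeOf(u7));
}
```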

> the proposal sounds like @memCast is not reversible

That's correct; the idea behind this restriction is to make the operation safer. For instance, any block of memory can be legally interpreted as a []u8, regardless of its current contents. (The bytes you read might not be meaningful, but it's legal to read them!). However, because u7 has ill-defined layout, it is only correct to interpret a block of memory as a []u7 if it was initialized in that way. Even this code is IB:

const buf = try gpa.alloc(u8, 100);
@memset(buf, 0);
const casted: []u7 = @ptrCast(buf);
_ = casted[0];

...because, as explained above, 0x00 is not necessarily a valid bit pattern for a u7; it's entirely the implementation's choice.

> Wouldn't we also want the source pointer type to have an element type with a well-defined layout/not have undefined bits?

Nope! Casting []u7 to []u8 is completely fine and accessing that []u8 is never IB. If that []u8 is never stored to, it's also then legal to cast that back to a []u7 and load from that slice. This is important because some code wants to be able to treat data as the underlying bytes; for instance, take a look at the implementation of std.mem.swap.


The line here is certainly a bit fuzzy, because "well-defined layout" doesn't necessarily mean all bit patterns are valid; for instance (assuming null pointers are 0) 0x0000_0000_0000_0000 is an invalid value for a *u8 even though *u8 (as with all pointer types) has well-defined layout. However, "ill-defined layout" has the distinction that the only legal way to initialize that memory is by going through the type you're casting to.

There is definitely an argument to be made that it would be far more valuable to have the "same number of bytes" guarantee in cases with ill-defined layout, in which case the restriction should be eliminated entirely. In fact, revisiting this now, I think I would probably lean that way myself.

mlugg, Oct 29 '25 12:10

> Responding to this first since I think it's an important misunderstanding.

Thank you for the clarification - I wasn't aware that integers did not have a well-defined layout, though I'm also not sure why I thought they all did. I even checked the langref the other day - we should add an explicit mention of which primitive types have a well-defined layout and which don't.

Sorry if anything below sounds overly pedantic or this is the wrong place to ask (maybe we should talk on zulip instead?), I just want to make sure I'm understanding things precisely.

> This is essentially what it means for a type to have ill-defined layout; it is incorrect to assume anything about its representation (other than that it is stored in @sizeOf(T) many bytes).

> If that []u8 is never stored to, it's also then legal to cast that back to a []u7 and load from that slice. This is important because some code wants to be able to treat data as the underlying bytes; for instance, take a look at the implementation of std.mem.swap.

Something seems a bit incongruent to me here. Do you mean 'if and only if the []u8 is never stored to' or (as I assume) only a one-way implication? If you mean the former, I assume there is a distinction here between reading from a []u7 you already had, and casting []u8 to []u7 to read with, due to some pointer provenance stuff. From the definition of std.mem.swap I take it that there are some additional assumptions we can make about the representation of types with ill-defined layout. These are:

  1. If you take a pointer to a T, the way the T pointed to is represented is consistent across the Zig compilation; this means that validity of bit patterns doesn't change across different instances of *T/[]T and if you do something that writes a valid bit pattern there you can read it via a *T/[]T. To be explicit, if you write a valid bit pattern for a T by casting a *T/[]T to *U/[]U (or maybe just starting with a *U/[]U?), where U has well-defined layout and maybe some other requirements, and store a valid T bit pattern via the *U/[]U, then you can later read the value(s) through a *T/[]T (though maybe you can't cast that *U/[]U to a *T/[]T).
  2. If you take a pointer to a T, this pointer doesn't alias another value. For example, if you take a pointer to a u4 the compiler can't have packed it and another u4 into a u8 in some SWAR optimisation pass, it must have at least unpacked it again before giving you the pointer.

> However, "ill-defined layout" has the distinction that the only legal way to initialize that memory is by going through the type you're casting to.

I guess this is just a way to ask if assumption (1) above is true: does 'going through' include casting from ill-defined layout to something else, i.e. is this valid:

const data: [3]u7 = .{ 1, 2, 3 };
var u7s: [3]u7 = undefined; // destination array of u7 values, written below through its byte view
const u7_data_repr: *const [3]u8 = @memCast(&data); // or a `@ptrCast`, assuming we know `@sizeOf(u7) == 1`
const u7s_repr: *[3]u8 = @memCast(&u7s); // or a `@ptrCast`
@memcpy(u7s_repr, u7_data_repr);
// use `u7s` and `&u7s`...

dweiller, Oct 30 '25 05:10