zig icon indicating copy to clipboard operation
zig copied to clipboard

Integer coercion consideres non-type bits

Open ikskuh opened this issue 2 years ago • 12 comments

Zig Version

0.10.0-dev.1261+6f986298c

Steps to Reproduce

const std = @import("std");

pub fn main() !void  {
    var value: u4 = undefined;
    std.debug.print("u4={}, usize={}\n", .{
        value,
        @as(usize, value),
    });
}

Expected Behavior

Code prints

u4=10, usize=10

Actual Behavior

Code prints

u4=10, usize=170

It looks like the compiler is assuming that the upper bits of a u4 are 0, which doesn't happen when unitializing to undefined

ikskuh avatar Mar 22 '22 12:03 ikskuh

Seems like "intended" behavior to me. To print an int you have to branch on it, and branching on undefined is undefined behavior, so anything could happen.

Hejsil avatar Mar 22 '22 14:03 Hejsil

Basically, I'm not sure if I understand why you expect the usize and u4 versions to have the same value. undefined has no specific value. Going from u4 -> usize might give something different. The only thing you know here is that the u4 is undefined and so is the usize as well when you do the "conversion"

Hejsil avatar Mar 22 '22 14:03 Hejsil

This isn't actually a problem that comes from undefined. It also comes from very defined values:

const std = @import("std");

pub fn main() !void  {
    var value: [2]u4 = undefined;
    std.mem.set(u8, @ptrCast([*]u8, &value)[0..@sizeOf([2]u4)], 0xFF);
    std.debug.print("u4={}, usize={}\n", .{
        value[0],
        @as(usize, value[0]),
    });
    // prints u4=15, usize=255
}

u4 is the semantic guarantee that this type has 4 bits of data. If stored in other data, the 4 other bits are defined undefined (so "don't care" for the value itself).

When coercing any u4 with non-zero padding bits, it will yield the wrong value, as zig right now assumes they are defined zero. The problem itself is not inherent to undefined, but happens with all memory.

We either have to define padding bits of non-byte aligned integers to always be zero (then this issue wouldn't be a bug) or we have to fix the compiler implementation to ignore these bits (then this issue would be a bug).

Right now, u4 is a basically a u8 with the restriction that you cannot assign a number larger than 15 to it.

cc @SpexGuy

ikskuh avatar Mar 22 '22 15:03 ikskuh

@MasterQ32 both your code snippets are invoking UB with the current stage1 compiler implementation and spec defined by that implementation. Relevant bits from the llvm language reference:

On load IR instructions:

The location of memory pointed to is loaded. If the value being loaded is of scalar type then the number of bytes read does not exceed the minimum number of bytes needed to hold all bits of the type. For example, loading an i24 reads at most three bytes. When loading a value of a type like i20 with a size that is not an integral number of bytes, the result is undefined if the value was not originally written using a store of the same type.

https://llvm.org/docs/LangRef.html#load-instruction

On store IR instructions:

The contents of memory are updated to contain at the location specified by the operand. If is of scalar type then the number of bytes written does not exceed the minimum number of bytes needed to hold all bits of the type. For example, storing an i24 writes at most three bytes. When writing a value of a type like i20 with a size that is not an integral number of bytes, it is unspecified what happens to the extra bits that do not belong to the type, but they will typically be overwritten.

https://llvm.org/docs/LangRef.html#store-instruction

How we define these semantics long term is certainly up for discussion and we may want to move away from LLVM's approach, but I don't think calling this issue a bug is accurate.

ifreund avatar Mar 22 '22 15:03 ifreund

I agree with @ifreund and @Hejsil, this looks like correct undefined behavior. The semantics of undefined, as inherited from C/C++, is that the bits are not only random, but random on every individual access. If x = undefined, then in general it cannot be guaranteed that x == x. This definition carries with it in a number of known foot guns and WTF moments, but it is necessary for certain memory-related optimizations, so it can't be fixed without incurring costs elsewhere.

Related discussion: #7115

zzyxyzz avatar Mar 22 '22 19:03 zzyxyzz

The main issue seems to be that the compiler makes certain assumptions about the (bytes involved in the) representation of u4 when converting to larger integer types, that don't hold in this case. Whether this is a case of unchecked illegal behaviour from undefined, or of an incorrect assumption in the compiler implementation (missing a conversion operation, or missing a language-level assertion forbidding this) is up to interpretation until this scenario is fully standardized.

I've had thoughts about (what I believe to be) the same general issue in issue #6784 . By the terminology I used there, the bit patterns are invalid for the type u4. I think clear rules of where they are illegal behaviour (panicking in debug builds), (and maybe builtins to interface with them?) would be helpful.

rohlem avatar Mar 22 '22 21:03 rohlem

I'm not sure if this is the same issue, but when using zig for binary file formats, sometimes undefined memory is dumped into the file. I don't have a minimal reproduction yet.

Is it possible this the cause? An integer is being casted to a lower size somewhere, but the next byte(s) are being written and that results in undefined memory?

Jarred-Sumner avatar Mar 25 '22 13:03 Jarred-Sumner

@Jarred-Sumner yes, the 4th padding byte of e.g. a u24 would be left undefined currently. That's another decent argument in favor of defining the padding to be zeroed.

That doesn't mean that your bug is necessarily caused by this, it could be caused by writing any structure with padding or a non well defined memory layout to disk. The simplest way to solve that kind of bug would be to to zero the memory before you write to it.

ifreund avatar Mar 25 '22 16:03 ifreund

both your code snippets are invoking UB with the current stage1 compiler implementation and spec defined by that implementation. Relevant bits from the llvm language reference:

The implementation does not define anything, the language spec does, and it says "Converts an integer to another integer while keeping the same numerical value". Casting a u4 to a usize should zero-fill it. I agree that the first example is UB, but the second example is not.

Whether this is a case of unchecked illegal behaviour from undefined, or of an incorrect assumption in the compiler implementation (missing a conversion operation, or missing a language-level assertion forbidding this) is up to interpretation until this scenario is fully standardized.

It has been "standardized" per the language spec, which is IMO the clear and obvious right definition.

jibal avatar Apr 14 '22 23:04 jibal

@jibal There is no language spec currently. There is documentation, but it is incomplete. For things not covered by the documentation, an accepted language proposal, or similar all we have is the stage1 implementation. For this implementation it was decided to rely on llvm for arbitrary bit width integer support and llvm has the semantics I quoted above.

The unsound operation in example 2 I'm talking about is pointer casting *[2]u4 to *[2]u8. Since @bitSizeOf([2]u4) and @bitSizeOf([2]u8) differ, I believe this to be unsound. We don't have a language spec yet though so we can't be proper language lawyers yet.

To repeat what I said in my original comment, how we define these semantics long term is certainly up for discussion and we may want to move away from LLVM's approach.

ifreund avatar Apr 17 '22 16:04 ifreund

This is UB, but there should at the very least exist safety-checks for this, if it is decided to remain illegal behaviour.

ominitay avatar Apr 17 '22 17:04 ominitay

Related: #2301, #1966

ominitay avatar Apr 17 '22 17:04 ominitay