allow integer types to be any range
Zig already has ranged integer types; however, the range is required to be a signed or unsigned power of two. This proposal is for generalizing them further, to allow any arbitrary range.
comptime {
    assert(i32 == @Int(-1 << 31, 1 << 31));
    assert(u32 == @Int(0, 1 << 32));
    assert(u0 == @Int(0, 1));
    assert(noreturn == @Int(0, 0));
}
Let's consider some reasons to do this:
One common practice for C developers is to use -1 or MAX_UINT32 (and related) constants as an in-band indicator of metadata. For example, the stage1 compiler uses a size_t field to indicate the ABI size of a type, but the value SIZE_MAX is used to indicate that the size is not yet computed.
In Zig we want people to use Optionals for this, but there's a catch: the in-band special value uses less memory for the type. In Zig on 64-bit targets, @sizeOf(usize) == 8 and @sizeOf(?usize) == 16. That's a huge cost to pay for something that could take up 0 bits of information, if you are willing to give up a single value inside the range of a usize.
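For concreteness, here is the status-quo cost as a quick check (a sketch assuming a 64-bit target, where these sizes hold):
const assert = @import("std").debug.assert;

comptime {
    // The null flag cannot live inside the payload, so it forces a second word.
    assert(@sizeOf(usize) == 8);
    assert(@sizeOf(?usize) == 16);
}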
With ranged integers, this could be made type-safe:
const AbiSize = @Int(0, (1 << usize.bit_count) - 1);
const MyType = struct {
    abi_size: ?AbiSize,
};
var my_type: MyType = undefined;

test "catching a bug" {
    var other_thing: usize = 1234;
    my_type.abi_size = other_thing; // error: expected @Int(0, 18446744073709551615), found usize
}
Now, not only do we have the Optionals feature of zig protecting against accidentally using a very large integer when it is supposed to indicate null, but we also have the compile error helping out with range checks. One can choose to deal with the larger ranged value by handling the possibility, and returning an error, or with @intCast, which inserts a handy safety check.
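A sketch of those two options, reusing the AbiSize and MyType definitions from above (hypothetical, since ranged types do not exist yet; setAbiSizeChecked/setAbiSizeUnchecked are made-up names, and @intCast into a ranged type is assumed to work as the proposal describes):
const std = @import("std");

fn setAbiSizeChecked(t: *MyType, n: usize) error{Overflow}!void {
    // Option 1: handle the out-of-range value explicitly and return an error.
    // maxInt(usize) is the only value AbiSize excludes.
    if (n == std.math.maxInt(usize)) return error.Overflow;
    t.abi_size = @intCast(AbiSize, n);
}

fn setAbiSizeUnchecked(t: *MyType, n: usize) void {
    // Option 2: rely on @intCast's safety check in safe build modes.
    t.abi_size = @intCast(AbiSize, n);
}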
How about if there are 2 special values rather than 1?
const N = union(enum) {
    special1,
    special2,
    normal: @Int(0, (1 << 32) - 2),
};
Here, the size of N would be 4 bytes.
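For contrast, the closest status-quo equivalent pays for a separate tag byte plus padding (a sketch; StatusQuoN is a made-up name, and the exact size is ABI-dependent, 8 bytes on typical targets):
const assert = @import("std").debug.assert;

const StatusQuoN = union(enum) {
    special1,
    special2,
    normal: u32,
};

comptime {
    // 4 payload bytes + 1 tag byte, padded out to the payload's 4-byte alignment.
    assert(@sizeOf(StatusQuoN) == 8);
}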
Let's consider another example, with enums.
Enums allow defining a set of possible values for a type:
const E = enum {
    one,
    two,
    three,
};
There are 3 possible values of this type, so Zig chooses to use u2 for the tag type. It will require 1 byte to represent it, wasting 6 bits. If you wrap it in an optional, that will be 16 bits to represent something that, according to information theory, requires only 2 bits. And Zig's hands are tied; because currently each field requires ABI alignment, each byte is necessary.
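As a status-quo check (a sketch; these sizes hold with current compilers but could change between versions):
const assert = @import("std").debug.assert;

const E = enum { one, two, three };

comptime {
    assert(@sizeOf(E) == 1); // the u2 tag is rounded up to a whole byte
    assert(@sizeOf(?E) == 2); // plus another whole byte for the is_null flag
}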
If #3802 is accepted and implemented, and the is_null bit of optionals becomes align(0), then ?E can remain 1 byte, and ?E in a struct with align(0) will take up 3 bits.
However, consider if the enum was allowed to choose a ranged integer type. It would choose @Int(0, 3). Wrapped in an optional, it actually could choose to use the integer value 3 as the is_null bit. Then ?E in a struct will take up 2 bits.
Again, assuming #3802 is implemented, Zig would even be able to "flatten" several enums into the same integer:
const Mode = enum { // 2-bit tag type
    Debug,
    ReleaseSafe,
    ReleaseFast,
    ReleaseSmall,
};
const Endian = enum { // 1-bit tag type
    big,
    little,
};
pub const AtomicOrder = enum { // 3-bit tag type
    Unordered,
    Monotonic,
    Acquire,
    Release,
    AcqRel,
    SeqCst,
};
pub const AtomicRmwOp = enum { // 4-bit tag type
    Xchg,
    Add,
    Sub,
    And,
    Nand,
    Or,
    Xor,
    Max,
    Min,
};
const MyFancyType = struct {
    mode: Mode align(0),
    endian: Endian align(0),
    atomic_order: AtomicOrder align(0),
    op: AtomicRmwOp align(0),
};
If you add up all the bits of the tag types, it comes out to 10, meaning that the size of MyFancyType would have to be 2 bytes. However, with ranged integers as tag types, zig would be able to flatten out all the enum tag values into one byte. In fact there are only 21 total tags here, leaving room for 235 more total tags before MyFancyType would have to gain another byte of size.
This proposal would solve #747. Peer type resolution of comptime ints would produce a ranged integer:
export fn foo(b: bool) void {
    // master branch: error: cannot store runtime value in type 'comptime_int'
    const x = if (b) -10 else 100;
    // proposal: @typeOf(x) == @Int(-10, 101)
}
With optional pointers, Zig has an optimization to use the zero address as the null value. The allowzero property can be used to indicate that the address 0 is valid. This is effectively treating the address as a ranged integer type! This optimization for optional pointers could now be described in userland types:
const PointerAddress = @Int(1, 1 << usize.bit_count);
const Pointer = ?PointerAddress;
comptime {
    assert(@sizeOf(PointerAddress) == @sizeOf(usize));
    assert(@sizeOf(Pointer) == @sizeOf(usize));
}
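Both halves of that observation can already be checked today (a sketch for a 64-bit target):
const assert = @import("std").debug.assert;

comptime {
    // Address 0 is reserved as null, so the optional costs nothing extra:
    assert(@sizeOf(?*i32) == @sizeOf(usize));
    // With allowzero, 0 is a valid address, so the null flag needs its own storage:
    assert(@sizeOf(?*allowzero i32) > @sizeOf(usize));
}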
One possible extension to this proposal would be to allow pointer types to override the address integer type. Rather than allowzero which is single purpose, they could do something like this:
comptime {
    assert(*addrtype(usize) i32 == *allowzero i32);
    assert(*addrtype(@Int(1, 1 << usize.bit_count)) i32 == *i32);
}
This would also introduce type-safety to using more than just 0x0 as a special pointer address, which is perfectly acceptable on most hosted operating systems and typically set up in freestanding environments as well. Typically the entire first page of memory is unmapped, and often the virtual address space is limited to 48 bits, making @Int(os.page_size, 1 << 48) a good default address type for pointers on many targets! Combining this with the fact that pointers also have alignment bits to play with, this would give Zig's type system the ability to pack a lot of data into pointers which are annotated with align(0).
What about two's complement wrapping math operations? Two's complement only works on powers-of-two integer types. Wrapping math operations would not be allowed on non-power-of-two integer types. Compile error.
I'm worried this might result in a code explosion as different types are passed to a generic.
Idea: input to a generic could be "marked" for which properties are used.
e.g. if I have a pure-storage generic, like ArrayList, then the range of the members doesn't really matter for the implementation of that generic: the only property it cares about at runtime is the size of the type, the rest is only useful at comptime.
@andrewrk I cannot understand how you arrived at the required size of the MyFancyType struct being 'less than one byte ... leaving room for 235 more tags'.
Since a struct is a product type, MyFancyType has a total of 4 * 2 * 6 * 9 = 432 possible states, requiring at least ⌈log₂(432)⌉ = 9 bits to store them.
@Rocknest my math is wrong, thanks for the correction :+1:
wouldn't it be @Type(TypeInfo.Int{...}) where TypeInfo.Int looks like:
pub const Int = struct {
    is_signed: bool,
    bits: comptime_int,
    from: comptime_int,
    to: comptime_int,
};
because of the @Type builtin?
Yes, except only from and to are needed - the rest is redundant information. I'd call them start and end to hint that the upper bound is exclusive. Good point, I forgot that @Int is deprecated.
@Int(0, 1) == @Type(.{ .Int = .{ .start = 0, .end = 1}})
@Type is a bit more verbose though, for such a fundamental type as an integer. I'm not 100% sold on @Type yet.
should this be marked a breaking proposal as well?
Yes, except only from and to are needed - the rest is redundant information.
You still need the bits. e.g. I might have an API that takes a usize but only allows values between 1 and 4096
- 1 is >= 0, so it's unsigned
- log2(4096) = 12, so it's 12 bits
You're thinking of a different, already planned issue, which is zig keeping track of additional information beyond the type, for expressions and const values, which will allow, for example, this test to pass:
test "runtime value hints" {
var a: u64 = 99999;
// `a` is runtime known because it is `var`.
const b: u64 = a % 256;
const c: u8 = b; // allowed without @intCast because `b` has a runtime value hint
}
i don't think so. take for example the WebSocket protocol, the opcodes are always 4 bits, but 0xB to 0xF are reserved for future use, so effectively they are not allowed.
i don't think so. take for example the WebSocket protocol, the opcodes are always 4 bits, but 0xB to 0xF are reserved for future use, so effectively they are not allowed.
That sounds like a usecase for a non-exhaustive enum:
enum(u4) {
    Continuation = 0,
    Text = 1,
    Binary = 2,
    Close = 8,
    Ping = 9,
    Pong = 10,
    _,
}
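A sketch of how the reserved values would be handled with such an enum (Opcode and isReserved are made-up names for illustration):
const Opcode = enum(u4) {
    Continuation = 0,
    Text = 1,
    Binary = 2,
    Close = 8,
    Ping = 9,
    Pong = 10,
    _,
};

fn isReserved(op: Opcode) bool {
    return switch (op) {
        .Continuation, .Text, .Binary, .Close, .Ping, .Pong => false,
        // every reserved value (0x3..0x7, 0xB..0xF) lands here; it is still a valid Opcode
        _ => true,
    };
}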
This is very nice and fancy, but I would like to note that it might have a negative effect on performance. This is definitely more efficient memory-wise, so for storing it (on disk) this is perfect. However, for passing it around in code in this format, it would mean that every time there is a calculation, bit fiddling comes into play to extract the value and put it back.
Just something to think about.
However, for passing it around in code in this format, it would mean that every time there is a calculation, bit fiddling comes into play to extract the value and put it back.
This is not correct, except with align(0) fields (see #3802). Zig already has arbitrary sized integers. If you use i3, for example, @sizeOf(i3) == 1 because @alignOf(i3) == 1.
Also in many cases if you were to use align(0) fields, the smaller memory use on the cache lines would cause more performance gains than lost to the extra bit shifting.
Yes, sorry I was thinking about align(0) cases. For the other cases this proposal only adds range checks, right?
Also in many cases if you were to use align(0) fields, the smaller memory use on the cache lines would cause more performance gains than lost to the extra bit shifting.
I'm a bit sceptical about this as I've had experience of the opposite. It probably depends a lot on the circumstances. Which is why I don't think a compiler should be sneaky and perform bit fiddling behind your back. However if it just applies to align(0), it is something which you ask for so it's fine.
Somehow I feel though that this is more something to do with optionals and other constructs which have more information than arbitrary numbers and enums, right?
It's more like in this example:
const N = union(enum) {
    special1,
    special2,
    normal: @Int(0, (1 << 32) - 2),
};
Where each case has a special number or range of numbers, such as:
const N = union(enum) {
    special1 = max32,
    special2 = max32 - 1,
    normal = 0..max32 - 2,
};
And I don't think it goes for your other example, where you have four, what to me seem independent, enums:
const MyFancyType = struct { mode: Mode align(0), endian: Endian align(0), atomic_order: AtomicOrder align(0), op: AtomicRmwOp align(0), };
If you add up all the bits of the tag types, it comes out to 10, meaning that the size of MyFancyType would have to be 2 bytes. However, with ranged integers as tag types, zig would be able to flatten out all the enum tag values into one byte. In fact there are only 21 total tags here, leaving room for 235 more total tags before MyFancyType would have to gain another byte of size.
Aren't there 4 * 2 * 6 * 9 = 432 possible combinations of those four enums? Which means you still need 9 bits (or two bytes)?
In general I do think it would be great if the optionals can work more generically. Not having to write all the -1 checks would be cool. Especially since sometimes people work with indices instead of pointers and the special value trick is quite common.
Note that integer ranges have a long history in some other languages like Pascal and Ada. There they are heavily used and can be used as ranges for arrays. This leads to some nice code where you can use ranged for loops without any possibility of going out of bounds.
It has been decades since I used Pascal or Ada in anger but even back then I seem to remember that most run time checks could be dropped as the compiler could prove that the use was safe. A very naive compiler would do the checks all the time, but existing compilers show that the possible overhead is quite low. As long as the programmer is aware (such as choosing different arithmetic with overflow wrapping) that this could have a cost, I think this is fine.
The driving case here: more compact representation of enums and optionals is a very strong one IMHO. Having just optionals use the same space as non-optional types would be a huge argument to get programmers to use them. The cache argument is a strong one too. A cache miss can cost 25-50 clock cycles (if you hit L2, worse if you fall all the way to RAM) and you can run a lot of bit twiddling code in that time.
I like the core of this proposal, but there are a few things I'm worried about.
First, I feel like the optimizations are too implicit. You have to hope that you left the correct number of values open. If you do it wrong, there is no indication that you have messed up, you just have bigger types than you expected. As code changes over time, it's easy for additional values to be added to enums that are sensitive to this, which may break these optimizations without warning. If I'm writing Zig and my goal is to remove the overhead of the optional, I would want a guarantee from the compiler that my code has received the optimization, or a compile error. The original suggestion is difficult to think about when optional optimization is your end goal, and it fails silently when the optimization can't be applied.
I'm worried that, because it's so implicit, general zig advice may become "put explicit bounds on every integer type you use, so that the compiler can optimize it". This could create a lot of additional cognitive load when reading and writing zig code. Does ?@Int(0, 262144) receive the optimization? It doesn't, but it's not immediately apparent unless you happen to know off the top of your head that 262144 is 1<<18.
It also seems to me like working with these types would in many cases require @intCast peppered all over the place. If you want to iterate over the range, you need to be able to test when you've gone past the end. Which means you need to use a larger integer type for the iterator and cast it inside the loop. If you want to do math between ranged ints and store the result in a ranged int, you will very probably need a range cast. Those casts can harden the code, making it more difficult to change the ranges if your requirements change. They also have the potential to make generic code that uses ranged ints very difficult to write correctly. And if you get the range wrong, it can't be a compile error because it's a cast. It can only manifest as checked UB. How much mental bandwidth are you willing to spend on calculating the ranges of your intermediate values?
Finally, while applying range information to type resolution would solve #747, it also introduces a problem when assigning directly. var x = 4; would now compile, but set the type of x to @Int(4, 5), which is not a very useful type for a var. This would then move the compile failure to the line where x is modified, with a pretty cryptic error message (or worse, checked UB, depending on how arithmetic between ranged types resolves).
There is a precedent for an opaque optional optimization in Zig: ?*. This generates a sentinel value zero, and I think we can all agree that this is a Good Idea. Where I think the above proposal differs is that this new optimization is dependent on picking range constants that obey properties that are not immediately obvious. I think that we could go a long way with a simpler guarantee like "If there are unused bits in an integer type in a struct, an optional of that integer type is guaranteed to put a valid flag in the unused bits.". This is mechanically the same as making the flag in optionals align(0), but (at least for me) is a much easier wording to load into Brain RAM and work with. We could then add explicit mechanisms to trigger the other optimizations suggested above if you need them.
Here are some off-the-cuff suggestions. There's definitely room for improvement for these, but I think they do a good job of showing the direction I'm looking.
Range-checked Explicit Integers
These are the same as the original proposal, but the integer bit-width is specified by the programmer. For example, @RangedInt(u32, 5, 1<<32-1) or @RangedInt(comptime_int, 10, 101). If the range cannot be represented by the underlying type, it is a compile error. This provides a path for type resolution of arithmetic on ranged integers that won't introduce surprises if ranges change. The type resolution of addition of these values is the result type of addition of the explicit integer types of the two operands, restricted to the maximum range of addition of the two operands, saturated to the maximum representable range of the resolved bit type. For convenience, we could supply a library function SmallestRangedInt(comptime min_value: comptime_int, comptime exclusive_max_value: comptime_int) type. But I don't think an inferred type should be the language default.
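A worked example of that resolution rule, written in the hypothetical @RangedInt syntax from above (upper bounds exclusive):
const A = @RangedInt(u8, 10, 20);   // values 10..19
const B = @RangedInt(u8, 100, 250); // values 100..249

// A + B:
//   underlying types:  u8 + u8 resolves to u8
//   range of the sum:  110..268
//   saturated to u8:   @RangedInt(u8, 110, 256), i.e. values 110..255
// As I read the rule, an actual sum above 255 overflows exactly as u8 + u8 does today,
// so changing the ranges of A or B never silently changes the result type's backing integer.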
Integer Unions:
An integer union (or alternatively, "partition" or "ranged enum") would be a union of non-overlapping ranged ints. It is guaranteed to be the size of its backing integer type.
const Foo = partition(u32) {
    normal = 0..(1 << 32) - 2,
    special1 = (1 << 32) - 2,
    special2 = (1 << 32) - 1,
};
This is very similar to the union example in the original proposal, but with this if the range is specified incorrectly or a new value is added, there would be a compile error instead of a drastic change in representation. If the integer type is not specified, it would be inferred from the ranges as the smallest type that can represent the range. But the ability to specify the bit width and receive an error if it can't be represented is valuable. Ranged unions could also support the '_' option from extensible unions.
Explicit Sentinel Optionals:
If an optional type could be specially constructed by specifying a sentinel, this would solve sentinel checks in a way that guarantees the optimization. Something like @OptionalSentinel(comptime IntType: type, comptime sentinel: meta.BackingIntegerType(IntType)). This documents the sentinel value, which makes debugging much less painful. It also guarantees that the optimization is made, so you don't need to worry that changes to the underlying int range will reintroduce the optional flag bit. IntType could be an integer, ranged integer, enum, or integer union.
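A hypothetical usage sketch of @OptionalSentinel, revisiting the ABI-size example from the top of the thread:
const std = @import("std");

// null is spelled maxInt(u64) in memory, so the optional costs no extra bits:
const AbiSize = @OptionalSentinel(u64, std.math.maxInt(u64));

const MyType = struct {
    abi_size: AbiSize,
};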
None of these counter-proposals are perfect, but I like that they take more direct control over the optimization. E.g. instead of thinking "I'd like to optimize this optional to use a sentinel value, I should put ranges this int type everywhere it's used", I can instead think "I'd like to optimize this optional to use a sentinel value, I should tell zig to optimize this optional to use a sentinel value.".
I think that we could go a long way with a simpler guarantee like "If there are unused bits in an integer type in a struct, an optional of that integer type is guaranteed to put a valid flag in the unused bits.".
Sorry, small nitpick. It's not actually bits, it's unused values. For example in the pointer case 0 isn't a specific bit, it's just a special value. It's a slight difference. Same with the common -1/max-uint, it's not just the last bit, it's actually one value. But I see from your examples/proposals that you understood.
I agree with you that it would be cleaner if it's more explicit.
Sorry, yeah, my wording was unclear. I'm trying to convey that the simpler guarantee of a bit for optionals of non-power-of-two integers is easier to guarantee and think about than an inferred sentinel, and that something like @OptionalSentinel might be better for when a sentinel value is needed.
If we make it so an optional ranged type interprets any out-of-range value as null, we could do things like this:
// Zero-cost wrapper for an ABI that uses negative values as error codes
const Result = fn (Int: Type) Type {
    return union {
        // Nullable limits to avoid generating unnecessary bounds checks
        success: @Ranged(Int, 0, null),
        failure: @Ranged(Int, null, -1),
    };
};
Define canonical null as the most negative value above the range if it exists, else the most positive value below the range. That way, even if a type is multiple-optionalised, the span of non-null values at any level is always contiguous, and does not change with type widening/narrowing in the most common cases.
Possible optimisation: allow nesting of @Ranged optionals, and only allow the ? type operator to be used on ranged types (pointer types are implicitly ranged), like so:
// Can be optimised to u32 with tag
// -- don't have to set global rules
const OptionalU32 = ?@Ranged(u32, null, maxInt(u32));
// Don't need another range -- canonical null of
// OptionalU32 is inner null, anything else is outer null
const OptionalOptionalU32 = ?OptionalU32;
// ;)
const Bool = ?@Ranged(u1, 1, null);
const True: Bool = 1;
const False: Bool = null;
const And = fn (T: Type, a: ?T, b: ?T) ?T {
    return if (a) { b } else { null };
};
const Or = fn (T: Type, a: ?T, b: ?T) ?T {
    return a orelse b;
};
const Not = fn (a: Bool) Bool {
    return if (a) { null } else { 1 };
};
Pro
- More explicit in intention
- Optionals can now be used in packed objects
- Booleans don't have to be primitive ;-)

Con
- Optionalising an arbitrary calltime-known type is now impossible, though I can't think of a use case for this that isn't possible another way
You could use INT_MIN for null, as it already has some pretty funny properties. This still requires a new type (that interestingly would be agnostic to twos-complement and ones-complement), but it is far less invasive than this proposal, which IMHO belongs in an optimizing compiler and not in Zig.
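That idea is already expressible in userland today; a minimal sketch (OptionalI32 is a made-up wrapper), which is roughly what the proposal would have the compiler do automatically:
const std = @import("std");

const OptionalI32 = struct {
    raw: i32 = std.math.minInt(i32), // INT_MIN doubles as null

    fn get(self: OptionalI32) ?i32 {
        return if (self.raw == std.math.minInt(i32)) null else self.raw;
    }
};

comptime {
    // No extra flag byte, unlike ?i32:
    std.debug.assert(@sizeOf(OptionalI32) == @sizeOf(i32));
}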
We already have a ranged integer type. It is called enum. This proposal creates far more problems than it solves.
An enum is not a ranged integer type. It does not expose any arithmetic properties, but only distinct values that have no relation to each other except for being in the same set.
The virtual machine that Zig inherited from LLVM, which inherited it from C, is a computer that thinks in bits, and this is successful because that is how a computer thinks. If you are interpreting a language rather than modeling a computer, then you should use language theory, and enum is a building block for that[1], and then translate the language into the language of the computer, which thinks in bits. This proposal doesn't solve any of these problems; it only creates problems, because it is a premature optimization of an ill-understood (and non-existent) problem.
[1] but you also need this https://codasip.com/2019/09/04/better-benchmarks-through-compiler-optimizations-codasip-jump-threading/ as I mentioned here https://bugs.llvm.org/show_bug.cgi?id=39043 and is necessary to port re2c to zig. (we know this is possible, so it is not a problem for the zig language, only an implementation detail)
If this is not just about bit-packing, but merging arbitrary value ranges, then extracting the original values would not be a matter of a simple shift and mask, but a division, modulo and subtraction, making it a dubious "optimization" if access is otherwise cache-optimal.
If this happens, it should at least be more explicit, as in SpexGuy's Integer Unions.
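To make that cost concrete, here is the merged-range extraction written out by hand (a sketch; pack/unpackMode/unpackOrder are made-up helpers, builtin names as used by compilers of this era):
const Mode = enum(u8) { Debug, ReleaseSafe, ReleaseFast, ReleaseSmall }; // 4 states
const AtomicOrder = enum(u8) { Unordered, Monotonic, Acquire, Release, AcqRel, SeqCst }; // 6 states

fn pack(mode: Mode, order: AtomicOrder) u8 {
    // 4 * 6 = 24 states fit into one byte, but only via multiplication...
    return @enumToInt(mode) * 6 + @enumToInt(order);
}

fn unpackMode(v: u8) Mode {
    return @intToEnum(Mode, v / 6); // ...so getting a field back out is a division...
}

fn unpackOrder(v: u8) AtomicOrder {
    return @intToEnum(AtomicOrder, v % 6); // ...or a modulo, not a shift and mask.
}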
By the way, with this proposal Zig could remove the comptime_int type. Any integer type that only contains a single value is equivalent to today's comptime_int type. i.e. @TypeOf(123) == @Int(123, 124). Just like the type u0, any integer type with a single value takes no runtime bits to store and is always known at comptime.
The report for C suggests implementation-defined (lol, not portable) definitions: https://resources.sei.cmu.edu/asset_files/TechnicalNote/2007_004_001_14846.pdf Though, as usual for C stuff, the report is underspecified: it does not elaborate on the "performance overhead" vs "flexibility" tradeoff, and it does not constrain the "runtime constraint handler" to rule out UB or guarantee portability within the "runtime constraints". Nevertheless the document is a good starting point for understanding potential use cases. And the terminology (1. range-min, 2. range-max, 3. storage policy, 4. modwrap semantics, 5. saturation semantics) is relatively good for discussing the feature.
Derived from that, a list of what is still incomplete in the current proposals:
- Currently modwrap and saturation semantics are completely missing. modwrap should likely be a compile error.
- I have not seen arguments why one should not reuse struct and union and enum with a prefix, i.e. ranged Int struct = {..} or some better keyword.
- Arithmetic semantics are missing as well, especially for union, where the used tag defines the wrapping semantics of addition on multiple overlapping ranges.
- When one wants to be picky about operator overloading: even for enums the used tag decides the bound checks, which can be interpreted as bounds-check overloading (it is not trivial to see which bound is being applied). From my point of view this goes against the Zen of Zig: it is not evident from the surrounding code what bound check is applied / what the code does.
- Especially nested ranges are dirty, as the cost of the checks is not immediately evident from looking at the code. On the other hand, forbidding nesting is not a very general solution.
Personally I think stuff like "merging enums together" and "merging integers together" should be done tool-assisted and code-transparent, because arguing about that kind of indirection in terms of time and memory is hard and even harder on debugging.
This requires sub-bit calculation of number of bits required to store a struct. What would the ABI be?
0 .. 2^8 - 1 takes up 8 bits (u8)
1 .. 2^8 takes up... how many bits?
Is losing 1 bit not a solution? ?u31 is fine for expressing length of array, probably.
I've been writing some code recently that I've felt could use range validation so I went looking to see if someone had already written a proposal. I think defining int ranges (and floats too!) could provide stronger runtime checking, moving errors from runtime to compiletime, and better performance. So here are my takes from a "just some person writing code in Zig" perspective.
I definitely think the type and the range should be explicitly defined, like in SpexGuy's example. One reason is that it would be hard to increase the range of something without the risk of breaking serialization (save/load).
I would also like to define ranges on floats. Semantically this makes sense, especially since IMHO the primary use case for this feature is not reducing bits, but rather getting errors as soon as possible when things are out of range.
In my current code, I use variables that work better when being signed. Positions need to be math'ed on in a way that can cause negative numbers, for example abs when calculating distance. They also need to be used as indices into slices. This causes me to either choose the most natural fit, and have a lot of casts, or use just signed OR unsigned which... also requires a lot of casts.
I would like to be able to say "this is signed, but it is also >= 0", thereby allowing it to be used as an index.
var a : i8(0..) = getV();
var b : u32 = a; // coerces without @as/@intCast
var c = mySlice[a]; // totes cool
Something that would probably be useful:
var a : i32(0..10) = getV();
var b : i32(0..10) = getV();
var c : i32(0..50) = a * b; // Compile error "ranges not guaranteed bla bla"
var d : i32(0..50) = @asRange(0..50, a * b); // Runtime error if indeed a * b was too big
A few wild ideas:
A:
fn main() {
    var array = [_]i32{ 1, 2, 3, 4 };
    var index1 : i32 = getI();
    var index2 : i32(0..) = getI();
    var v1 = array[index1]; // Runtime error, no change from now
    var v2 = array[index2]; // Compile error
}
B:
fn main() {
    var players : [](100)u8 = getPlayers(); // Slice of known max len
    var index1 : i32 = getI();
    var index2 : i32(0..) = getI();
    var v1 = players[index1]; // Compile error, "must specify range for slice of known size"
    var v2 = players[index2]; // Compile error, "fix yo ranges"
}
C:
fn lol(foo: i32(0..bar.len/2), bar: []u8) i32(bar.len/2..bar.len) {
    _ = bar[foo];
    return foo * 3; // Compile error
}

fn main() {
    var array = [_]i32{ 1, 2, 3, 4 };
    var index = lol(100, array); // compile error
    print(array[index]);
    var some_slice = getSlice();
    lol(100, some_slice); // runtime error when some_slice.len < 100
}
D:
const MyBuf = struct {
data: []u8;
var header_ends_here_index : u32([email protected]);
players: []Player,
goalies: []i32([email protected]); // zero never valid player id
}
Zig has this notion of reducing typing friction for things that makes code safer/better. With the potential usefulness of this, having specific syntax for it doesn't seem totally crazy. Compare the writeability and readability of:
var a : i32(-1..) = 1;
var b : @RangedInt(i32, -1, (1 << 32) - 2) = 1;
(Note that I don't care particularly if it's (0..) or <[0-->]> or whatever :D I'm you know, just sayin')
I'm not sure how viable it is to use runtime-generated ranges like in my examples above, but if it was, then I feel that would be a "higher maxima" than stopping at compiletime support, which is worth considering.
I read a blog post about this some time ago that I think deserves some attention, link.
By creating a unique "int" type with infinite precision (conceptually a BigInt, but only while typechecking) we can avoid many (type) annotations and runtime checks normally expected for ranged integer proposals. Then, only at the end of all our calculations do we have to check that our number is still in the expected range, and if not, require that the programmer specify their desired narrowing method, so the compiler can automatically handle overflowing operations. For many calculations, the compiler can indeed ensure that our calculations remain in the representable range, as mentioned previously. The novel idea here is to allow an arbitrary amount of overflow to occur by deferring the error checking until absolutely necessary, to allow calculations to happen in the wonderful world of infinite precision integers for as long as possible.
The author of the blogpost envisioned two narrowing methods, and I would propose a third (which is already available in Zig):
fn @check(T: type, value: int) ?T;
fn @wrap(T: type, value: int) T;
fn @saturate(T: type, value: int) T; // Not mentioned in the blog post, but available in Zig.
// For "undefined behaviour", use:
@check(...) orelse unreachable;
I understand that there is already a separate issue for avoiding casts, #747. Once ranged integers are implemented as in the blog/outlined above, implementing this would be especially trivial:
test "runtime value hints" {
var a = 99999; // implicitly has type `int(99999..99999)`
const b = a % 256; // implicitly has type `int(0..255)`
const c: u8 = b; // allowed without @intCast because `int(0..255)` is always representable by `u8`
}
another, perhaps more interesting example:
const a = lhs & 0x7fff; // int(0..0x7fff)
const b = rhs & 0x7fff; // int(0..0x7fff)
const res1: u16 = a + b; // int(0..0xfffe), no runtime checks or annotations required!
const res2: u8 = @wrap(u8, a + b); // does not fit in u8, we have to specify the narrowing method explicitly.
const res2: u8 = (a + b) & 0xff; // should probably also be allowed, might warrant some discussion.
As for @Srekel's comment: I too absolutely believe that allowing arbitrary integer ranges everywhere would be amazing, although I do believe that might need to be its own issue. I have spent years now looking into this, but arbitrary expressions for integer ranges dip pretty deep into Dependent Type territory
I think this issue should stay focused on compile-time/compiler-inferred integer ranges for the purposes of avoiding casts and overflow checks.
Another thing mentioned in the blog and also by @Srekel, which I'd like to reiterate: Integer ranges should be declarable on each integer type, not just on int. `struct { waste_of_space: u64(0..255) }; // I know what I'm doing!` should be possible.
Sidenote, most math operations have trivial range invariants:
// trivial:
@min(int(a..b), int(c..d)) = int(min(a,c)..min(b,d))
@max(int(a..b), int(c..d)) = int(max(a,c)..max(b,d))
int(a..b) + int(c..d) = int(a+c..b+d)
int(a..b) - int(c..d) = int(a-c..b-d)
int(a..b) * int(c..d) = int(a*c..b*d)
int(a..b) << int(c..d) = int((a << c)..(b << d))
// tricky: (at least we know for sure the magnitudes don't change)
int(a..b) & int(c..d) = ???
int(a..b) | int(c..d) = ???
int(a..b) ^ int(c..d) = ???
// conservatively:
// p = prev power of two
// n = next power of two
int(a..b) & int(c..d) = int(p(min(a,c))..n(max(b,d)))
int(a..b) | int(c..d) = int(p(min(a,c))..n(max(b,d)))
int(a..b) ^ int(c..d) = int(0..n(max(b,d)))
// not sure:
// do the `/` bounds account for signed ints/rounding?
// also, as explained in the blog, these require additional effort with infinite precision (>64 bits).
int(a..b) / int(c..d) = int(a/d..b/c)
int(a..b) % int(c..d) = int(0..min(b,c))
int(a..b) >> int(c..d) = int((a >> d)..(b >> c))
These invariants don't actually have to be defined exactly, the only impact would be that programmers might have to annotate their calculations in a few situations where the compiler could theoretically have done without. For ^ in particular, I think that would be totally fine.
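One caveat on the "trivial" list: the multiplication rule a*c..b*d only holds when both ranges are non-negative; with signed bounds a compiler would take the min/max of the four corner products. A small runnable sketch with inclusive bounds (Range and mulRange are made-up names):
const std = @import("std");

const Range = struct { min: i128, max: i128 };

fn mulRange(x: Range, y: Range) Range {
    // The extreme products always occur at the corners of the two intervals.
    const corners = [4]i128{ x.min * y.min, x.min * y.max, x.max * y.min, x.max * y.max };
    var r = Range{ .min = corners[0], .max = corners[0] };
    for (corners) |c| {
        r.min = @min(r.min, c);
        r.max = @max(r.max, c);
    }
    return r;
}

test "signed ranges need the corner products" {
    const r = mulRange(.{ .min = -3, .max = 2 }, .{ .min = -5, .max = 4 });
    try std.testing.expect(r.min == -12 and r.max == 15);
}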
The novel idea here is to allow an arbitrary amount of overflow to occur by deferring the error checking until absolutely necessary
Does the blog mention Interval arithmetic and "Overapproximating of Reachable Sets"? It boils down to either 1. verification/over-approximation of the reachable set (which is expensive, leads to false positives or can be infeasible) or 2. runtime range checks on each usage, which is very expensive on necessary division (checking if a multiplication is out of range). Let alone the branch mispredictions in tight loops and the missed optimizations.
The "removing of bound checks" etc requires verification tools like absint https://www.absint.com/ unless the code defines more simple math properties.
Dependent Type territory
I'm curious what amount of complexity that requires (which is not what Zig aims for). By Gödel's incompleteness theorem https://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_theorems, we can't reason within a system about all properties of the system, so we must give up on reasoning about some semantics one way or the other.
Sidenote, most math operations have trivial range invariants:
In mathematics, an invariant is a property of a mathematical object (or a class of mathematical objects) which remains unchanged after operations or transformations of a certain type are applied to the objects. https://en.wikipedia.org/wiki/Invariant_(computer_science)
Your examples are the application of the operations, but not the change of reachable set required to infer or approximate an invariant.
I've been thinking a lot about this proposal recently - I'm in strong support of it, and wanted to quickly note down some of the things it would make possible which I think are nice, and some interesting points about it.
- As previously mentioned, this entirely eliminates the need for comptime_int. A literal 3 has type @Int(3, 4). PTR on int types @Int(a, b) and @Int(c, d) gives @Int(@min(a, c), @max(b, d)).
  - Note on representation: I believe the in-memory representation of an @Int(a, b) should be the smallest unsigned or signed twos-complement integer which can hold all values in that range. Except, maybe @Int(a, a+1) should be an exception and be a zero-bit type.
- All standard arithmetic operations can create a precise result type, making overflow impossible. That way, all possible overflow panics are explicitly written in code using @intCast.
  - Bitwise, wrapping, and perhaps saturating operations all require the operands to be a perfect uX or iX.
  - a % b where b is comptime-known has a result type of @Int(0, b).
- This proposal meshes much more nicely with the new type-refining behavior of @min and @max. It'll make clamping to ranges actually work, and make the result type more precise, which is in particular helpful for switching on it.
  - e.g. @max(10, @min(my_i32, 150)) currently can't refine the type at all on its own, and doing @min(@max(10, my_i32), 150) only gets you down to a u8. Under the proposal, both of these would give you the exact correct type, @Int(10, 151).
One reasonable concern regarding automatically determined result types is the runtime cost. However, I believe this should be fairly easy to optimize in release modes; if the result of a chain of arithmetic is @intCast down to a small type, most of the time it's safe for the operands to be truncated to that type. For instance, if the result of a standard multiplication is intcast down to a u64 from some larger unsigned integer, then either the result is zero (in which case any truncation preserves the result) or both operands also fit in a u64 (so can in turn be intcast down). There are similar simple rules for most operations and types, so it should be fairly easy to write an optimization pass for this if LLVM isn't capable of doing a good job by itself.
I had more thoughts in my head, but they've kinda gone now. I'll edit this comment if they come back to me.
EDIT: one thing to note is that in debug builds, I think this type-expanding logic could at times give better runtime performance, since for a string of integer operations, it elides any safety checks right up until the @intCast of the final result, whereas in status quo every operation would have an overflow check. The intermediate types might be a bit bigger, so maybe e.g. some arithmetic ops require a couple of instructions (again, only in debug; in release the optimizations discussed above apply), but that's probably better than an overflow check and branch!