rfcs icon indicating copy to clipboard operation
rfcs copied to clipboard

Generic Integers V2: It's Time

Open clarfonthey opened this issue 1 year ago • 46 comments

Summary

Adds the builtin types u<N> and i<N>, allowing integers with an arbitrary size in bits.

Rendered

Details

This is a follow-up to #2581, which was previously postponed. A lot has happened since then, and there has been general support for this change from a lot of different people. It's time.

There are a few key differences from the previous RFC, but I trust that you can read.

Thanks

Thank you to everyone who responded to the pre-RFC on Internals with feedback.

clarfonthey avatar Sep 01 '24 19:09 clarfonthey

As much as I dislike as casts and would prefer a better solution, until that solution exists (in the standard library), this is probably the best way to go.

:+1: from me

jhpratt avatar Sep 05 '24 07:09 jhpratt

Even if we should probably leave them out of the initial RFC for complexity reasons, I would just cheat with floats, as they rely on system libraries and hardware instructions way more than regular integers. By that, I mean that I'd allow f<32> for consistency reasons, but only those that are actually supported would compile; i.e., f<3> would throw a compile-time error (it could either be done at monomorphisation time, or disallowing const generics on that one). Overall, I think this RFC is in the right track, but I'd postpone it until we're past Rust 2024.

Alonely0 avatar Sep 05 '24 16:09 Alonely0

Overall, I think this RFC is in the right track, but I'd postpone it until we're past Rust 2024.

Are you proposing delaying the discussion or the implementation?

My understanding is that with a release early 2025, Rust 2024 will be done by mid November, which is only 2 months away, and it seems quite unlikely this RFC would be accepted and implementation ready to start by then, so I see no conflict with regard to starting on the implementation...

... but I could understand a focus on the edition for the next 2 months, and thus less bandwidth available for discussing RFCs.

matthieu-m avatar Sep 05 '24 17:09 matthieu-m

Even if we should probably leave them out of the initial RFC for complexity reasons, I would just cheat with floats, as they rely on system libraries and hardware instructions way more than regular integers. By that, I mean that I'd allow f<32> for consistency reasons, but only those that are actually supported would compile; i.e., f<3> would throw a compile-time error (it could either be done at monomorphisation time, or disallowing const generics on that one).

The problem with this approach is that any "cheating" becomes permanently stabilised, and thus, it's worth putting in some thought for the design. This isn't to say that f<N> is a bad design (I personally don't like it, but I won't fault people for wanting to use it), but rather that u<N> and i<N> are good designs in several ways that f<N> is not.

Plus, monomorphisation-time errors were actually one of the big downsides to the original RFC, and I suspect that people haven't really changed their thoughts since then. Effectively, while it's okay to allow some of edge-case monomorphisation-time errors like this RFC includes (for example, asking for u<0xFFFF_FFFF_FFFF> is a hard error, since it's larger than u32::MAX), but not extremely-common errors like just asking for f<N> where N is anything that isn't 16, 32, 64, or 128.

One potential solution that was proposed for unifying u<N>, i<N>, usize, and isize was to have some separate ADT that encapsulates signedness and has different options for "size" and N. This kind of solution feels promising for generic floats since it means that you could have an impl like:

impl<const F: FloatKind> MyTrait for f<F> {
    // ...
}

And it would support all float types, forever, and there would be no invalid values for F since we've explicitly defined it. However, this requires const_adt_params which is currently unstable.

Overall, I think this RFC is in the right track, but I'd postpone it until we're past Rust 2024.

As stated: yes, RFCs take time to discuss and implement and it's very reasonable to expect people to focus on the 2024 edition for now. However, that doesn't mean that we can't discuss this now, especially since there are bound to be things that were missed that would be good to point out.

clarfonthey avatar Sep 05 '24 18:09 clarfonthey

I love this!

One point that is touched upon here is aliases for uN <=> u<N>.

I think that'd be super valuable to have. Rust already has a lot of symbols and being able to not use the angle brackets makes sure that the code is much calmer to look upon. It's also not the first explicit syntax sugar since an async fn is treated the same as fn -> impl Future in a lot of places.

Having the aliases also allows for this while keeping everything consistent:

fn foo<const N: usize>(my_num: u<N>) { ... }

foo(123); // What is the bit width? u32 by default?
foo(123u7); // Fixed it

diondokter avatar Sep 06 '24 13:09 diondokter

I love this!

One point that is touched upon here is aliases for uN <=> u<N>.

I think that'd be super valuable to have. Rust already has a lot of symbols and being able to not use the angle brackets makes sure that the code is much calmer to look upon. It's also not the first explicit syntax sugar since an async fn is treated the same as fn -> impl Future in a lot of places.

I agree with you, just didn't want to require them for the initial RFC, since I wanted to keef it simple. Ideally, the language will support uN aliases as well as uN suffixes.

clarfonthey avatar Sep 06 '24 14:09 clarfonthey

Two things that came to mind:

  1. Are there any issues with the self-referentiality of these types? Although usize is a distinct type, one could easily imagine wanting to make it a "new-type wrapper" around the appropriate integer type, which would make a circular dependency between the two implementations. We could say that usize is not implemented that way, but then it's surprising to me that usize would be the "foundation" rather than the other way around.
  2. Even though LLVM can express integers of arbitrary size, it seems unlikely that these types have seen extensive use with unusual sizes. Maybe these integer types should be lowered to common integer types within rustc, so that backends can be simplified.

Diggsey avatar Sep 07 '24 15:09 Diggsey

When the last RFC was postponed, the stated reason was waiting for pure library solutions to emerge and letting the experience with those inform the design. I don't really see much of this in the current RFC, so here's a bunch of questions about it. It would also be great if some non-obvious design aspects of the RFC (such as limits on N, whether and how post-monomorphization errors work, padding, alignment, etc.) could be justified with experience from such libraries.

So, I agree that this was one of the reasons, but it's worth reiterating that also, at that time, const generics weren't even stable. We had no idea what the larger ecosystem would choose to do with them, considering how many people were waiting for stabilisation to really start using them. (We had an idea of what was possible, but not what would feel most ergonomic for APIs, etc.)

So, I personally felt that the library solution idea was mostly due to that fact that we didn't really know what libraries would do with const generics. And, overwhelmingly, there hasn't been much interest in it for what I believe to be the most compelling use case: generalising APIs without using macros, which right now cannot really be done without language support.

clarfonthey avatar Sep 07 '24 15:09 clarfonthey

By itself, the observation that the ecosystem isn't rushing to write and use such libraries doesn't tell us much. What if there just isn't much demand for generalizing code like that? On the other hand, if people have tried to write this code, and have run into concrete problems that they can't solve with more library code, then that is useful information for extending the language. Our discussion above has already focused on this. In addition, if there's any practical experience with API design, regardless of QoI problems like runtime performance / compile time / compiler diagnostics, then it would be useful to incorporate that into the RFC.

hanna-kruppe avatar Sep 07 '24 16:09 hanna-kruppe

  1. Are there any issues with the self-referentiality of these types? Although usize is a distinct type, one could easily imagine wanting to make it a "new-type wrapper" around the appropriate integer type, which would make a circular dependency between the two implementations. We could say that usize is not implemented that way, but then it's surprising to me that usize would be the "foundation" rather than the other way around.

I mean, ultimately, the compiler can pretty easily deal with this kind of self-referencing by nature of what it is. Whenever the compiler even sees u<N> or usize, it's immediately going to replace them internally with some native backend integer type, and be able to triage integer intrinsics just fine. In fact, it already does this for usize: the difference between usize and uN, where N is the target pointer width, is effectively zero outside of library methods, where they're explicitly different types due to API concerns.

It would be an issue to have, for example, Struct<S> where S: Struct<S>, since it's unclear how the compiler can figure what S is without first figuring out what S is. But the compiler already knows what usize is, so, it doesn't need to try to figure it out before continuing.

  1. Even though LLVM can express integers of arbitrary size, it seems unlikely that these types have seen extensive use with unusual sizes. Maybe these integer types should be lowered to common integer types within rustc, so that backends can be simplified.

I mean, even C is adding these now as features, so, I think that regardless of whether the compilers are robust enough, they will be robust enough by the time an implementation in Rust comes to fruition.

clarfonthey avatar Sep 07 '24 16:09 clarfonthey

I mean, ultimately, the compiler can pretty easily deal with this kind of self-referencing by nature of what it is. Whenever the compiler even sees u<N> or usize, it's immediately going to replace them internally with some native backend integer type, and be able to triage integer intrinsics just fine. In fact, it already does this for usize: the difference between usize and uN, where N is the target pointer width, is effectively zero outside of library methods, where they're explicitly different types due to API concerns.

That's not true. usize and u64 remain distinct types throughout the compiler, and they become the same only during LLVM IR generation. See e.g. UintTy, which indicates "which kind of unsigned integer" a TyKind::Uint is.

RalfJung avatar Sep 09 '24 14:09 RalfJung

When stored, u<N> should always zero-extend to the size of the type and i<N> should always sign-extend. This means that any padding bits for u<N> can be expected to be zero, but padding bits for i<N> may be either all-zero or all-one depending on the sign.

how does this interact with enum layout optimizations? can Option<i<7>> be represented in a single byte?

lolbinarycat avatar Sep 09 '24 20:09 lolbinarycat

how does this interact with enum layout optimizations? can Option<i<7>> be represented in a single byte?

that i<7> is stored sign-extended to i8 implies that, yes, Option<i<7>> can (but is not required to) be stored in a single byte since the compiler can pick any of -128..-64 or 64..128 to represent None since those are all out of range of i<7>

programmerjake avatar Sep 09 '24 20:09 programmerjake

wrong but left for posterity

how does this interact with enum layout optimizations? can Option<i<7>> be represented in a single byte?

that i<7> is stored sign-extended to i8 implies that, yes, Option<i<7>> can (but is not required to) be stored in a single byte since the compiler can pick any of -128..-64 or 64..128 to represent None since those are all out of range of i<7>

So, it's slightly more complicated than that; -128i8 and -64i7 have the same bit pattern in memory: all ones. However, you're right that there are values in there that don't, like positive 64 (which is 0b1_0000000, and thus invalid since the 0 in the 7th bit doesn't match the 1 in the 8th bit.

So, really, the valid ranges ranges are, assuming we're treating the number as an i8, -64..0 and 64..128.


That's not true. usize and u64 remain distinct types throughout the compiler, and they become the same only during LLVM IR generation. See e.g. UintTy, which indicates "which kind of unsigned integer" a TyKind::Uint is.

This is where I definitely should have been more careful in my wording: I meant that the backend doesn't distinguish them, but you're right that the type system needs to distinguish them all the way until the code has all been generated. My statement was basically wrong as-is.

clarfonthey avatar Sep 09 '24 20:09 clarfonthey

Why do they even have that lever?

clarfonthey avatar Sep 09 '24 20:09 clarfonthey

So, it's slightly more complicated than that; -128i8 and -64i7 have the same bit pattern in memory: all ones.

you need to check your math, -1 has all ones, but -64i7 as u8 is 0xC0 and -128i8 as u8 is 0x80 -- neither of them are remotely all ones nor the same bit pattern.

programmerjake avatar Sep 09 '24 21:09 programmerjake

So, it's slightly more complicated than that; -128i8 and -64i7 have the same bit pattern in memory: all ones.

you need to check your math, -1 has all ones, but -64i7 as u8 is 0xC0 and -128i8 as u8 is 0x80 -- neither of them are remotely all ones nor the same bit pattern.

You're right 🤦🏻

clarfonthey avatar Sep 09 '24 21:09 clarfonthey

I want to chime in here and link to another library: https://github.com/danlehmann/arbitrary-int That lib is used in both @danlehmann's and my own bitfield libs.

I would say the main painpoints of using a library here are all the conversions between different number sizes (and in the case of bitfields, structs) and the ergonomics (e.g. https://github.com/hecatia-elegua/bilge/issues/75).

Another nice byproduct of this RFC would be the constification of conversions between arbitrary ints by design. For some time I used the unstable const-trait-impl with const-from for converting between things, but that's being reworked now and not currently available.

hecatia-elegua avatar Sep 11 '24 12:09 hecatia-elegua

Hey, I'm the author of https://crates.io/crates/arbitrary-int . It seems like this proposal has some overlap with what I've built as a crate, so I can talk a bit about the hurdles I've run into.

Arbitrary-int has a generic type UInt<T, const BITS: usize>. T refers to the underlying type, which is usually the next larger native integer.

It also provides types to shorten the name. For example u3 becomes a shortcut for UInt<u8, 3>, u126 is a shortcut for UInt<u128, 126>. The types all pick the next largest built-in type, though it's also possible to specify this directly through e.g. UInt<u32, 6>.

It also provides a Number trait which unifies arbitrary-ints with regular ints, allowing completely generic programming over all of those types.

In general, implementing this as a crate worked pretty well, but there are some downsides:

  • Implementing From and TryFrom: This can't be done with the current type system. Once you implement From, the standard lib automatically adds TryFrom; I tried many ways of limiting this based on the size of the BITS involved but the type system always complains about multiple implementation. I think the specialization feature would allow me to address that, but it's unstable.
  • Lack of const generics: The other issue with From is that it's a trait which can't be used in const contexts; to work around that, arbitrary-int provides another function new which IS const and which allows creating a type directly.
  • No special syntax: As it's not a built-in feature, I can't provide custom suffixes like 1u4. So the shortest ways to create a number is generally u4::new(1). Not horrible, but not quite native.

danlehmann avatar Sep 11 '24 19:09 danlehmann

Hey, I'm the author of https://crates.io/crates/arbitrary-int . It seems like this proposal is very close to what I've built as a crate (I think arbitrary-int is also the closest to this rfc), so I can talk a bit about the hurdles I've run into.

Arbitrary-int has a generic type UInt<T, const BITS: usize>. T refers to the underlying type, which is usually the next larger native integer.

It also provides types to shorten the name. For example u3 becomes a shortcut for UInt<u8, 3>, u126 is a shortcut for UInt<u128, 126>. The types all pick the next largest built-in type, though it's also possible to specify this directly through e.g. UInt<u32, 6>.

It also provides a Number trait which unifies arbitrary-ints with regular ints, allowing completely generic programming over all of those types.

In general, implementing this as a crate worked pretty well, but there are some downsides:

* Implementing `From` and `TryFrom`: This can't be done with the current type system. Once you implement `From`, the standard lib automatically adds `TryFrom`; I tried many ways of limiting this based on the size of the BITS involved but the type system always complains about multiple implementation. I think the `specialization` feature would allow me to address that, but it's unstable.

* Lack of const generics: The other issue with `From` is that it's a trait which can't be used in const contexts; to work around that, arbitrary-int provides another function `new` which IS const and which allows creating a type directly.

* No special syntax: As it's not a built-in feature, I can't provide custom suffixes like `1u4`. So the shortest ways to create a number is generally `u4::new(1)`. Not horrible, but not quite native.

Also due to my design decision to base everything on a simple types (no arrays), the maximum number of bits supported is u127.

danlehmann avatar Sep 11 '24 19:09 danlehmann

Also due to my design decision to base everything on a simple types (no arrays), the maximum number of bits supported is u127.

I hadn't actually read the code yet, but I'm actually a bit curious why the max number of bits is 127 instead of 128. This feels like a weird restriction.

clarfonthey avatar Sep 12 '24 20:09 clarfonthey

Also due to my design decision to base everything on a simple types (no arrays), the maximum number of bits supported is u127.

I hadn't actually read the code yet, but I'm actually a bit curious why the max number of bits is 127 instead of 128. This feels like a weird restriction.

It is 128 bits actually. UInt<u128, 128> works just fine (though it is just somewhat useless as you might as well use the actual u128). My main point was that larger numbers aren't possible, unlike it e.g. ruint, which operates on arrays.

danlehmann avatar Sep 12 '24 20:09 danlehmann

By the way, I love this RFC! While arbitrary-int (as well as ux) provide the unusually-sized ints like u48 etc, having a built-in solution will feel more natural and allows to treat numbers in a much more unified fashion, which I'm looking forward to.

danlehmann avatar Sep 12 '24 20:09 danlehmann

(replying to https://github.com/rust-lang/rfcs/pull/3686#discussion_r1787372929 here so it's less likely to get lost when that thread gets resolved)

How often do you need an array with one element per bit?

when doing SIMD with bitmasks -- all the time. using Mask<T, N> and Simd<T, N> will be more common, but using [T; N] and u<N> will still be decently common. Mask<T, N> on many architectures is just a wrapper over u<N> (currently done with macros for a limited set of N), and Simd<T, N> is basically the same as [T; N] except with perhaps more alignment and better codegen.

In particular this is a big motivation for using usize for N in u<N>, since that's what all of Simd<T, N>, Mask<T, N>, and [T; N] need.

programmerjake avatar Oct 04 '24 10:10 programmerjake

I am not super familiar with core::simd, so please correct me if I misunderstood something. I understand that the N in Simd<T, N> and Mask<T, N> needs to be usize because both often occur in type signatures together and Simd<T, N> is closely tied to [T; N] in various ways. But mixing those types with u<N> seems much more niche, for reasons that I think are pretty fundamental.

From looking at the current implementation it seems that Mask<T, N> is a wrapper for Simd<T, N> on every target except x86 with AVX-512. That's what I would expect, since SIMD masks on most architectures really want to stay in SIMD registers or in dedicated mask registers when they exist -- and the latter are more like <i1 x N> than uN. Even AVX-512 has the k-regs, but for various reasons it's common to pass masks around as integers. Punning between masks and integers is sometimes useful, but it's not free (even on x86, even more so in the rest of the computing world) and should only be done very deliberately. That's one reason why Mask<T, N> and its equivalents in libraries like memchr and hashbrown have a lot of helper methods instead of just conversions to/from integers.

As long as masks are a separate type with platform-dependent representation, it seems to me that the interaction with u<N> is only skin-deep. It could simplify a few functions in the public API, but otherwise masks will still revolve around a cfg'd newtype or associated type. Users who want abstract over lane count will also be limited in how much they can equivocate between masks and integers: they may want a method like fn to_bits(self) -> u<N> on Mask<T, N> but it's not clear to me whether a return type like u<{N as u32}> would be much worse for them. Ideally there would be no casts anywhere, but as @scottmcm points out, choosing usize also requires casts for some reasonable use cases.

hanna-kruppe avatar Oct 04 '24 12:10 hanna-kruppe

Honestly, if the arguments for usize and u32 are equally compelling, I personally think the tie should be broken toward usize, for the simple reason that it is effectively the default type for generic constants. The number of bits makes sense as an unsigned size, and that's the full reasoning.

I do think that it would be worthwhile to choose one over the other if there actually were a lot of cases where it were valuable, and SIMD does feel compelling enough to be valuable, but honestly, it's kind of tied at this point.

I personally think that choosing u32 for a lot of the integer methods that involve bits was a mistake, but we're stuck with it for better or worse, and it feels like no matter what we do, there is going to be some inconsistency somewhere. So, might as well conceptually treat it as an "array of bits" IMHO.

Like realistically, if we were choosing an appropriate size for the count of bits, we should have chosen u16 instead of u32, since more than 8KiB in a single integer that isn't a big-integer feels kind of absurd. But again, hindsight doesn't really matter when deciding this; we just have to figure out a compromise.

clarfonthey avatar Oct 04 '24 17:10 clarfonthey

Using u16 would have the advantage that conversion to u32 and to usize is always lossless while the opposite direction isn’t. usize::from(x) isn’t a const expression today but the monomorphic case could be allowed without touching the whole “maybe-const trait bounds” subject. It’s still too wordy to seriously propose (without adding implicit widening, which is a whole other can of worms). But if we ever get usize: From<u32> then that would be a point in favor of u32 in my opinion.

Wild idea: is there anything forcing the u<N> and i<N> constructors to pick a single type for N? If they’re mostly compiler magic anyway, couldn’t they accept any unsigned integer? For example, can u<7u32> and u<7usize> be the same type? That might resolve most (not all) clashes of u32 vs usize in const generic APIs.

hanna-kruppe avatar Oct 04 '24 17:10 hanna-kruppe

Nevermind that last bit, the lang item might not have to pick a type for N, but all the impl blocks involving those types have to pick a type for their const generic parameter, so fn foo<const N: u8>(x: u<N>) couldn’t do anything with it’s argument.

hanna-kruppe avatar Oct 04 '24 17:10 hanna-kruppe

I mean, it would be nice if you could allow {N as T} expressions in const generics as a special case when there's no truncation, under the assumption that it's always unambiguous, but again, this runs into all the other problems with const generics and unification.

I don't think it'd be a good idea to delay deciding what the type of the parameter should be in this case, although we might end up getting the above anyway before this feature is actually implemented, since it's currently one of the project goals; see https://github.com/rust-lang/rust-project-goals/issues/100

clarfonthey avatar Oct 04 '24 18:10 clarfonthey

You mean carve it out as something that doesn't even need any of the implementation challenges to be solved because you could mostly ignore the cast completely? Might work, might cause horrible problems down the line, I don't know. But I'm already working under the assumption that "MVP generic const expressions" is a prerequisite for generic integers to be stabilized and widely used. Starting from a smaller type doesn't buy you anything in terms of language design or type system interactions,, but it's a nicer model for users (well, for me at least) because it means far fewer places where my "could this truncate?" spidey sense tingles.

hanna-kruppe avatar Oct 04 '24 18:10 hanna-kruppe