Integer Manipulation API

Open asquared31415 opened this issue 1 year ago • 11 comments

Proposal

Problem statement

Logically there are 6 different behaviors that a conversion between two integers may have:

  1. reinterpret bits
  2. truncate bits
  3. zero extend bits
  4. sign extend bits
  5. keep numerical value and saturate if out of range
  6. keep numerical value and panic if out of range

as-casts implement the first four of these possible behaviors, but can only express one of these behaviors for each pair of types T as U. TryFrom can express behaviors 5 and 6 with the help of some extra code.

This API aims to implement all of behaviors 1 through 4 on every possible pair of integer types, using code that more directly expresses the desired behavior and can be combined to express more behaviors.
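
As a point of reference, here is a minimal sketch of how the six behaviors look with today's tools. The function names (`saturate_to_u8`, `checked_to_u8`, `as_cast_examples`) are made up for illustration; only `as` and `TryFrom` come from the language and standard library.

```rust
use std::convert::TryFrom;

/// Behavior 5: keep the numerical value, saturating at the bounds of `u8`.
fn saturate_to_u8(x: i32) -> u8 {
    // `try_from` fails exactly when `x` is outside 0..=255.
    u8::try_from(x).unwrap_or(if x < 0 { u8::MIN } else { u8::MAX })
}

/// Behavior 6: keep the numerical value, panicking if it does not fit.
fn checked_to_u8(x: i32) -> u8 {
    u8::try_from(x).expect("value out of range for u8")
}

/// Behaviors 1 to 4, each selected implicitly by the pair of types given to `as`.
fn as_cast_examples() -> (u8, u8, u16, i16) {
    let reinterpreted = -1_i8 as u8; // 1. same size: bits kept, value becomes 255
    let truncated = 0x1234_u16 as u8; // 2. larger to smaller: keeps the low byte 0x34
    let zero_extended = 0xFF_u8 as u16; // 3. unsigned source: high bits are zero
    let sign_extended = -1_i8 as i16; // 4. signed source: high bits copy the sign bit
    (reinterpreted, truncated, zero_extended, sign_extended)
}
```

Note that nothing in the four `as` expressions names the operation being performed; it follows entirely from the two types involved.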

Motivation, use-cases

Currently, converting between integer types can be done in two ways:

  1. as-casts, which have a well defined effect for each pair of types
  2. Manual bit manipulation to get the bits the way that you want, then using as casts or the {to,from}_bytes APIs on different integer types, which also involves making an array larger/smaller

Option 1 uses as, which can be undesirable due to its leniency in input types, willingness to silently change behavior if types change, and restricted sets of behaviors.

Option 2 requires manual bit manipulation, even when that manipulation shouldn't need to be complicated. Even worse, it requires expanding or shrinking an array, which is difficult to do concisely.
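
To make Option 2 concrete, here is a rough sketch of sign extending an i16 into a u32 through byte arrays. The function name is hypothetical; `to_le_bytes` and `from_le_bytes` are the existing std methods.

```rust
/// Sign-extends an `i16` into a `u32` by hand, going through byte arrays.
fn sign_extend_i16_to_u32_via_bytes(val: i16) -> u32 {
    let src = val.to_le_bytes(); // [u8; 2]
    // The fill byte replicates the sign bit: 0xFF for negative values, 0x00 otherwise.
    let fill: u8 = if val < 0 { 0xFF } else { 0x00 };
    // Growing the array is the awkward part: allocate the larger array,
    // then copy the low bytes into place.
    let mut dst = [fill; 4];
    dst[..2].copy_from_slice(&src);
    u32::from_le_bytes(dst)
}
```

This produces the same result as `val as u32`, but takes several lines and an intermediate array to say so.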

The ability to express any combination of truncation, zero extending, sign extending, and bit reinterpretation with code that can be checked to do the correct behavior at compile time is better than either of these current solutions, even if it is more wordy.

Use cases:

// sign extending `val` from an i16 to a u32

// works fine, but you have to know all the `as` behaviors
// if `val` changes to an i64, this silently truncates
val as u32;

// declares what behavior it wants
// if `val` changes to i64, this no longer compiles
// "extending" to a smaller type does not make sense
val.sign_extend::<i32>().cast::<u32>();

// zero extending `val` from an i16 to a u64

// unclear why this goes through u16
// if `val` changes to i32 this no longer zero extends, it truncates in the middle
val as u16 as u64;

// declares what behavior it wants
// if `val` changes to i32 it compiles and continues to zero extend
val.zero_extend::<i64>().cast::<u64>();

// The dangers of `as` when used carelessly

fn convert(x: u32) -> i32 {
    x as _ // reinterprets
}
// changes to:
fn convert(x: u32) -> i64 {
    x as _ // now does a zero extension because the inferred type changed!
}


// The new API adds guarantees about what operations happen

fn convert(x: u32) -> i32 {
    x.cast() // reinterprets
}
// changes to:
fn convert(x: u32) -> i64 {
    x.cast() // does not compile, the sizes are not the same
}

Solution sketches

In each of these examples, assume that Self is an integer type and that the target type U is also an integer.

(name might need improvement, bit_cast?)

fn cast<U>(self) -> U
  • Converts one integer type into an integer type with the same size by bit casting
  • Only exists for pairs of integers with the same size (i8 -> u8, u128 -> i128, etc) because that's the only unambiguous bit cast behavior
    • COUNTEREXAMPLES: u8 -> i16 or i32 -> i64
  • Does not exist to increase size of integers, use zero_extend or sign_extend instead (should be documented)
  • Does not exist to decrease size of integers, use truncate instead (should be documented)
  • Does not preserve numerical value (should be documented)
  • The identity cast is supported (u8 -> u8), even though it's not very useful
fn zero_extend<U>(self) -> U
  • Extends an integer type into a larger integer type by filling in the high bits with zeros
  • Only exists for pairs of integers where the target type is strictly larger than the self type and the signedness does not change (u8 -> u16, i32 -> i64, etc)
  • Does not exist for same size, smaller size, or changed signs
    • COUNTEREXAMPLES: i8 -> u16, u8 -> i8, u64 -> u32
  • Does not preserve numerical value for negative values of signed types (this should be documented with a big, noticeable, red, flashy block)
  • Always preserves numerical value for unsigned types
fn sign_extend<U>(self) -> U
  • Extends an integer type into a larger integer type by filling in the high bits with copies of the sign bit of the self type
  • Only exists for pairs of signed integers where the target type is strictly larger than the self type (i8 -> i16, i16 -> i128)
  • Does not exist for equal or smaller target types, does not exist to change sign
    • COUNTEREXAMPLES: i8 -> i8, i64 -> i16, i8 -> u128
  • Does not exist for unsigned integers (there's no sign to extend), use zero_extend instead (this should be documented)
  • Always preserves numerical value as a result of integers using 2's complement (this should be documented)
fn truncate<U>(self) -> U
  • Converts from one integer type into a smaller integer type by truncating the high bits
  • Only exists for pairs of integers where the target type is strictly smaller than the self type and where the signedness of the integers does not change (u64 -> u16, i128 -> i32, etc)
  • Does not exist for same size or larger target types as there is no truncating operation, use zero_extend or sign_extend instead (should be documented)
  • Does not exist for converting signs, use cast instead (should be documented)
  • Does not necessarily preserve numerical value for any types (should be documented)
  • The reason for not allowing signedness changes is to prevent some surprising behavior. For example -1_i16 truncated to u8 directly via as-casting would be 255_u8. All sign changing behavior should be done with cast: -1_i16.truncate::<i8>().cast::<u8>() == 255_u8
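
One possible way to prototype these four methods outside std is sketched below. Everything here is an assumption for illustration: the trait names (`Cast`, `ZeroExtend`, `SignExtend`, `Truncate`, `IntManip`), the `*_impl` method names, and the macros are invented, and only a handful of pairs are implemented. The proposal itself does not specify an implementation strategy.

```rust
// Hypothetical traits, not part of std. An impl exists only for permitted
// (Self, U) pairs, so an unsupported conversion fails to compile.
pub trait Cast<U> { fn cast_impl(self) -> U; }
pub trait ZeroExtend<U> { fn zero_extend_impl(self) -> U; }
pub trait SignExtend<U> { fn sign_extend_impl(self) -> U; }
pub trait Truncate<U> { fn truncate_impl(self) -> U; }

// Hypothetical extension trait providing the method names from the sketch.
pub trait IntManip: Sized {
    fn cast<U>(self) -> U where Self: Cast<U> {
        <Self as Cast<U>>::cast_impl(self)
    }
    fn zero_extend<U>(self) -> U where Self: ZeroExtend<U> {
        <Self as ZeroExtend<U>>::zero_extend_impl(self)
    }
    fn sign_extend<U>(self) -> U where Self: SignExtend<U> {
        <Self as SignExtend<U>>::sign_extend_impl(self)
    }
    fn truncate<U>(self) -> U where Self: Truncate<U> {
        <Self as Truncate<U>>::truncate_impl(self)
    }
}

// Implements `$tr<$to> for $from` with a single `as`. This is only correct
// because each invocation lists pairs where `as` has the wanted behavior.
macro_rules! impl_via_as {
    ($tr:ident :: $method:ident, $($from:ty => $to:ty),+ $(,)?) => {
        $(impl $tr<$to> for $from {
            fn $method(self) -> $to { self as $to }
        })+
    };
}

// Zero extension of a signed source must first reinterpret it as unsigned,
// otherwise `as` would sign extend.
macro_rules! impl_zero_extend_signed {
    ($(($from:ty, $ufrom:ty) => $to:ty),+ $(,)?) => {
        $(impl ZeroExtend<$to> for $from {
            fn zero_extend_impl(self) -> $to { self as $ufrom as $to }
        })+
    };
}

macro_rules! impl_int_manip {
    ($($t:ty),+) => { $(impl IntManip for $t {})+ };
}

// A small subset of the pairs listed in the proposal.
impl_via_as!(Cast::cast_impl, i8 => u8, i32 => u32, i64 => u64);
impl_via_as!(SignExtend::sign_extend_impl, i16 => i32);
impl_via_as!(Truncate::truncate_impl, i16 => i8);
impl_via_as!(ZeroExtend::zero_extend_impl, u8 => u16);
impl_zero_extend_signed!((i16, u16) => i64);
impl_int_manip!(u8, i8, i16, i32, i64);

/// The two use cases from the motivation section, written with the sketch.
pub fn use_cases(val: i16) -> (u32, u64) {
    let sign_extended = val.sign_extend::<i32>().cast::<u32>();
    let zero_extended = val.zero_extend::<i64>().cast::<u64>();
    (sign_extended, zero_extended)
}
```

With this subset, `1_u8.cast::<i16>()` or `1_i64.sign_extend::<i16>()` would be rejected at compile time because no matching impl exists, which is the property the proposal is after.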

Interactions with usize and isize

usize and isize have target-dependent widths, which complicates interactions with them. In the interest of keeping the methods consistent between targets and not introducing more surprising behavior, usize and isize can only be truncated to a u8 or i8 and can only be extended from a u8 or i8. The cast method will consider usize and isize to be the same size as each other, but not the same size as any other type (even if that is true on the current target). See below or the full implementation list for more details.

The reasoning behind this specific choice is that the minimum possible size for usize and isize is 16 bits, but there is no maximum size. usize and isize therefore cannot be reliably truncated to any type larger than 8 bits (they might not be large enough to truncate) and cannot be reliably extended from any type larger than 8 bits (they might not be large enough to hold the source). usize and isize may not be extended into any other type because the target type cannot reliably be larger than the source. Even though usize and isize are always at least 16 bits, they do not have operations to truncate to 16-bit integers or extend from 16-bit integers, because these operations would be a no-op on some targets but not on others.
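
The target dependence can be seen with today's `as`-casts. The sketch below uses hypothetical helper names; `usize::BITS` and `u32::BITS` are the existing std constants.

```rust
/// Reports which operation `x as u32` performs for a `usize` on the current target.
fn describe_usize_to_u32() -> &'static str {
    if usize::BITS > u32::BITS {
        "truncates"
    } else if usize::BITS == u32::BITS {
        "no-op"
    } else {
        "zero extends"
    }
}

/// Always a truncation, because `usize` is at least 16 bits wide.
fn low_byte(x: usize) -> u8 {
    x as u8
}

/// Always a zero extension, for the same reason.
fn from_byte(b: u8) -> usize {
    b as usize
}
```

Only the last two conversions mean the same thing everywhere, which matches the usize and isize pairs the proposal allows for truncate and zero_extend.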

Supported Operations

Click to open (warning: long)

cast:

  • u8 -> u8
  • i8 -> i8
  • u8 -> i8
  • i8 -> u8
  • u16 -> u16
  • i16 -> i16
  • u16 -> i16
  • i16 -> u16
  • u32 -> u32
  • i32 -> i32
  • u32 -> i32
  • i32 -> u32
  • u64 -> u64
  • i64 -> i64
  • u64 -> i64
  • i64 -> u64
  • u128 -> u128
  • i128 -> i128
  • u128 -> i128
  • i128 -> u128
  • usize -> usize
  • usize -> isize
  • isize -> isize
  • isize -> usize

zero_extend:

  • u8 -> u16
  • u8 -> u32
  • u8 -> u64
  • u8 -> u128
  • i8 -> i16
  • i8 -> i32
  • i8 -> i64
  • i8 -> i128
  • u16 -> u32
  • u16 -> u64
  • u16 -> u128
  • i16 -> i32
  • i16 -> i64
  • i16 -> i128
  • u32 -> u64
  • u32 -> u128
  • i32 -> i64
  • i32 -> i128
  • u64 -> u128
  • i64 -> i128
  • u8 -> usize
  • i8 -> isize

sign_extend:

  • i8 -> i16
  • i8 -> i32
  • i8 -> i64
  • i8 -> i128
  • i16 -> i32
  • i16 -> i64
  • i16 -> i128
  • i32 -> i64
  • i32 -> i128
  • i64 -> i128
  • i8 -> isize

truncate:

  • u16 -> u8
  • i16 -> i8
  • u32 -> u8
  • u32 -> u16
  • i32 -> i8
  • i32 -> i16
  • u64 -> u8
  • u64 -> u16
  • u64 -> u32
  • i64 -> i8
  • i64 -> i16
  • i64 -> i32
  • u128 -> u8
  • u128 -> u16
  • u128 -> u32
  • u128 -> u64
  • i128 -> i8
  • i128 -> i16
  • i128 -> i32
  • i128 -> i64
  • usize -> u8
  • isize -> i8

Comparison to as-casts

For reference, here are the behaviors of as-casts on integer types:

  • Casting between two integers of the same size (e.g. i32 -> u32) is a no-op (Rust uses 2's complement for negative values of fixed integers)
  • Casting from a larger integer to a smaller integer (e.g. u32 -> u8) will truncate
  • Casting from a smaller integer to a larger integer (e.g. u8 -> u32) will
    • zero-extend if the source is unsigned
    • sign-extend if the source is signed

All current possible operations using as-casts can be replicated with at most two of these functions chained together except for certain operations with usize and isize (see above). Examples (all types explicitly documented, type inference may make this cleaner):

  • u32 as i32 becomes u32.cast::<i32>()
  • i16 as u8 becomes i16.truncate::<i8>().cast::<u8>()
  • u128 as u32 becomes u128.truncate::<u32>()
  • u32 as i64 becomes u32.zero_extend::<u64>().cast::<i64>()
  • i64 as u128 becomes i64.sign_extend::<i128>().cast::<u128>()
    • however this API allows for the following alternate behavior, which is only possible via multiple as-casts
    • i64.zero_extend::<i128>().cast::<u128>() which zero extends the value rather than sign extends. The equivalent behavior with as-casts is i64 as u64 as u128
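
The equivalences above can be checked with plain `as`-casts by spelling out each intermediate step. The function names are invented for this sketch.

```rust
/// What `x as u128` does for an `i64`: sign extend, then reinterpret.
fn sign_extend_then_reinterpret(x: i64) -> u128 {
    x as i128 as u128
}

/// The alternate behavior: reinterpret as unsigned first, then zero extend.
fn zero_extend_then_reinterpret(x: i64) -> u128 {
    x as u64 as u128
}

/// What `x as u8` does for an `i16`: truncate, then reinterpret.
fn truncate_then_reinterpret(x: i16) -> u8 {
    x as i8 as u8
}
```

For `x = -1`, the first function yields `u128::MAX` while the second yields `u64::MAX as u128`, so the two chains differ exactly when the source is negative.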

Links and related work

~~Entirely supersedes https://github.com/rust-lang/libs-team/issues/183. to_signed and to_unsigned are representable with cast.~~ edit: this is not entirely true, there's specific macro cases that are much harder to represent.

What happens now?

This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.

asquared31415, Apr 05 '23 23:04

imho to_signed/to_unsigned are still useful because they convert to the appropriate type regardless of type width, this is useful for macros where you want to be generic over type width but not type signed-ness.

programmerjake, Apr 06 '23 00:04

Can you give a concrete example of code that you'd like to work with that? As far as I am aware, my proposed API is able to perform the same operations.

asquared31415, Apr 06 '23 01:04

https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=44ac01c70f01f2f78863120f3bf31c01

pub struct BitVec<UnsignedIntTy>(UnsignedIntTy);

pub trait ShiftOps {
    fn shl(self, amount: u32) -> Self;
    fn shr(self, amount: u32) -> Self;
    fn ashr(self, amount: u32) -> Self;
}

macro_rules! impl_shift_ops {
    ($t:ty) => {
        impl ShiftOps for BitVec<$t> {
            fn shl(self, amount: u32) -> Self {
                Self(self.0.wrapping_shl(amount))
            }
            fn shr(self, amount: u32) -> Self {
                Self(self.0.wrapping_shr(amount))
            }
            fn ashr(self, amount: u32) -> Self {
                Self(self.0.to_signed().wrapping_shr(amount).to_unsigned())
            }
        }
    };
}

impl_shift_ops!(u8);
impl_shift_ops!(u16);
impl_shift_ops!(u32);
impl_shift_ops!(u64);
impl_shift_ops!(u128);
impl_shift_ops!(usize);

programmerjake, Apr 06 '23 03:04
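
For comparison, a version of the `ashr` part that should compile on stable today is sketched below. It assumes the macro is handed the signed counterpart of each type explicitly, which is the extra burden that to_signed/to_unsigned would remove. The trait name `ArithShr` and the macro name `impl_ashr` are made up for this sketch.

```rust
pub struct BitVec<T>(pub T);

pub trait ArithShr {
    fn ashr(self, amount: u32) -> Self;
}

// The macro needs both the unsigned type and its signed counterpart,
// because `as` requires the target type to be named.
macro_rules! impl_ashr {
    ($($u:ty => $s:ty),+ $(,)?) => {
        $(impl ArithShr for BitVec<$u> {
            fn ashr(self, amount: u32) -> Self {
                // Same-width `as` casts only reinterpret the bits, so the
                // shift on the signed type fills with copies of the top bit.
                Self((self.0 as $s).wrapping_shr(amount) as $u)
            }
        })+
    };
}

impl_ashr!(u8 => i8, u16 => i16, u32 => i32, u64 => i64, u128 => i128, usize => isize);
```

If a pair were mistyped, for example `u32 => i16`, this would still compile and silently truncate, which is the kind of mistake a width-preserving method rules out.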

Two changes I would make to this proposal:

  • I would have there be a single extend method, which would always preserve the value. I don't see the utility of zero_extend for signed values wherein the value would change. If a user truly wants this, they can do (-1i8).cast::<u8>().extend::<u16>().cast::<i16>()
  • Identity conversions should always be allowed. This would be useful for macros.

The naming of cast isn't the greatest, but I don't have any immediate ideas.

jhpratt, Apr 13 '23 19:04

> The naming of cast isn't the greatest, but I don't have any immediate ideas.

The only thing I can think of is reinterpret, but you could split it into two functions depending on sign:

impl u8 {
    fn as_signed(self) -> i8;
}
impl i8 {
    fn as_unsigned(self) -> u8;
}

pitaj, Apr 13 '23 20:04

Pointers have cast_mut and cast_const to go back and forth between *const T and *mut T without the possibility of accidentally changing the pointee.

Having cast_signed and cast_unsigned for integers seems plausible, then.

And that same "this is just changing signedness not the width" sounds handy. Jacob's example above ((-1i8).cast::<u8>().extend::<u16>().cast::<i16>()) would then be .cast_unsigned().extend::<u16>().cast_signed(), without needing to repeat those extra types.

scottmcm, Apr 14 '23 06:04
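
A rough sketch of the cast_signed/cast_unsigned idea as extension traits follows. The traits `CastSigned` and `CastUnsigned`, their associated types, and the macro are assumptions made for illustration. If the toolchain in use already provides inherent methods with these names, those take precedence in method-call syntax and should give the same results. The hypothetical `extend` step is replaced by the existing `u16::from`.

```rust
// Hypothetical traits: each type names its same-width counterpart once,
// so callers never have to repeat it.
pub trait CastUnsigned {
    type Unsigned;
    fn cast_unsigned(self) -> Self::Unsigned;
}

pub trait CastSigned {
    type Signed;
    fn cast_signed(self) -> Self::Signed;
}

macro_rules! impl_sign_casts {
    ($($s:ty => $u:ty),+ $(,)?) => {
        $(
            impl CastUnsigned for $s {
                type Unsigned = $u;
                // Same width, so this only reinterprets the bits.
                fn cast_unsigned(self) -> $u { self as $u }
            }
            impl CastSigned for $u {
                type Signed = $s;
                fn cast_signed(self) -> $s { self as $s }
            }
        )+
    };
}

impl_sign_casts!(i8 => u8, i16 => u16, i32 => u32, i64 => u64, i128 => u128, isize => usize);

/// Zero extends an `i8` into an `i16` without naming any intermediate type
/// except the widening target.
pub fn zero_extend_i8_to_i16(x: i8) -> i16 {
    u16::from(x.cast_unsigned()).cast_signed()
}
```

Because the width is fixed by the trait impl, changing the input type of `zero_extend_i8_to_i16` cannot silently turn the sign change into a truncation or extension.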

I made an alternative RFC: Traits for lossy numeric conversions

I wasn't aware of this issue, but I would like to exchange ideas and improve my RFC if necessary/desired.

Aloso, Apr 14 '23 21:04

Several points to consider:

  1. zero_extend, sign_extend and truncate changing signedness should be perfectly fine, as they're operating on bit representations in memory. You wouldn't normally draw parallels between numeric formatting in this context. It's the implicit choice that makes as casts bad in this case. You could choose between src and dest to base extension on, and it isn't reflected anywhere in the code.
  2. cast::<T> having a generic is unnecessary. Introducing generics should be done when there is more than a single type that could be in place of them. I'd either leave it at cast or introduce methods such as to_{integer}. For a macro use case, it's possible to store intermediate value and annotate its type before attempting to cast.
  3. -1_i16.truncate::<i8>().cast::<u8>() in the example won't compile, as minus sign is not a part of literal, but applied to an entire expression. Not an issue with -1_i16 as i8 as u8 due to a lower precedence of as. Now imagine the same situation but with u16 => u8 => i8.
  4. Overall I'd say methods are still a bit too wordy. as is in this weird place where a couple of lints against common mistakes combined with more clear extension operators would solve most of its problems whilst retaining its convenience. As an alternative, there could be a sort of "typed as" operator that has all the powers of as, but limited by explicit block expression context (i.e. bitcasts, lossless conversions, etc.) and additionally provided arguments (zero/sign extend).

Kolsky, Mar 30 '24 10:03
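
The precedence issue in point 3 can be reproduced with existing methods. In the sketch below (function names invented), `abs` stands in for the proposed methods: a method call binds tighter than unary minus, while `as` binds looser.

```rust
/// Parsed as `-(1_i16.abs())`: the method runs before the negation.
fn negate_after_call() -> i16 {
    -1_i16.abs()
}

/// Parentheses make the literal negative before the method runs.
fn negate_before_call() -> i16 {
    (-1_i16).abs()
}

/// `as` has lower precedence than unary minus, so this is `(-1_i16) as i8 as u8`.
fn negate_before_as() -> u8 {
    -1_i16 as i8 as u8
}
```

By the same rule, `-1_i16.truncate::<i8>().cast::<u8>()` would try to negate a u8 and be rejected, whereas `(-1_i16).truncate::<i8>().cast::<u8>()` expresses the intended conversion.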

Apparently I forgot to leave a comment about this before. I released num-conv a few months ago with what I felt was a reasonable API based on the original issue and my proposed changes. I have been using it as a dependency of time since its release without issue. The only open question on my end is how to handle usize and isize.

jhpratt, Mar 30 '24 19:03

So I came up with a quick idea and implemented it in yabe. As it turned out, it's similar to num-conv, but it also remains terse. My personal gripes above primarily stem from the fact that writing this much code is going to be a punishing experience similar to using unsafe whereas bit twiddling should be the norm in systems programming. Though having the ability to write a hacky crate (arguably too magical for std) alleviates the problem.

In other words, the sketch proposed here may remain as-is as long as it doesn't outright forbid as casts, since it should be a stylistic preference and will be affected by the frequency of use.

Kolsky, Mar 31 '24 04:03

I think part of this got approved in https://github.com/rust-lang/libs-team/issues/359#issuecomment-2033209931 as cast_(un)signed?

Still worth continuing the discussion here for the other parts.

scottmcm, Apr 16 '24 19:04