icu4x icon indicating copy to clipboard operation
icu4x copied to clipboard

Add CompactOptionULE

Open sffc opened this issue 3 years ago • 16 comments

OptionULE, as suggested in #1723, has an inefficient bit pattern. It would be nice to have an OptionULE type that could store a more efficient representation.

My suggestion was to introduce one of the following unsafe traits:

  • NonZeroULE: a ULE type that is never zero (all 0s)
  • NonMaxULE: a ULE type that is never the max integer value (all 1s)
  • NoHighBitULE: a ULE type where the high bit is never used

Let's take NonMaxULE. We would pair this with NonMaxOptionULE, which would use the anointed bit pattern to represent None.

I like NonMaxULE because it works for both TinyAsciiStr as well as enums that have fewer than 255 variants.

@Manishearth had some more ideas which he may write in a reply below.

sffc avatar Mar 24 '22 02:03 sffc

Overall I'm not much in favor of such a system that only selects one bit pattern as a niche, nor really having multiple types. Rust's uncustomizeable NonZero has been a source of much frustration in this space and I'd rather not replicate it.

I think a good path forward would be to have something like

unsafe trait InvalidBytes {
   const INVALID_BIT_PATTERN: &[u8]; // required to have len == sizeof(self)
}

Even better, we could have the bit pattern be specified by:

enum InvalidBitPattern {
    MaxBytes.
    ZeroBytes,
    Bytes(&[u8])
    Sequence(&[&InvalidBitPattern])
}

which we can propagate in the custom derive (have an attribute saying that a field implements InvalidBitPattern, automatically makes type implement it too using Sequence)

Manishearth avatar Mar 28 '22 15:03 Manishearth

@Manishearth What is the size for this issue? If it is small, we could do it in 1.0, but if it's nontrivial, we should sit on it.

sffc avatar May 26 '22 23:05 sffc

Nontrivial since I think there's some work involved in getting a good design, and implementing it.

Manishearth avatar May 26 '22 23:05 Manishearth

OK, let's put it in 1.1 then.

Is this a good first/second issue?

sffc avatar May 27 '22 16:05 sffc

good second issue for someone with a strong rust background (comfort in rust code), I think.

Manishearth avatar May 27 '22 19:05 Manishearth

@pdogr I think there's time for ICU4X 1.0 to optimize the storage at least for the ULE form of Option TinyAsciiStr. @robertbastian Could you advise on this?

sffc avatar Aug 22 '22 23:08 sffc

Note that the all-1 bit-pattern is actually not the maximum value for signed integers.

robertbastian avatar Aug 31 '22 10:08 robertbastian

After discussing with @robertbastian we came up with an initial design but are kinda out of ideas.

The idea was to have a trait NicheBytes which specified the bytes to use as niche.

/// N == core::mem::sizeo_of::<Self>()
pub trait NicheBytes<const N: usize> {
    const INVALID_BIT_PATTERN: [u8; N];
}

The ULE type for Option<U> in this case would be a union NichedOptionULE<U, N>.

#[repr(packed)]
pub union NichedOptionULE<U: NicheBytes<N> + ULE, const N: usize> {
    invalid: [u8; N],
    valid: U,
}

The issue is that currently there is a blanket impl for Option<T> which defaults to using OptionULE<T::ULE>.

impl<T: AsULE> AsULE for Option<T> {
    type ULE = OptionULE<T::ULE>;
}

To get around this we put this blanket impl behind a marker trait NoNiche

pub trait NoNiche {}
impl<T: AsULE> AsULE for Option<T>
where
    T::ULE: NoNiche,
{
   type ULE = OptionULE<T::ULE>;
}

But now the compiler complains whenever we try to have a specific AsULE impl for Option<T> saying that there are conflicting implementations for AsULE. Probably the compiler can't guarantee that these traits are mutually exclusive without using negative trait bounds.

pdogr avatar Sep 02 '22 09:09 pdogr

At least for now, I think the NicheOptionULE should show up in ZeroVec bounds. The blanket impl on Option should stay, at least for now (but we can start migrating ICU4X users over to the new model).

NicheOptionULE should have a function like get that returns a regular option, which you have to invoke after getting the value out.

sffc avatar Sep 02 '22 15:09 sffc

Raw ULE types don't show up in ZeroVec bounds though. Do you mean a newtype NichedOption and its associated ULE type NichedOptionULE?

I don't think that "migrating ICU4X users over to the new model" is something we can do. We cannot make the ULE representation depend on whether there's a niche, so we'll need both OptionULE and NichedOptionULE (and because of negative trait limitations two non-ULE types) going forward.

robertbastian avatar Sep 02 '22 15:09 robertbastian

Types that are their own ULE show up in ZeroVec bounds, like TinyAsciiStr. I think it's not unreasonable for NichedOptionULE to be parameterized on AsULE instead of ULE and then be its own ULE type.

However, I'm warming up to the idea that we should make NichedOption to feed into NichedOptionULE.

In ICU4X, we should use NichedOption in new data structs, and when making new versions of old ones, and if this lands in the next week or so, migrating existing users over to NichedOption before 1.0 Final.

After NichedOption lands, we can discuss whether to remap the Option blanket impl, and introduce like BloatedOption for the old byte layout.

sffc avatar Sep 02 '22 15:09 sffc

Yeah, I think the nicest solution is NichedOption that has a Deref to regular Option.

sffc avatar Sep 02 '22 15:09 sffc

So something like

/// N == core::mem::sizeo_of::<Self>()
pub trait NicheBytes<const N: usize>: ULE {
    const INVALID_BIT_PATTERN: [u8; N];
}

pub struct NichedOption<T: AsULE>(Option<T>);

impl<T> Deref for NichedOption<T> {
  type Target = Option<T>;
}

#[repr(packed)]
pub union NichedOptionULE<U: NicheBytes<N>, const N: usize> {
    invalid: [u8; N],
    valid: U,
}

impl<T: AsULE, const N: usize> AsULE for NichedOption<T>
  where T::ULE: NichedBytes<N>
{
    type ULE = NichedOptionULE<T::ULE>;
    
    // ...
}

and usage would be

impl<const N: usize> NicheBytes<N> for TinyAsciiStr<N> {
  const INVLID_BIT_PATTERN: [u8; N] = [0xFF; N];
}


pub struct Foo<'a> {
  bar: ZeroVec<'a, NicheOption<TinyAsciiStr>,
}

robertbastian avatar Sep 02 '22 15:09 robertbastian

@pdogr yes, you will not be able to have multiple impls, it will have to be a separate type unless we want to completely replace the impl (which we can't). I think that's fine.

Note that you can get around needing the const N if you do the &[u8] thing I suggested instead. The trait has to be unsafe either way so requiring implementors to get the length right doesn't seem too bad.

Manishearth avatar Sep 02 '22 16:09 Manishearth

The union cannot be defined without N, as <U as InvalidBytes>::INVALID_BIT_PATTERN.len() is not evaluable.

robertbastian avatar Sep 02 '22 16:09 robertbastian

I also think it's a nice hint to the implementor.

robertbastian avatar Sep 02 '22 16:09 robertbastian

Fixed by https://github.com/unicode-org/icu4x/pull/2501

Manishearth avatar Feb 17 '23 22:02 Manishearth