vector Length in unboxed instances

Unbox instances for tuples (defined in unbox-tuple-instances) are backed by vectors of the form:

data instance Vector (a, b)
    = V_2 {-# UNPACK #-} !Int !(Vector a) !(Vector b)

This stores the length of each vector. Is there a fundamental reason why we need this length? It seems, naively, like it wastes an extra word as each vector already contains the length and an invariant of the representation should be that both lengths remain identical.

Jul 20 '21 00:07 Boarders

We could define

instance (Generic.Vector Unboxed.Vector a, Generic.Vector Unboxed.Vector b) Generic.Vector Unboxed.Vector (a,b) where
   basicLength (V_2 {- length field is gone -} vfst vsnd) = basicLength vfst

However what if we had Unboxed.Vector ((...((a10000,a9999),a9998),....,a1),a0)? It would be a BIG performance penalty for merely asking for a length. The reason for this weird iterated nested pair may be:

{-# LANGUAGE DataKinds, KindSignatures, TypeFamilies, DerivingStrategies #-}
data Nat = Zero | Succ !Nat
data family SnocLenList (a::Type) :: Nat -> Type
newtype instance SnocLenList a Zero = SLLNil ()
  deriving newtype (Generic.Mutable.MVector Unboxed.Mutable.MVector, Generic.Vector Unboxed.Vector, Unboxed.Unbox)
newtype instance SnocLenList a (Succ n) = SLLSnoc (SnocLenList a n,a)
  deriving newtype (Generic.Mutable.MVector Unboxed.Mutable.MVector, Generic.Vector Unboxed.Vector, Unboxed.Unbox)

Jul 20 '21 00:07 gksato

Is it a good trade off to optimize for type-level lisp in the library?

Jul 20 '21 00:07 Boarders

Good point. With many vectors around, those bytes might add up to a noticeable waste. How many nested vectors are needed for these extra bytes to be worth it?

The example with the type-level naturals could be rewritten using the type-level naturals from GHC.TypeLits, using natVal to get the length.

Jul 20 '21 00:07 Daniel-Diaz

My example is surely an extreme one and I neither think the library should take utmost care for this kind of extremity. However pointer indirection is always a small but non-ignorable time performance penalty, and the most frequently invoked operation would be length if a user doesn't love unsafe functions and C/C++-like nasal demons, i.e. undefined behavior of the whole program. On the other hand, extra one word for length is a space performance penalty, and we can't simply compare these two types of penalty, while they are connected through GC.

Jul 20 '21 02:07 gksato

I agree that it's problem but trying to get rid of length field in V_2 and friends is wrong approach IMO. Let take for example 3-tuple (A,B,C). Representation of corresponding unboxed vector is roughly:

data V_3 A B C = V_3 !Length
  (Length, Buffer A)
  (Length, Buffer B)
  (Length, Buffer C)

We have 4 copies of length! And removing length field from V_3 does little^ it only removes 1 out of 3. What we really want is to separate buffer from length so we can represent product as

data V_3 A B C = V_3 !Length
  (Buffer A)
  (Buffer B)
  (Buffer C)

Jul 20 '21 08:07 Shimuuar

What kind of structure would that be....

data family UnboxedBufWithOff a
data instance UnboxedBufWithOff Int = UBOInt {-# UNPACK #-} !Int {-#UNPACK #-} !ByteArray
data instance UnboxedBufWithOff Word32 = UBOWord32 {-# UNPACK #-} !Int {-# UNPACK #-} !ByteArray
data instance UnboxedBufWithOff (a,b,c) = UBOTup3 !(UnboxedBufWithOff a) !(UnboxedBufWithOff b) !(UnboxedBufWithOff c)

data Unboxed.Vector a = Unboxed {-# UNPACK #-} !Int !(UnboxedBufWithOff a)

This is an extra pointer indirection. Maybe we would want:

data family Unboxed.Vector a
data instance Unboxed.Vector Int = UnboxedInt {-# UNPACK #-} !Int {-# UNPACK #-} !(UnboxedBufWithOff Int)
data instance Unboxed.Vector Word32 = UnboxedWord32 {-# UNPACK #-} !Int {-# UNPACK #-} !(UnboxedBufWithOff Word32)
data instance Unboxed.Vector (a,b,c) = UnboxedTup3 {-# UNPACK #-} !Int {-# UNPACK #-} !(UnboxedBufWithOff (a,b,c))

This is a big incompatibility.

Jul 20 '21 09:07 gksato

Ah, I forgot to say this. Good point.

Jul 20 '21 09:07 gksato

Yes something along these lines. It is incompatible and it's not clear how it will interact with GND/DerivingVia but thankfully this sort of changes could be experimented with outside of vector

Event less compatible change I though about is to abandon data families in favor of something like:

data Vector a = Vector !Int (Buffer a)
type family Buffer a

We only have few variants for representation of elements: primitive vectors, products, and represent me as another type.

Jul 20 '21 09:07 Shimuuar

@Shimuuar Ah yes, that would be way better than removing the one length, thanks for the insight. I think it is definitely worth thinking about how it could be done better since currently it seems like a waste. If one were to change things to make this more transparent then I think products could already be done with generics which would probably clear up many issues with GND anyway.

Jul 20 '21 13:07 Boarders

@Shimuuar Yes, indeed removing the duplication of length for each sub-vector is the right optimization.

Jul 20 '21 13:07 Daniel-Diaz

@Shimuuar The construction is beautiful and simple. However can Buffer a be UNPACKed in Vector a that way? I mean, if we can't, it's ~~one~~ two word extra for non-product Vector: 4 words of

[Unboxed Vector tag] [Length Int#] [Offset Int#] [Pointer to ByteArray#]

becomes ~~5=2+3~~ 6=3+3 words of

Vector: [Unboxed Vector tag] [Length Int#] [Pointer to Buffer]
Buffer: [Buffer tag] [Offset Int#] [Pointer to ByteArray#]

We were talking about wasting a word in memory, right? Or did I misunderstand and were we talking about having duplicate storage of the same logical information?

Jul 22 '21 02:07 gksato

That's very good point! I think that unboxing is possible when buffer type is monomorphic although that should be checked.

Jul 22 '21 07:07 Shimuuar