
Add a "bytes" DType

Open gatesn opened this issue 1 year ago • 4 comments

Since dtype is a logical type, we should distinguish between a uint8 and a byte (with an underlying u8 ptype).

This will allow us to apply different compression strategies, e.g. there's not much point trying bitpacking on an arbitrary byte array.

gatesn avatar Feb 06 '24 10:02 gatesn
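To illustrate why bitpacking rarely pays off on arbitrary bytes, here is a minimal sketch. It is not Vortex code; `min_bit_width` is a hypothetical helper that reports how many bits per value a bitpacker would need for a `u8` slice.

```rust
/// Hypothetical helper (not part of Vortex): the minimum number of bits
/// needed to represent every value in `values` without loss.
fn min_bit_width(values: &[u8]) -> u32 {
    // An empty slice is treated as needing zero bits.
    let max = values.iter().copied().max().unwrap_or(0);
    // `leading_zeros` counts the unused high bits of the largest value.
    u8::BITS - max.leading_zeros()
}

/// Fraction of the original size left after packing to `min_bit_width`.
fn packed_ratio(values: &[u8]) -> f64 {
    f64::from(min_bit_width(values)) / f64::from(u8::BITS)
}
```

Small integers such as `[0, 3, 1, 2]` need only 2 bits each, so packing shrinks them to a quarter of their size. An arbitrary byte array almost always contains a value with the high bit set, so it needs all 8 bits and packing saves nothing.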

Interesting. I thought the encoding provides the differentiation. An arbitrary byte array will be some kind of varbin, so varbin, when recursing on its children, will know what kinds of compression make sense.

robert3005 avatar Feb 06 '24 10:02 robert3005

Encoding == physical, dtype == logical.

By the same argument, you could say we should have Timestamp encoding and not a Timestamp dtype since it's just an int64 underneath. But I'm not sure that would be right?

Same for UTF8

gatesn avatar Feb 06 '24 10:02 gatesn
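The logical/physical split being argued here can be sketched as follows. The enum and function names are hypothetical, chosen for illustration only, and are not the Vortex API.

```rust
/// Hypothetical physical types (how the values are laid out in memory).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PType {
    U8,
    I64,
}

/// Hypothetical logical types (what the values mean).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LogicalType {
    UInt8,
    Byte,
    Int64,
    Timestamp,
}

/// Several logical types can share one physical representation.
fn physical_type(logical: LogicalType) -> PType {
    match logical {
        LogicalType::UInt8 | LogicalType::Byte => PType::U8,
        LogicalType::Int64 | LogicalType::Timestamp => PType::I64,
    }
}
```

In this sketch, `Byte` and `UInt8` are distinct logical types over the same `U8` storage, just as `Timestamp` and `Int64` share `I64`. That is the analogy this comment draws.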

I don’t think the same reasoning leads to the result you arrived at. An arbitrary binary type already has a different logical type, same for timestamp. If you’d like to apply the same logic, you’d have to add seconds/minutes/hours and so on as dtypes, which I’m not advocating for. Physical encoding shouldn’t leak into the logical layer imho

robert3005 avatar Feb 06 '24 10:02 robert3005

Hmmm, yeah there's a difference between an array of binary values, VarBin(Bytes), and a binary array.

Ok, yep, let's instead have the encodings configure the compression ctx based on what they know, e.g. the GCD of 3600 we mentioned.

gatesn avatar Feb 06 '24 10:02 gatesn
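A minimal sketch of the GCD idea, assuming "the GCD of 3600" refers to values in seconds that are all whole hours. The helper names are hypothetical and this is not Vortex code. Factoring out the common divisor leaves much smaller integers for a downstream compressor.

```rust
/// Euclid's algorithm; gcd(0, x) == x, which makes 0 a neutral fold seed.
fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a
}

/// Greatest common divisor of all values (0 if empty or all zeros).
fn common_factor(values: &[u64]) -> u64 {
    values.iter().copied().fold(0, gcd)
}

/// Split values into a shared factor and the reduced quotients.
/// Multiplying each quotient by the factor restores the original value.
fn factor_out(values: &[u64]) -> (u64, Vec<u64>) {
    // Guard against dividing by zero when there is no common factor.
    let factor = common_factor(values).max(1);
    (factor, values.iter().map(|v| v / factor).collect())
}
```

For example, `factor_out(&[3600, 7200, 18000])` gives a factor of 3600 and quotients `[1, 2, 5]`, which fit in 3 bits rather than 15. An encoding that knows its values share such a factor could pass that hint to the compression context instead of requiring a new dtype.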