[Format] Physical representation of columnar format not well documented
Describe the enhancement requested
Currently the columnar format is only documented at this page: https://arrow.apache.org/docs/format/Columnar.html. However, when I try to actually implement the format, I find the physical representation underdocumented.
In particular, the encoding of primitive types is unclear. The only information given is an example int32 layout; no other layouts are shown, so the representation of the other types is left open. How are booleans represented, for example? Do implementations choose their own representation? I suppose not, as that would defeat Arrow's goal.
I was pointed to https://github.com/apache/arrow/blob/main/format/Schema.fbs for reference. However, as far as I understand, this specification is only for the IPC schema. It includes specification of type information, but when it comes to physical representation, there's only struct Buffer with a length and offset.
I would like clear documentation of the memory layout of every type supported by Arrow. One example of such a specification is CTF, which provides not only the layouts of all types but also side-by-side examples of schema, layout, and values. Similar documentation would be immensely helpful for Arrow, especially if it showed the layouts of the various array types.
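To illustrate the kind of side-by-side example I mean, here is a minimal sketch for the one case the Columnar page does show, the int32 array `[1, null, 2, 4, 8]`. It uses only the Python standard library; the helper name `int32_layout` is hypothetical. It assumes little-endian values and an LSB-numbered validity bitmap as in that example, writes zeros into null slots (the page leaves those bytes unspecified), and omits buffer padding.

```python
import struct


def int32_layout(values):
    """Return (validity_bitmap, data_buffer) for a list of int or None.

    Hypothetical helper for illustration only: no padding is added,
    and null slots are filled with zero bytes as an arbitrary choice.
    """
    bitmap = bytearray((len(values) + 7) // 8)
    data = bytearray()
    for i, v in enumerate(values):
        if v is not None:
            # Bit i of the bitmap lives in byte i // 8, counted from the
            # least significant bit.
            bitmap[i // 8] |= 1 << (i % 8)
        # Each slot is a 4-byte little-endian signed integer.
        data += struct.pack("<i", 0 if v is None else v)
    return bytes(bitmap), bytes(data)


# Schema: int32 (nullable)   Values: [1, null, 2, 4, 8]
validity, data = int32_layout([1, None, 2, 4, 8])
print("validity:", f"{validity[0]:08b}")
print("data:    ", data.hex(" ", 4))
```

Having the schema, the logical values, and the resulting bytes next to each other like this for every type (booleans, variable-size binary, lists, structs, unions) is what I find missing.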
Component(s)
Format