dataframe-api
Meaning of Column.offset?
Is its use similar to Arrow's, such that if you slice a string array, you still back it by the same buffers, but the offset and length of the column convey which part of the buffer should be used? If that is the case, this can always be 0 for numpy and primitive Arrow arrays (except for Arrow booleans, since they are bit-packed), since we can always slice them, right?
Alternative meaning could be that if you have a Column consisting of multiple chunks, the subset Column objects use the offset to indicate where in the parent Column they are?
(personally I don't really like that we use the same class for both ..)
The docstring seems to indicate it's indeed for chunks:
https://github.com/data-apis/dataframe-api/blob/27b8e1cb676bf10704d1dfc3dca0d0d806e2e802/protocol/dataframe_protocol.py#L119-L128
So, the simplest way to support chunking would be to always return the same buffer for a particular column, but different offset and length, right?
Ah, sorry I wasn't thinking about the case where your original data isn't chunked but you could return it in chunks.
Yeah, so that's indeed ambiguous in the spec: is the offset only informative for where the chunked Column fits in the full Column, or does it determine how to interpret the Buffer? Given that it says "may be > 0 if using chunks", it might actually be the second (your interpretation).
It was indeed meant for supporting an offset into a data buffer - this could be for chunking, or perhaps for other reasons like returning a subset of rows from the original dataframe/buffer and not wanting to create a new buffer.
> So, the simplest way to support chunking would be to always return the same buffer for a particular column, but different offset and length, right?
Yes indeed. Although in practice I think chunks are normally coming from different buffers, because if all data fits in a single buffer then chunking isn't necessary.
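To make the "same buffer, different offset and length" idea concrete, here is a minimal sketch using plain numpy. It does not use the protocol classes; `make_chunks` and `read_chunk` are hypothetical helper names, and the offset is assumed to be counted in elements rather than bytes.

```python
import numpy as np

# One shared data buffer backing the whole column (made-up example data).
buffer = np.arange(10, dtype=np.int64)


def make_chunks(n_rows, n_chunks):
    """Producer side: split n_rows into n_chunks (offset, length) pairs."""
    base, extra = divmod(n_rows, n_chunks)
    chunks, offset = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional row.
        length = base + (1 if i < extra else 0)
        chunks.append((offset, length))
        offset += length
    return chunks


def read_chunk(buf, offset, length):
    """Consumer side: interpret the shared buffer via offset and length."""
    return buf[offset:offset + length]


chunks = make_chunks(len(buffer), 3)
parts = [read_chunk(buffer, off, n) for off, n in chunks]
```

Every element of `parts` is a view on `buffer`, so no data is copied; only the offset and length differ per chunk.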
> Is its use similar to Arrow's, such that if you slice a string array, you still back it by the same buffers, but the offset and length of the column convey which part of the buffer should be used?

Same basic principle, but Column.offset is just a single value. Column.get_buffers returns an "offsets" buffer that's for variable-length data:
- "offsets": a two-element tuple whose first element is a buffer
containing the offset values for variable-size binary
data (e.g., variable-length strings) and whose second
element is the offsets buffer's associated dtype. None
if the data buffer does not have an associated offsets
buffer.
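As an illustration of how the single column offset and the "offsets" buffer could work together for strings, here is a rough sketch in plain numpy. It assumes an Arrow-style layout (n + 1 offsets into a UTF-8 data buffer) and an element-based column offset; `read_strings` is a hypothetical helper, not part of the protocol.

```python
import numpy as np

strings = ["foo", "", "barbaz", "hi"]
encoded = [s.encode("utf-8") for s in strings]

# Data buffer: all string bytes concatenated.
data = np.frombuffer(b"".join(encoded), dtype=np.uint8)

# Offsets buffer: n + 1 entries; string i spans data[offsets[i]:offsets[i + 1]].
offsets = np.zeros(len(strings) + 1, dtype=np.int64)
offsets[1:] = np.cumsum([len(b) for b in encoded])


def read_strings(data, offsets, col_offset, length):
    """Decode `length` strings starting at element `col_offset`."""
    out = []
    for i in range(col_offset, col_offset + length):
        start, stop = offsets[i], offsets[i + 1]
        out.append(data[start:stop].tobytes().decode("utf-8"))
    return out


# A "sliced" column of 2 rows starting at row 1, backed by the same buffers.
sliced = read_strings(data, offsets, col_offset=1, length=2)
```

Here the column offset selects which entries of the offsets buffer to use, and those entries in turn select byte ranges in the data buffer.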
> Alternative meaning could be that if you have a Column consisting of multiple chunks, the subset Column objects use the offset to indicate where in the parent Column they are?

That's not it; I hope the docstring is clear enough. If not, we should extend it.
> (personally I don't really like that we use the same class for both ..)
Agreed, it was a bit of a compromise between "I want one class per concept" and "I want as few classes as possible" opinions.
> Yeah, so that's indeed ambiguous in the spec: is the offset only informative for where the chunked Column fits in the full Column, or does it determine how to interpret the Buffer?
The latter.
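The bit-packed boolean case raised at the top of the thread shows why the offset has to affect how the buffer is interpreted: a slice that does not start on a byte boundary cannot be expressed as a view on the byte buffer. The sketch below uses numpy only; `read_bits` is a hypothetical helper, and it assumes the offset counts elements (bits here) and that bits are stored least-significant-bit first, as in Arrow.

```python
import numpy as np

values = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1], dtype=bool)

# Pack 10 booleans into 2 bytes, least-significant bit first.
packed = np.packbits(values, bitorder="little")


def read_bits(packed, offset, length):
    """Read `length` booleans starting at bit `offset` of a packed buffer."""
    bits = np.unpackbits(packed, bitorder="little")
    return bits[offset:offset + length].astype(bool)


# Rows 3..7 start in the middle of the first byte, so the consumer needs
# the offset to locate them; slicing the byte buffer alone is not enough.
subset = read_bits(packed, offset=3, length=5)
```

For numpy and primitive Arrow arrays the producer can slice the buffer itself and report an offset of 0, but for a layout like this one a non-zero offset is the only zero-copy option.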