Meaning of Column.offset?

Open maartenbreddels opened this issue 3 years ago • 5 comments

Is its use similar to that in Arrow, where slicing a string array still backs the result with the same buffers, and the offset and length of the column convey which part of the buffer should be used? If so, it can always be 0 for NumPy and primitive Arrow arrays (except Arrow booleans, since they are bit-packed), because we can always slice those directly, right?
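
To illustrate the distinction drawn above, here is a minimal sketch in plain Python (no Arrow or NumPy involved). It assumes an Arrow-style bit layout with the least-significant bit first; `pack_bits` and `read_bits` are hypothetical helpers, not part of any library. A fixed-width buffer can be sliced into a view that starts at its own element 0, while a bit-packed buffer sliced at a non-byte boundary has to keep the original bytes and carry an offset.

```python
from array import array

# Primitive case: a zero-copy slice of a fixed-width buffer can start at any
# element, so the sliced view needs no extra offset.
values = array("q", [10, 20, 30, 40, 50])   # int64 data
view = memoryview(values)[2:4]              # shares memory with `values`
sliced_ints = view.tolist()


def pack_bits(flags):
    """Pack booleans into bytes, least-significant bit first (assumed layout)."""
    out = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def read_bits(buf, offset, length):
    """Read `length` booleans starting at bit position `offset` of `buf`."""
    result = []
    for i in range(length):
        bit = offset + i
        result.append(bool((buf[bit // 8] >> (bit % 8)) & 1))
    return result


# Bit-packed case: eight booleans share one byte, so a slice that starts at
# element 3 cannot be expressed as a byte-aligned view. The buffer is reused
# unchanged and the starting position travels with it as an offset.
flags = [True, False, False, True, True, False, True, False, True, True]
packed = pack_bits(flags)
sliced_flags = read_bits(packed, offset=3, length=4)
```

Here `sliced_flags` holds the same values as `flags[3:7]`, read from the unmodified `packed` bytes.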

maartenbreddels, Sep 16 '21 12:09

Alternative meaning could be that if you have a Column consisting of multiple chunks, the subset Column objects use the offset to indicate where in the parent Column they are?

(personally I don't really like that we use the same class for both ..)

jorisvandenbossche, Sep 16 '21 12:09

The docstring seems to indicate it's indeed for chunks:

https://github.com/data-apis/dataframe-api/blob/27b8e1cb676bf10704d1dfc3dca0d0d806e2e802/protocol/dataframe_protocol.py#L119-L128

jorisvandenbossche, Sep 16 '21 12:09

So, the simplest way to support chunking would be to always return the same buffer for a particular column, but different offset and length, right?

maartenbreddels, Sep 16 '21 12:09

Ah, sorry, I wasn't thinking about the case where your original data isn't chunked but you could return it in chunks.

Yeah, so that's indeed ambiguous in the spec: is the offset only informative for where the chunked Column fits in the full Column, or does it determine how to interpret the Buffer? Given that the docstring says "may be > 0 if using chunks", it might actually be the second (your interpretation).

jorisvandenbossche, Sep 16 '21 12:09

It was indeed meant for supporting an offset into a data buffer - this could be for chunking, or perhaps for other reasons like returning a subset of rows from the original dataframe/buffer and not wanting to create a new buffer.

> So, the simplest way to support chunking would be to always return the same buffer for a particular column, but different offset and length, right?

Yes indeed. Although in practice I think chunks are normally coming from different buffers, because if all data fits in a single buffer then chunking isn't necessary.
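
A minimal sketch of that "same buffer, different offset and length" approach follows. `SketchColumn` is a hypothetical stand-in rather than the protocol's actual `Column` class, and it assumes little-endian int64 data with the offset counted in elements, not bytes.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchColumn:
    """Hypothetical stand-in for a protocol Column over int64 data."""

    buffer: bytes   # shared data buffer, never copied
    offset: int     # index of the first element inside `buffer` (assumed unit)
    length: int     # number of elements in this column

    def get_chunks(self, n_chunks):
        """Yield up to `n_chunks` columns that reuse `buffer` unchanged."""
        if self.length == 0:
            return
        step = -(-self.length // n_chunks)  # ceiling division
        for start in range(0, self.length, step):
            yield SketchColumn(
                self.buffer,
                self.offset + start,
                min(step, self.length - start),
            )

    def to_list(self):
        """Interpret the buffer: skip `offset` elements, then read `length`."""
        return list(
            struct.unpack_from(f"<{self.length}q", self.buffer, self.offset * 8)
        )


data = struct.pack("<6q", 1, 2, 3, 4, 5, 6)
column = SketchColumn(data, offset=0, length=6)
chunks = list(column.get_chunks(2))
```

Each chunk points at the very same `data` object; only `offset` and `length` differ, and a consumer has to apply both to read the right elements.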

> Is its use similar to that in Arrow, where slicing a string array still backs the result with the same buffers, and the offset and length of the column convey which part of the buffer should be used?

Same basic principle, but Column.offset is just a single value. Column.get_buffers returns an "offsets" buffer that's for variable-length data:

            - "offsets": a two-element tuple whose first element is a buffer
                         containing the offset values for variable-size binary
                         data (e.g., variable-length strings) and whose second
                         element is the offsets buffer's associated dtype. None
                         if the data buffer does not have an associated offsets
                         buffer.
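
As a rough illustration of how an "offsets" buffer describes variable-length strings, the sketch below uses plain Python. It assumes an Arrow-style layout (one more offset entry than there are strings, each giving a byte position in the data buffer) and an int64 offsets dtype. `decode_strings` is a hypothetical helper, and the way it applies a column-level offset to the offsets buffer is an assumption, not something the docstring above states.

```python
import struct


def decode_strings(data, offsets, column_offset, length):
    """Decode `length` UTF-8 strings starting at logical row `column_offset`."""
    # Reading N strings needs N + 1 consecutive offset entries.
    bounds = struct.unpack_from(f"<{length + 1}q", offsets, column_offset * 8)
    return [
        data[start:stop].decode("utf-8")
        for start, stop in zip(bounds, bounds[1:])
    ]


# Build a data buffer and its offsets buffer for a few sample strings.
words = ["a", "bcd", "", "éf"]
encoded = [w.encode("utf-8") for w in words]
data = b"".join(encoded)

positions = [0]
for piece in encoded:
    positions.append(positions[-1] + len(piece))
offsets = struct.pack(f"<{len(positions)}q", *positions)

all_rows = decode_strings(data, offsets, column_offset=0, length=4)
middle_rows = decode_strings(data, offsets, column_offset=1, length=2)
```

Both calls read from the same `data` and `offsets` buffers; the second simply starts one row later and reads two rows.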

> Alternative meaning could be that if you have a Column consisting of multiple chunks, the subset Column objects use the offset to indicate where in the parent Column they are?

That's not it; I hope the docstring is clear enough. If not, we should extend it.

> (personally I don't really like that we use the same class for both ..)

Agreed, it was a bit of a compromise between "I want one class per concept" and "I want as few classes as possible" opinions.

> Yeah, so that's indeed ambiguous in the spec: is the offset only informative for where the chunked Column fits in the full Column, or does it determine how to interpret the Buffer?

The latter.

rgommers, Sep 22 '21 11:09