dataframe-api
Columns with bit/bytemask null representation should be able to return None for the validity buffer when there are no missing values
I am currently working on the implementation of the dataframe interchange protocol for PyArrow. After testing the current PyArrow implementation for producing a __dataframe__ object against the pandas implementation for consuming it, I noticed that columns that use a bit/bytemask null representation but have no missing values raise an error.
The reason is that Apache Arrow does not create a mask buffer when there are no missing values present. Therefore, the result of calling .get_buffers()["validity"] on a PyArrow __dataframe__ object without missing values is None, which is currently not handled by the protocol specification. See:
https://github.com/pandas-dev/pandas/blob/5c66e65d7b9fef47ccb585ce2fd0b3ea18dc82ea/pandas/core/interchange/from_dataframe.py#L502
For now, we check for columns without missing values and, in that case, describe the column as non-nullable. But we think there should be an option for nullable columns with a bit/bytemask null representation to return None instead of a buffer.
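To make the two options concrete, here is a rough sketch in plain Python. It is not taken from PyArrow or pandas: the function names are hypothetical, the validity buffer is simplified to a bytes object (the protocol actually returns a buffer together with its dtype), and the bit order assumes Arrow's least-significant-bit-first layout where a 0 bit marks a missing value.

```python
from enum import IntEnum
from typing import List, Optional, Tuple


class ColumnNullType(IntEnum):
    # Values mirror the null-kind enum of the interchange protocol.
    NON_NULLABLE = 0
    USE_NAN = 1
    USE_SENTINEL = 2
    USE_BITMASK = 3
    USE_BYTEMASK = 4


def describe_null_workaround(null_count: int) -> Tuple[ColumnNullType, Optional[int]]:
    """Hypothetical producer-side workaround: no nulls means 'non-nullable'."""
    if null_count == 0:
        return ColumnNullType.NON_NULLABLE, None
    # Assumption: Arrow convention, a 0 bit marks a missing value.
    return ColumnNullType.USE_BITMASK, 0


def validity_mask(validity: Optional[bytes], length: int, null_bit: int = 0) -> List[bool]:
    """Hypothetical consumer-side handling of the proposed behaviour.

    A validity of None is read as "no mask was allocated", so every
    element is treated as valid.
    """
    if validity is None:
        return [True] * length
    # Decode an LSB-first bitmask: element i lives in byte i // 8, bit i % 8.
    return [((validity[i // 8] >> (i % 8)) & 1) != null_bit for i in range(length)]
```

With the workaround the column's null kind changes depending on the data; with the proposal the column stays USE_BITMASK and the consumer only needs the extra None branch shown in validity_mask.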
If a column ultimately doesn't have a mask when there are no missing values, I'm wondering if that's just fine? It may even be incorrect to describe an interchange column as having a bit/byte-mask when it doesn't have one.
For onlookers, the relevant docs for what buf, dtype = Column.get_buffers()["validity"] should currently contain:
https://github.com/data-apis/dataframe-api/blob/aa6fe7d7bc4fd6fd24b8dd6b4dfb8c58cac2d8b9/protocol/dataframe_protocol.py#L353-L357
> For now we are checking for columns without missing values and in that case describe that column as non-nullable. But we think there should be an option for nullable columns with bit/bytemasks null representation to return None instead of a buffer.

> If a column ultimately doesn't have a mask when there are no missing values, I'm wondering if that's just fine? Like even it may be incorrect to describe an interchange column as having a bit/byte-mask when it doesn't have a bit/byte-mask.
That's certainly a possible solution, but I personally find that it feels a bit wrong. The column is nullable, in the sense that it "can" have nulls (that's typically how "nullable" is interpreted, I think). The null count just happens to be 0, in which case Arrow can optimize this by not allocating the bitmask. Also, for a datetime64 column, you probably won't change the null type from USE_SENTINEL to NON_NULLABLE if there are no nulls (NaT) present (although of course here it has no impact on the memory layout).
One corner case where this fallback to non-nullable doesn't necessarily work optimally is that a column can have multiple chunks, and in pyarrow, one chunk might have a null bitmap, and a next chunk might not have one.
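A small sketch of that chunk corner case, under simplifying assumptions: each chunk is modelled as a hypothetical (length, bitmap-or-None) tuple rather than a real PyArrow chunk, and the helper names are made up for illustration. It shows what a producer would have to do today if the null kind must be the same for the whole column: allocate an all-valid bitmap for every chunk that has none.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for a chunk: (number of elements, validity bitmap or None).
Chunk = Tuple[int, Optional[bytes]]


def column_needs_bitmask(chunks: List[Chunk]) -> bool:
    """True if at least one chunk carries a validity bitmap."""
    return any(bitmap is not None for _, bitmap in chunks)


def all_valid_bitmap(length: int) -> bytes:
    """Allocate a bitmap with every bit set (1 = valid in the Arrow convention).

    Bits past `length` in the last byte are padding and are not meant to be read.
    """
    return b"\xff" * ((length + 7) // 8)


def normalize_chunks(chunks: List[Chunk]) -> List[Chunk]:
    """Give every chunk a bitmap if any chunk has one; otherwise leave them as-is."""
    if not column_needs_bitmask(chunks):
        return list(chunks)
    return [
        (length, bitmap if bitmap is not None else all_valid_bitmap(length))
        for length, bitmap in chunks
    ]
```

If None were an allowed validity result, the normalize_chunks step (and its extra allocation) would not be needed: the column could be described as USE_BITMASK and mask-less chunks could simply return None.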