Null description in case of no missing value

Open iskode opened this issue 4 years ago • 4 comments

A validity mask is a missing-value representation that depends on the Column in the protocol. describe_null() is meant to describe missing values at the column level for a given dtype.

When there are no missing values, should we still provide a validity array with 1 (valid) at every entry, or should we raise an exception? From my perspective, the latter is better because we can just check null_count == 0 without allocating and filling a whole array with the same value. That is how it works in the cudf dataframe: if there are no missing values, accessing the attribute nullmask (which holds the validity array) raises an exception with the message "Column has no null mask".
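
To make the trade-off concrete, here is a minimal sketch of the "raise instead of allocating" option. ToyColumn, validity_or_none, and the list-based mask are hypothetical stand-ins, not the cudf or protocol API; only the error message mirrors the cudf behavior described above.

```python
from typing import List, Optional


class ToyColumn:
    """Hypothetical column; None entries stand for missing values."""

    def __init__(self, values: List[Optional[int]]):
        self._values = values

    @property
    def null_count(self) -> int:
        return sum(v is None for v in self._values)

    @property
    def nullmask(self) -> List[int]:
        # Refuse to build a mask when nothing is missing.
        if self.null_count == 0:
            raise RuntimeError("Column has no null mask")
        # 1 marks a valid entry, 0 a missing one.
        return [0 if v is None else 1 for v in self._values]


def validity_or_none(col: ToyColumn) -> Optional[List[int]]:
    # Cheap check first, so an all-ones array is never allocated.
    if col.null_count == 0:
        return None
    return col.nullmask
```

A caller that checks null_count first never triggers the exception and never pays for an all-valid mask.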

iskode avatar Aug 27 '21 15:08 iskode

If the native representation does not have a mask, either because there are no missing values or because it encodes them as nan in the data, then it should not be necessary to create an "all ones" buffer for it. However, rather than checking null_count, I believe this should come from describe_null(): if that says there's a bit mask or byte mask, then the mask should actually be there, regardless of whether there are zero or more null values.

An optimization path is still possible then right? The implementation can put the null_count check inside describe_null.
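
A rough sketch of the rule proposed in this comment, seen from the consumer side. NullKind, mask_expected, and check_consistency are hypothetical names; the integer codes are assumed to follow the protocol's describe_null docstring (0 non-nullable, 1 NaN/NaT, 2 sentinel, 3 bit mask, 4 byte mask).

```python
from enum import IntEnum


class NullKind(IntEnum):
    # Hypothetical names; integer codes assumed from the describe_null docstring.
    NON_NULLABLE = 0
    USE_NAN = 1
    USE_SENTINEL = 2
    USE_BITMASK = 3
    USE_BYTEMASK = 4


def mask_expected(null_kind: int) -> bool:
    """True when describe_null() promises a validity mask buffer."""
    return NullKind(null_kind) in (NullKind.USE_BITMASK, NullKind.USE_BYTEMASK)


def check_consistency(null_kind: int, validity_buffer) -> None:
    # Under this rule, a promised mask must be present
    # even if the column happens to contain zero nulls.
    if mask_expected(null_kind) and validity_buffer is None:
        raise ValueError("describe_null() promises a mask, but none was provided")
```

With this rule, a producer that wants to skip the mask for a null-free column has to report kind 0 from describe_null rather than silently omitting the buffer.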

rgommers avatar Sep 01 '21 14:09 rgommers

ok, thank you for your explanation.

> An optimization path is still possible then right? The implementation can put the null_count check inside describe_null.

I agree, so describe_null should behave as follows: if there are no missing values, we return null = 0, meaning the column is non-nullable and thus does not have any mask array. Otherwise, we return null = 3, as the bit mask is used universally for every dtype in cudf.

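def describe_null(self):
    if self.null_count == 0:
        # Kind 0: non-nullable, so there is no mask and no associated value.
        null = 0
        value = None
    else:
        # Kind 3: bit mask; a bit equal to 0 marks a missing entry.
        null = 3
        value = 0
    ...
    return null, value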

iskode avatar Sep 10 '21 12:09 iskode

Actually we had a conversation about this issue last night, and the preference was for having None be a valid return value when accessing the validity buffer in case no values are missing.
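
As an illustration of what that preference means for a consumer, here is a small sketch. It assumes a plain dict with a list-of-ints mask under the "validity" key, which is a simplification and not the protocol's actual buffer objects; to_validity_flags is a hypothetical helper.

```python
from typing import Dict, List, Optional


def to_validity_flags(
    buffers: Dict[str, Optional[List[int]]], length: int
) -> List[bool]:
    """Expand the validity entry into one boolean per row."""
    mask = buffers.get("validity")
    if mask is None:
        # No mask was supplied: treat every row as valid.
        return [True] * length
    return [bool(bit) for bit in mask]
```

The producer never allocates an all-ones mask; the consumer decides whether it needs a materialized one at all.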

rgommers avatar Sep 10 '21 12:09 rgommers

I think this is already the case when null == 0. From the pandas implementation, in _get_validity_buffer, we have:

elif null == 0:
    msg = "This column is non-nullable so does not have a mask"
else:
    raise NotImplementedError("See self.describe_null")
raise RuntimeError(msg)

And, in get_buffers, we have:

try:
    buffers["validity"] = self._get_validity_buffer()
except:
    buffers["validity"] = None

iskode avatar Sep 10 '21 14:09 iskode