
storing & exchange of categorical dtypes

Opened by rgommers

Categorical dtypes

xref gh-26 for some discussion on categorical dtypes.

What it looks like in different libraries

Pandas

The dtype is called category there. See pandas.Categorical docs:

>>> df = pd.DataFrame({"A": [1, 2, 5, 1]})
>>> df["B"] = df["A"].astype("category")

>>> df.dtypes
A       int64
B    category
dtype: object

>>> col = df['B']
>>> col.dtype
CategoricalDtype(categories=[1, 2, 5], ordered=False)

>>> col.values.ordered
False
>>> col.values.codes
array([0, 1, 2, 0], dtype=int8)
>>> col.values.categories
Int64Index([1, 2, 5], dtype='int64')
>>> col.values.categories.values
array([1, 2, 5])

Apache Arrow

The dtype is called "dictionary-encoded" in Arrow - so an array with a categorical dtype is called a "dictionary-encoded array" there. See https://arrow.apache.org/docs/format/CDataInterface.html#structure-definitions for details.

A practical example (from @kkraus14 in gh-38), for a categorical column of ['gold', 'bronze', 'silver', null, 'bronze', 'silver', 'gold'] with categories of ['gold' < 'silver' < 'bronze']:

categorical column: {
    mask_buffer: [119], # 01110111 in binary
    data_buffer: [0, 2, 1, 127, 2, 1, 0], # the 127 value in here is undefined since it's null
    children: [
        string column: {
            mask_buffer: None,
            offsets_buffer: [0, 4, 10, 16],
            data_buffer: [103, 111, 108, 100, 115, 105, 108, 118, 101, 114, 98, 114, 111, 110, 122, 101]
        }
    ]
}
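To make the layout concrete, here is a pure-Python sketch (no Arrow dependency; buffer contents taken verbatim from the example above) that decodes the validity bitmap and the codes back into values:

```python
# Buffers from the example above.
mask_buffer = [119]              # 0b01110111: bit i set => element i is valid
codes = [0, 2, 1, 127, 2, 1, 0]  # 127 is undefined since element 3 is null
categories = ['gold', 'silver', 'bronze']

def is_valid(mask, i):
    """Arrow validity bitmaps are LSB-ordered: bit (i % 8) of byte (i // 8)."""
    return bool(mask[i // 8] & (1 << (i % 8)))

decoded = [categories[c] if is_valid(mask_buffer, i) else None
           for i, c in enumerate(codes)]
print(decoded)  # ['gold', 'bronze', 'silver', None, 'bronze', 'silver', 'gold']
```

Note that bit 3 of the mask byte is clear, which is why the sentinel code 127 is never looked up.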
struct ArrowSchema {
  // Array type description
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;  // the categories
  ...
};

struct ArrowArray {
  // Array data description
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  ...
};

Also see https://arrow.apache.org/docs/python/data.html#dictionary-arrays for what PyArrow does - it matches the current exchange protocol more closely than the Arrow C Data Interface. E.g., it uses an actual Python dictionary for the mapping of values to categories.

Vaex

EDIT: Vaex's API here predates its Arrow integration, and will change to match Arrow in the future.

>>> import vaex
>>> df = vaex.from_arrays(year=[2012, 2015, 2019], weekday=[0, 4, 6])
>>> df = df.categorize('year', min_value=2020, max_value=2019)
>>> df = df.categorize('weekday', labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])

>>> df.dtypes
year       int64
weekday    int64
dtype: object
>>> df.is_category('year')
True
>>> df.is_category('weekday')
True
>>> df._categories
{'year': {'labels': [], 'N': 0, 'min_value': 2020}, 'weekday': {'labels': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], 'N': 7, 'min_value': 0}}

Other libraries

  • Modin follows Pandas
  • Dask follows Pandas
  • Koalas does not support categorical dtypes at all

Exchange protocol

This is the current form in gh-38 for the Pandas implementation of the exchange protocol:

>>> col = df.__dataframe__().get_column_by_name('B')
>>> col
<__main__._PandasColumn object at 0x7f0202973211>
>>> col.dtype  # kind, bitwidth, format-string, endianness
(23, 64, '|O08', '=')
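For reference, the kind `23` in the tuple above is the `CATEGORICAL` member of the prototype's `_DtypeKind` enum. A sketch of that enum (values as used in the gh-38 prototype; the integer codes are part of the protocol so consumers can switch on them):

```python
import enum

class _DtypeKind(enum.IntEnum):
    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20
    STRING = 21      # UTF-8
    DATETIME = 22
    CATEGORICAL = 23

# The first element of the dtype tuple identifies the kind.
kind, bitwidth, fmt, endianness = (23, 64, '|O08', '=')
print(_DtypeKind(kind))  # _DtypeKind.CATEGORICAL
```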

>>> col.describe_categorical  # is_ordered, is_dictionary, mapping
(False, True, {0: 1, 1: 2, 2: 5})

>>> col.describe_null  # kind (2 = sentinel value), value
(2, -1)
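Putting those three pieces together, a consumer-side sketch (the helper function is hypothetical, not part of the protocol) that maps codes back to values using the `describe_categorical` mapping and the sentinel from `describe_null`:

```python
def decode_categorical(codes, mapping, null_kind, null_value):
    """Map integer codes to category values, honoring a sentinel null.

    `mapping` is the {code: category} dict from describe_categorical;
    null kind 2 means "sentinel value" in the protocol.
    """
    if null_kind != 2:
        raise NotImplementedError("only sentinel-encoded nulls handled here")
    return [None if c == null_value else mapping[c] for c in codes]

# Codes for the pandas column above: values [1, 2, 5, 1] -> codes [0, 1, 2, 0]
print(decode_categorical([0, 1, 2, 0], {0: 1, 1: 2, 2: 5}, 2, -1))  # [1, 2, 5, 1]
```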

Changes needed & discussion points

What we already determined needs changing:

  1. Add get_children() method, and store the mapping that is now in Column.describe_categorical in a child column instead. Note that child columns are also needed for variable-length strings.
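One possible shape for this change, as a sketch; `get_children()` and the child-column class here are hypothetical names, not part of the current gh-38 API:

```python
class _CategoriesColumn:
    """Hypothetical child column holding the category values themselves."""
    def __init__(self, values):
        self._values = values

class _CategoricalColumn:
    def __init__(self, codes, categories):
        self._codes = codes
        # The mapping formerly returned by describe_categorical now lives
        # in a child column, mirroring how variable-length strings also
        # need a child column for their character data.
        self._children = [_CategoriesColumn(categories)]

    def get_children(self):
        return self._children

col = _CategoricalColumn(codes=[0, 1, 2, 0], categories=[1, 2, 5])
print(col.get_children()[0]._values)  # [1, 2, 5]
```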

To discuss:

  1. If dtype is the logical dtype for the column, where should we store how to interpret the actual data buffer? Right now this is not done in a static attribute, but by returning the dtype along with the buffer when accessing it:
    def get_data_buffer(self) -> Tuple[_PandasBuffer, _Dtype]:
        """
        Return the buffer containing the data.
        """
        _k = _DtypeKind
        if self.dtype[0] in (_k.INT, _k.UINT, _k.FLOAT, _k.BOOL):
            buffer = _PandasBuffer(self._col.to_numpy())
            dtype = self.dtype
        elif self.dtype[0] == _k.CATEGORICAL:
            codes = self._col.values.codes
            buffer = _PandasBuffer(codes)
            dtype = self._dtype_from_pandasdtype(codes.dtype)
        else:
            raise NotImplementedError(f"Data type {self._col.dtype} not handled yet")

        return buffer, dtype
  2. What goes in the data buffer on the column? The category-encoded data (the integer codes) makes sense, because the buffer needs to have the same number of elements as the column; otherwise it would be inconsistent with other dtypes.

    • What happens when the data is strings?
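For the strings case, the Arrow example above already shows the answer in miniature: the codes stay in the column's own data buffer, while the strings live in a child column described by an offsets buffer plus a UTF-8 data buffer. A sketch decoding that child column (buffer contents copied from the example):

```python
# Child string column buffers from the Arrow example above.
offsets_buffer = [0, 4, 10, 16]
data_buffer = bytes([103, 111, 108, 100, 115, 105, 108, 118, 101, 114,
                     98, 114, 111, 110, 122, 101])

# String i spans data_buffer[offsets[i]:offsets[i + 1]], UTF-8 encoded.
categories = [data_buffer[start:stop].decode('utf-8')
              for start, stop in zip(offsets_buffer, offsets_buffer[1:])]
print(categories)  # ['gold', 'silver', 'bronze']
```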

rgommers, Apr 08 '21 14:04