datatable Implement categorical columns


A categorical column is semantically equivalent to a string column, except that it stores each value as an integer code that indexes into a dictionary of distinct strings. The layout of such a column is therefore:

T values[n];  // array of indices into a dictionary
StringColumn<int32> dict;  // "dictionary" column

(where T could be int8, int16 or int32).
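To make the layout concrete, here is a minimal Python sketch of the encoding. It is only an illustration, not datatable's implementation: the names `encode_categorical` and `decode_categorical` are hypothetical, the standard-library `array` module stands in for the `values[n]` buffer, and N/A values are deliberately not handled, since that item is still open in the list below.

```python
from array import array


def encode_categorical(values):
    """Split a list of strings into (codes, categories).

    Hypothetical helper for illustration; assumes no None/N/A values.
    """
    index = {}        # category string -> integer code
    categories = []   # the "dictionary", in order of first appearance
    codes = []
    for v in values:
        code = index.get(v)
        if code is None:
            code = len(categories)
            index[v] = code
            categories.append(v)
        codes.append(code)

    # Pick the narrowest signed width whose maximum can hold the largest code:
    # "b" is 1 byte, "h" is 2 bytes, "i" is typically 4 bytes.
    n = len(categories)
    typecode = "b" if n <= 2**7 else "h" if n <= 2**15 else "i"
    return array(typecode, codes), categories


def decode_categorical(codes, categories):
    """Rebuild the original list of strings from codes and categories."""
    return [categories[c] for c in codes]


codes, categories = encode_categorical(["red", "green", "red", "blue", "green"])
restored = decode_categorical(codes, categories)
```

With five rows and three distinct strings, the codes fit in a one-byte type, which corresponds to the `int8` case of `T` above.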

Tasks and operations we can support for categoricals are:

[x] implement types cat8, cat16 and cat32
[x] implement Categorical_ColumnImpl internals
[ ] think through and implement N/A handling
[x] create a categorical column from a python list
[x] convert a categorical column into a python list
[x] display categorical columns in a terminal
[x] element access through [i, j]
[x] column access through [:, j]
[x] slice access through [i, :]
[x] access the list of categories
[x] access the list of codes
[x] statistics

Conversion:

[ ] implement type casts to categorical columns
- [ ] categoricals
- [x] other types
[x] implement type casts from categorical columns
[ ] read/write categorical columns from/into Jay
[x] write categorical columns to csv
[ ] read categorical columns from csv (fread)
[ ] convert categorical columns to numpy
[ ] create categorical columns from numpy
[ ] convert categorical column to pandas
[ ] create categorical column from pandas
[ ] convert categorical columns to pyarrow
[ ] create categorical columns from pyarrow

Feb 27 '19 20:02 st-pasha

Just a quick question: Rdatatable avoids categorical columns (in R they are called factors), partly because they slow down performance. I wonder whether the performance of pydatatable will be affected by introducing categorical columns.

May 02 '19 06:05 XiaomoWu

Hmm, interesting, I have not heard about that. Perhaps, there are specific scenarios where the factor variables become slower? Like, when the number of factors approaches the number of rows?

There are clearly situations where factors would be preferable. For example, if there are only a few distinct values, this would speed up sorting and could also greatly reduce the required storage space.
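As a rough illustration of the sorting point, here is a sketch under the assumption that the column is stored as integer codes plus a list of categories. Only the k categories are compared as strings; the n rows are then placed with a counting sort. The function name `sort_by_category` is hypothetical and this is not how datatable sorts.

```python
def sort_by_category(codes, categories):
    """Return row indices ordered by category string (stable).

    Hypothetical helper: sorts the k categories once, then buckets n rows.
    """
    k = len(categories)

    # rank[c] = position of code c once the categories are sorted as strings
    order = sorted(range(k), key=lambda c: categories[c])
    rank = [0] * k
    for pos, c in enumerate(order):
        rank[c] = pos

    # Counting sort over ranks: count, prefix-sum, then place each row.
    counts = [0] * (k + 1)
    for c in codes:
        counts[rank[c] + 1] += 1
    for i in range(1, k + 1):
        counts[i] += counts[i - 1]

    result = [0] * len(codes)
    for row, c in enumerate(codes):
        r = rank[c]
        result[counts[r]] = row
        counts[r] += 1
    return result


row_order = sort_by_category([0, 1, 0, 2, 1], ["pear", "apple", "fig"])
```

The string comparisons cost O(k log k) and the row pass is O(n), so the benefit shrinks as the number of categories approaches the number of rows.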

Anyway, the categorical type should be an addition to, not a replacement for, the regular string type, so users will be able to choose whichever format better suits their needs.

May 02 '19 21:05 st-pasha

@XiaomoWu the reason to avoid factors was not speed but problems with their levels when combining, filtering, or performing string operations like paste. Factors are faster than character vectors and can be processed in parallel, whereas R's global string cache, which character vectors rely on, is not thread-safe.
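The level problem when combining can be sketched as follows: two categorical columns generally have different dictionaries, so the codes of the second one must be remapped onto a merged dictionary before appending. This is an assumed illustration in Python with a hypothetical `concat_categorical` helper, not code from R or datatable.

```python
def concat_categorical(codes_a, cats_a, codes_b, cats_b):
    """Append column b to column a, merging the two dictionaries.

    Hypothetical helper: a's codes stay valid, b's codes are remapped.
    """
    merged = list(cats_a)
    index = {cat: i for i, cat in enumerate(merged)}

    # remap[old_code_in_b] = code in the merged dictionary
    remap = []
    for cat in cats_b:
        if cat not in index:
            index[cat] = len(merged)
            merged.append(cat)
        remap.append(index[cat])

    combined = list(codes_a) + [remap[c] for c in codes_b]
    return combined, merged


combined, merged = concat_categorical([0, 1], ["x", "y"], [0, 1, 0], ["y", "z"])
```

Simply concatenating the raw codes would silently turn the second column's "y" into "x", which is the kind of mistake that makes levels awkward to work with.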

May 03 '19 10:05 jangorecki

Once implemented, this might allow loading the 1e9 datasets into db-benchmark for pandas; currently to_pandas() fails with OOM (as far as I remember). Having categoricals instead of object columns could significantly reduce the memory footprint. A recent attempt to optimise pandas read_csv failed, see https://github.com/h2oai/db-benchmark/issues/99

Aug 24 '19 09:08 jangorecki
