datatable
Implement categorical columns
A categorical column is semantically equivalent to a string column, except that it stores integer codes in place of the string values themselves. The layout of such a column is therefore:
T values[n]; // array of indices into a dictionary
StringColumn<int32> dict; // "dictionary" column
(where T could be int8, int16 or int32).
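As a rough illustration of this layout, here is a minimal pure-Python sketch of dictionary encoding. It is not datatable's implementation: the function names `encode_categorical` and `decode_categorical` are hypothetical, the standard-library `array` module stands in for the `values` buffer, and missing values are not handled at all.

```python
from array import array


def encode_categorical(values):
    """Dictionary-encode a list of strings into (codes, categories).

    Hypothetical helper for illustration only; None/N/A is not handled.
    """
    categories = []   # the "dictionary": unique strings in first-seen order
    index_of = {}     # string -> integer code
    raw_codes = []
    for v in values:
        code = index_of.get(v)
        if code is None:
            code = len(categories)
            index_of[v] = code
            categories.append(v)
        raw_codes.append(code)

    # Pick the narrowest signed integer type that can hold the largest code.
    # 'b' is 1 byte and 'h' is 2 bytes; 'i' is 4 bytes on common platforms.
    n = len(categories)
    if n <= 2 ** 7:
        typecode = "b"
    elif n <= 2 ** 15:
        typecode = "h"
    else:
        typecode = "i"
    return array(typecode, raw_codes), categories


def decode_categorical(codes, categories):
    """Recover the original strings by looking each code up in the dictionary."""
    return [categories[c] for c in codes]
```

With only a handful of distinct strings, each row costs one byte for its code, and every distinct string is stored once in the dictionary.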
Tasks and operations we can support for categoricals are:
- [x] implement types `cat8`, `cat16` and `cat32`
- [x] implement `Categorical_ColumnImpl` internals
- [ ] think through and implement N/A handling
- [x] create a categorical column from a python list
- [x] convert a categorical column into a python list
- [x] display categorical columns in a terminal
- [x] element access through `[i, j]`
- [x] column access through `[:, j]`
- [x] slice access through `[i, :]`
- [x] access the list of categories
- [x] access the list of codes
- [x] statistics
Conversion:
- [ ] implement type casts to categorical columns
  - [ ] categoricals
  - [x] other types
- [x] implement type casts from categorical columns
- [ ] read/write categorical columns from/into Jay
- [x] write categorical columns to csv
- [ ] read categorical columns from csv (fread)
- [ ] convert categorical columns to numpy
- [ ] create categorical columns from numpy
- [ ] convert categorical column to pandas
- [ ] create categorical column from pandas
- [ ] convert categorical columns to pyarrow
- [ ] create categorical columns from pyarrow
Just a quick question: Rdatatable avoids categorical columns (in R they are called factors), partly because they slow down performance. I wonder whether the performance of pydatatable will be affected if categorical columns are introduced.
Hmm, interesting, I have not heard about that. Perhaps there are specific scenarios where factor variables become slower? Like, when the number of factor levels approaches the number of rows?
There are clearly situations where factors would be preferable. For example, if there are only a few distinct levels, this would speed up sorting and could also greatly reduce the required storage space.
Anyway, the categorical type should be an addition to, not a replacement for, the regular string type. So users will be able to choose whichever format better suits their needs.
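To make the sorting argument concrete, here is a small sketch under stated assumptions: the column is already dictionary-encoded as a list of integer codes plus a list of categories, and `sort_categorical` is a hypothetical name, not a datatable function. Only the k categories are compared as strings; the n rows are handled with a counting pass.

```python
def sort_categorical(codes, categories):
    """Return the column's values in sorted order using a counting sort.

    Hypothetical helper: string comparisons are done only on the k categories,
    so the total cost is O(k log k + n) rather than O(n log n) string compares.
    """
    # Rank the categories once.
    order = sorted(range(len(categories)), key=categories.__getitem__)

    # Count how many rows carry each code.
    counts = [0] * len(categories)
    for c in codes:
        counts[c] += 1

    # Emit each category, in sorted order, as many times as it occurs.
    out = []
    for c in order:
        out.extend([categories[c]] * counts[c])
    return out
```

The benefit shrinks as the number of categories approaches the number of rows, since the dictionary sort then dominates.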
@XiaomoWu the reason to avoid factors was not speed but problems with their levels when combining, filtering, or performing string operations like paste. Factors are faster than character vectors and can be processed in parallel, whereas the R global string cache used by character vectors is not thread-safe.
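The "levels when combining" problem can be illustrated with a short sketch. Two dictionary-encoded columns generally have different dictionaries, so the same code can mean different strings in each; appending one to the other requires merging the dictionaries and remapping the second column's codes. The function `concat_categorical` below is hypothetical and only shows the idea.

```python
def concat_categorical(codes_a, cats_a, codes_b, cats_b):
    """Append column B to column A, unifying their dictionaries.

    Hypothetical helper: returns (codes, categories) for the combined column.
    """
    merged = list(cats_a)
    index_of = {s: i for i, s in enumerate(merged)}

    # Build a translation table from B's codes to codes in the merged dictionary.
    remap = []
    for s in cats_b:
        if s not in index_of:
            index_of[s] = len(merged)
            merged.append(s)
        remap.append(index_of[s])

    # A's codes stay valid because A's categories keep their positions.
    combined = list(codes_a) + [remap[c] for c in codes_b]
    return combined, merged
```

Concatenating the raw codes without this remapping would silently relabel rows, which is the kind of pitfall the comment above describes.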
When implemented, this might make it possible to load the 1e9 data sets into db-benchmark for pandas; currently to_pandas() fails with OOM (afair). Having categoricals instead of objects could significantly reduce the memory footprint.
A recent attempt to optimise pandas read_csv failed; see https://github.com/h2oai/db-benchmark/issues/99