
Implement categorical columns

Open st-pasha opened this issue 6 years ago • 4 comments

A categorical column is semantically equivalent to a string column, except that it uses integer codes to store the values. The layout of such a column is therefore:

T values[n];  // array of indices into a dictionary
StringColumn<int32> dict;  // "dictionary" column

(where T could be int8, int16 or int32).
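To make the layout concrete, here is a minimal Python sketch of dictionary encoding. It is not datatable's implementation: the names `encode_categorical` and `decode_categorical` are hypothetical, the dictionary is a plain list rather than a string column, and missing values are not handled. The code width is chosen from the number of distinct values, mirroring the int8/int16/int32 choice for `T`.

```python
from array import array


def encode_categorical(strings):
    """Return (codes, dictionary) for a list of strings (hypothetical helper)."""
    dictionary = []   # unique values, in order of first appearance
    index_of = {}     # value -> its position in `dictionary`
    raw_codes = []
    for s in strings:
        code = index_of.get(s)
        if code is None:
            code = len(dictionary)
            index_of[s] = code
            dictionary.append(s)
        raw_codes.append(code)

    # Pick the narrowest signed integer type that can hold every code.
    # 'b' is 1 byte and 'h' is 2 bytes; 'i' is 4 bytes on common platforms.
    n_unique = len(dictionary)
    if n_unique <= 2**7:
        typecode = "b"
    elif n_unique <= 2**15:
        typecode = "h"
    else:
        typecode = "i"
    return array(typecode, raw_codes), dictionary


def decode_categorical(codes, dictionary):
    """Rebuild the original strings by looking each code up in the dictionary."""
    return [dictionary[c] for c in codes]


codes, dictionary = encode_categorical(["a", "b", "a", "c", "b"])
print(list(codes), dictionary)
```

Decoding is a plain lookup per row, which is why the column can stay semantically equivalent to a string column.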

Tasks and operations we can support for categoricals are:

Conversion:

  • [ ] implement type casts to categorical columns
  • [x] implement type casts from categorical columns
  • [ ] read/write categorical columns from/into Jay
  • [x] write categorical columns to csv
  • [ ] read categorical columns from csv (fread)
  • [ ] convert categorical columns to numpy
  • [ ] create categorical columns from numpy
  • [ ] convert categorical column to pandas
  • [ ] create categorical column from pandas
  • [ ] convert categorical columns to pyarrow
  • [ ] create categorical columns from pyarrow

st-pasha avatar Feb 27 '19 20:02 st-pasha

Just a quick question: Rdatatable avoids categorical columns (in R they're called factors) partly because they slow down performance. I wonder whether the performance of pydatatable will be affected if categorical columns are introduced.

XiaomoWu avatar May 02 '19 06:05 XiaomoWu

Hmm, interesting, I have not heard about that. Perhaps there are specific scenarios where factor variables become slower? Like when the number of factor levels approaches the number of rows?

There are clearly situations where factors would be preferable. For example, if there are only a few distinct values, this would speed up sorting and could also greatly reduce the required storage space.
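One way the sorting speed-up could work, as a rough sketch: with few distinct values, only the small dictionary needs string comparisons, and the rows can then be ordered with a counting sort over the integer codes. The function name `sort_categorical` is hypothetical, and this is an illustration of the idea rather than how datatable sorts.

```python
def sort_categorical(codes, dictionary):
    """Return the row order that sorts the column by string value.

    Sorting the k dictionary entries costs O(k log k); ordering the n rows
    is then a stable counting sort in O(n + k), with no string comparisons.
    """
    k = len(dictionary)

    # rank[c] = position of category c among the sorted dictionary entries
    sorted_cats = sorted(range(k), key=dictionary.__getitem__)
    rank = [0] * k
    for pos, c in enumerate(sorted_cats):
        rank[c] = pos

    # Count rows per rank, then turn the counts into starting offsets.
    offsets = [0] * (k + 1)
    for c in codes:
        offsets[rank[c] + 1] += 1
    for i in range(1, k + 1):
        offsets[i] += offsets[i - 1]

    # Place each row index at the next free slot for its rank.
    order = [0] * len(codes)
    for row, c in enumerate(codes):
        r = rank[c]
        order[offsets[r]] = row
        offsets[r] += 1
    return order


# Rows decode to: fig, pear, apple, pear, fig
print(sort_categorical([2, 0, 1, 0, 2], ["pear", "apple", "fig"]))
```

When the number of distinct values approaches the number of rows, the dictionary sort dominates and the advantage disappears.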

Anyway, the categorical type should be an addition to, not a replacement for, the regular string type, so users will be able to use whichever format better suits their needs.

st-pasha avatar May 02 '19 21:05 st-pasha

@XiaomoWu the reason to avoid factors was not speed but problems with their levels when combining, filtering, or performing string operations like paste. Factors are faster than character vectors and can be processed in parallel, while R's global cache for character strings is not thread-safe.
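To illustrate the kind of problem that levels cause when combining: two categorical columns generally have different dictionaries, so the same integer code can mean different strings, and the codes of one column must be remapped before the columns are appended. A minimal sketch, using a hypothetical `concat_categorical` helper and plain Python lists:

```python
def concat_categorical(codes_a, dict_a, codes_b, dict_b):
    """Append column B to column A, merging their dictionaries."""
    merged_dict = list(dict_a)
    index_of = {value: i for i, value in enumerate(merged_dict)}

    # remap[c] = code in the merged dictionary for B's original code c
    remap = []
    for value in dict_b:
        if value not in index_of:
            index_of[value] = len(merged_dict)
            merged_dict.append(value)
        remap.append(index_of[value])

    merged_codes = list(codes_a) + [remap[c] for c in codes_b]
    return merged_codes, merged_dict


# Code 0 means "x" in the first column but "y" in the second.
codes, levels = concat_categorical([0, 1, 0], ["x", "y"], [0, 1], ["y", "z"])
print(codes, levels)
```

Filtering has a related wrinkle: after rows are dropped, the dictionary may still hold values that no row refers to.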

jangorecki avatar May 03 '19 10:05 jangorecki

When implemented, this might allow reading the 1e9 data sets into db-benchmark for pandas; currently to_pandas() fails with OOM (afair). Having categoricals instead of objects could significantly reduce the memory footprint. A recent attempt to optimise pandas read_csv has failed, see https://github.com/h2oai/db-benchmark/issues/99
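A back-of-the-envelope sketch of the potential saving, under stated assumptions: the plain string column is modelled as packed UTF-8 bytes plus 32-bit offsets, and the categorical column as fixed-width codes plus a small dictionary stored the same way. The function `estimate_bytes` is hypothetical and this is not a measurement of pandas, whose object columns hold pointers to Python objects and typically cost more than this model suggests.

```python
def estimate_bytes(strings):
    """Return (plain_bytes, categorical_bytes) under the assumed layouts."""
    n = len(strings)
    unique = set(strings)
    k = len(unique)

    # Plain string column: character data plus n + 1 offsets of 4 bytes each.
    plain = sum(len(s.encode("utf-8")) for s in strings) + 4 * (n + 1)

    # Categorical column: one code per row plus the dictionary column.
    width = 1 if k <= 2**7 else 2 if k <= 2**15 else 4
    dict_bytes = sum(len(s.encode("utf-8")) for s in unique) + 4 * (k + 1)
    categorical = n * width + dict_bytes
    return plain, categorical


plain, categorical = estimate_bytes(["red", "green", "blue"] * 1000)
print(plain, categorical)
```

The saving shrinks as the number of distinct values grows, since the dictionary then approaches the size of the original column.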

jangorecki avatar Aug 24 '19 09:08 jangorecki