Support import of category type in DB Engine

Open Garra1980 opened this issue 5 years ago • 7 comments

For the following script:

import pandas as pd
import pyarrow as pa
import sys
sys.setdlopenflags( 1|256 )    # RTLD_LAZY+RTLD_GLOBAL
from dbe import PyDbEngine as PDE

pdb = PDE("data", 8874)
data = {
    "id1": ["id1", "id2", "id3", "id1", "id2", "id3", "id1", "id2", "id3", "id1"],
}
df = pd.DataFrame(data)
df["id1"] = df["id1"].astype("category")
a = pa.Table.from_pandas(df)
pdb.consumeArrowTable('testtable', a)

we currently get this error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  dictionary<values=string, indices=int8, ordered=0> is not yet supported.
Aborted (core dumped)

This is the h2o benchmark case. It is reproducible on https://github.com/intel-go/omniscidb/commits/consuming_arrow_table

Garra1980 avatar Jul 08 '20 19:07 Garra1980

We need encoded text, in OmniSci terms, since the h2o benchmarks contain group-by operations on those text columns.

Garra1980 avatar Jul 09 '20 08:07 Garra1980

It seems like we need to remap the category type in Python from arrow::dictionary to arrow::utf8, because FSI does not support Arrow dictionaries and Arrow creates a separate dictionary for each chunk.

fexolm avatar Jul 09 '20 13:07 fexolm

For now, when importing with Arrow, we can simply omit the conversion to category, since FSI treats strings as dictionary-encoded anyway.

Garra1980 avatar Jul 13 '20 13:07 Garra1980

Waiting for the Arrow 2.0 enhancements before checking category import.

Garra1980 avatar Jan 28 '21 09:01 Garra1980

A "NotImplementedError: unsupported type conversion" error appears for:

df.astype("category")

@Garra1980 What progress has been made on supporting categories?

anmyachev avatar May 06 '21 13:05 anmyachev

@anmyachev we are waiting for a small fix from the OmniSci team. See https://github.com/modin-project/modin/issues/2747.

fexolm avatar May 06 '21 15:05 fexolm

@Garra1980 the reproducer is outdated.

Note: Currently the astype call works but defaults to pandas.

If we need native support for categories (for HDK), maybe it makes sense to open a new issue and close this one?

anmyachev avatar Oct 29 '23 17:10 anmyachev

HDK engine is deprecated and will be removed in a future version.

YarShev avatar May 16 '24 11:05 YarShev