modin
Support import of category type in DB Engine
For the following script:
import pandas as pd
import pyarrow as pa
import sys
sys.setdlopenflags( 1|256 ) # RTLD_LAZY+RTLD_GLOBAL
from dbe import PyDbEngine as PDE
pdb = PDE("data", 8874)
data = {
"id1": ["id1", "id2", "id3", "id1", "id2", "id3", "id1", "id2", "id3", "id1"],
}
df = pd.DataFrame(data)
df["id1"] = df["id1"].astype("category")
a = pa.Table.from_pandas(df)
pdb.consumeArrowTable('testtable', a)
we currently get:
terminate called after throwing an instance of 'std::runtime_error'
what(): dictionary<values=string, indices=int8, ordered=0> is not yet supported.
Aborted (core dumped)
This is the h2o benchmark case, reproducible on https://github.com/intel-go/omniscidb/commits/consuming_arrow_table
We need dictionary-encoded text, in OmniSci terms, since the h2o benchmarks group by those text columns.
It seems we need to remap the category type in Python from arrow::dictionary to arrow::utf8, because FSI does not support Arrow dictionaries and Arrow creates a separate dictionary for each chunk.
For now, when importing with Arrow, we can simply omit the conversion to category, since FSI treats strings as dictionary-encoded anyway.
Waiting for the Arrow 2.0 enhancements to check the import of category.
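As a rough illustration of omitting the conversion on the pandas side, the sketch below turns any category columns back into plain object columns before the frame is converted to Arrow. The helper name drop_category_dtype is hypothetical, and whether this fits the actual import path is an assumption.

```python
import pandas as pd


def drop_category_dtype(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `df` with category columns stored as plain objects.

    Hypothetical helper: non-category columns keep their dtype.
    """
    category_columns = df.select_dtypes(include="category").columns
    return df.astype({name: object for name in category_columns})


frame = pd.DataFrame({"id1": ["id1", "id2", "id3", "id1"], "v1": [1, 2, 3, 4]})
frame["id1"] = frame["id1"].astype("category")

plain_frame = drop_category_dtype(frame)
```

With string columns, pa.Table.from_pandas should no longer produce an arrow::dictionary column for id1.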
A NotImplementedError: unsupported type conversion error appears for:
df.astype("category")
@Garra1980 What is the progress on supporting categories?
@anmyachev we are waiting for a small fix from the OmniSci team. See https://github.com/modin-project/modin/issues/2747.
@Garra1980 the reproducer is outdated.
Note: currently the astype call works, but it defaults to pandas.
If we need native support for categories (for HDK), maybe it makes sense to make a new issue and close this one?
HDK engine is deprecated and will be removed in a future version.