modin
Map modin category type to arrow dictionary type in omnisci backend
Now that OmniSci finally supports Arrow 2.0, we could explicitly map Modin's category type to Arrow's dictionary type. This would allow us to distinguish those types in FSI.
https://github.com/intel-ai/omniscidb/tree/modin_test could be used as a source for an OmniSci backend with Arrow 2.0 support.
I've implemented none-encoded string support in the ArrowResultSetConverter, so it shouldn't be a blocker anymore.
The current intel-ai/omniscidb/modin_cats branch still has problems with none-encoded strings. When a string of this type occurs in the result set, the following exception may be thrown:
Exception: Columnar conversion not supported for variable length types
It's thrown from here and can be reproduced with this code:
import pyarrow as pa
import sys

# Remember the current dlopen flags so they can be restored later if needed.
prev = sys.getdlopenflags()
# Load the engine's shared library with its symbols visible globally.
sys.setdlopenflags(1 | 256)  # RTLD_LAZY + RTLD_GLOBAL
from dbe import PyDbEngine

# A two-column table: one none-encoded string column and one int32 column.
at = pa.Table.from_pydict(
    {"col1": ["str12", "str2", "str3"], "col2": [1, 2, 3]},
    schema=pa.schema({"col1": pa.string(), "col2": pa.int32()}),
)

server = PyDbEngine()
server.importArrowTable("test_name", at)

print(server.select_df("SELECT * FROM test_name ORDER BY col2"))  # OK
print(server.select_df("SELECT col1 FROM test_name ORDER BY col2"))  # RuntimeError
The same queries in relational algebra notation:
Calcite RA queries
query 1
query_src: SELECT * FROM test_name ORDER BY col2
query_ra: {
  "rels": [
    {
      "id": "0",
      "relOp": "LogicalTableScan",
      "fieldNames": ["col1", "col2", "rowid"],
      "table": ["omnisci", "test_name"],
      "inputs": []
    },
    {
      "id": "1",
      "relOp": "LogicalProject",
      "fields": ["col1", "col2"],
      "exprs": [
        {"input": 0},
        {"input": 1}
      ]
    },
    {
      "id": "2",
      "relOp": "LogicalSort",
      "collation": [
        {
          "field": 1,
          "direction": "ASCENDING",
          "nulls": "LAST"
        }
      ]
    }
  ]
}
query 2
query_src: SELECT col1 FROM test_name ORDER BY col2
query_ra: {
  "rels": [
    {
      "id": "0",
      "relOp": "LogicalTableScan",
      "fieldNames": ["col1", "col2", "rowid"],
      "table": ["omnisci", "test_name"],
      "inputs": []
    },
    {
      "id": "1",
      "relOp": "LogicalProject",
      "fields": ["col1", "col2"],
      "exprs": [
        {"input": 0},
        {"input": 1}
      ]
    },
    {
      "id": "2",
      "relOp": "LogicalSort",
      "collation": [
        {
          "field": 1,
          "direction": "ASCENDING",
          "nulls": "LAST"
        }
      ]
    },
    {
      "id": "3",
      "relOp": "LogicalProject",
      "fields": ["col1"],
      "exprs": [
        {"input": 0}
      ]
    }
  ]
}
As the Calcite JSON shows, the exception is triggered by the projection applied to the sort result (the final LogicalProject after LogicalSort in query 2). Other cases in which this exception may be thrown have not been investigated yet.
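To make the difference between the two plans concrete, the sketch below scans a Calcite RA plan for a LogicalProject that directly follows a LogicalSort. The `projects_after_sort` helper is hypothetical and not part of Modin or OmniSci. It assumes that a node without an explicit "inputs" list consumes the node right before it, as in the plans above, and the plans are trimmed to the fields the check needs.

```python
def projects_after_sort(query_ra):
    """Return ids of LogicalProject nodes that directly follow a LogicalSort.

    Assumes every node consumes the node listed right before it.
    """
    rels = query_ra["rels"]
    hits = []
    for prev, node in zip(rels, rels[1:]):
        if prev["relOp"] == "LogicalSort" and node["relOp"] == "LogicalProject":
            hits.append(node["id"])
    return hits


# Trimmed versions of the two plans, keeping only "id" and "relOp".
query_1 = {
    "rels": [
        {"id": "0", "relOp": "LogicalTableScan"},
        {"id": "1", "relOp": "LogicalProject"},
        {"id": "2", "relOp": "LogicalSort"},
    ]
}
query_2 = {
    "rels": [
        {"id": "0", "relOp": "LogicalTableScan"},
        {"id": "1", "relOp": "LogicalProject"},
        {"id": "2", "relOp": "LogicalSort"},
        {"id": "3", "relOp": "LogicalProject"},
    ]
}

print(projects_after_sort(query_1))  # []
print(projects_after_sort(query_2))  # ['3']
```

Only the failing query contains such a node, which matches the observation above. Whether every plan of this shape fails is not established here.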
So far it looks like the problem is on the OmniSci side; they are looking into it.
Any updates here?
HDK engine is deprecated and will be removed in a future version.