Map modin category type to arrow dictionary type in omnisci backend

Open fexolm opened this issue 4 years ago • 5 comments

Now that OmniSci finally supports Arrow 2.0, we could explicitly map Modin's category type to Arrow's dictionary type. This would allow us to distinguish those types in FSI.

fexolm avatar Feb 17 '21 08:02 fexolm

https://github.com/intel-ai/omniscidb/tree/modin_test could be used as a source for the omnisci backend with Arrow 2.0 support.

Garra1980 avatar Feb 26 '21 12:02 Garra1980

I've implemented none-encoded string support in the ArrowResultSetConverter, so it shouldn't be a blocker anymore.

fexolm avatar Mar 22 '21 10:03 fexolm

The current intel-ai/omniscidb/modin_cats branch still has problems with none-encoded strings. When a string of this type occurs in the result set, the following exception may be thrown:

Exception: Columnar conversion not supported for variable length types

It's thrown from here and can be reproduced with this code:

import pyarrow as pa
import sys

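# Import dbe with RTLD_GLOBAL set, then restore the previous dlopen flags.
prev = sys.getdlopenflags()
sys.setdlopenflags(1 | 256)  # RTLD_LAZY | RTLD_GLOBAL

from dbe import PyDbEngine

sys.setdlopenflags(prev)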

at = pa.Table.from_pydict(
    {"col1": ["str12", "str2", "str3"], "col2": [1, 2, 3]},
    schema=pa.schema({"col1": pa.string(), "col2": pa.int32()}),
)

server = PyDbEngine()
server.importArrowTable("test_name", at)

print(server.select_df("SELECT * FROM test_name ORDER BY col2"))    # OK
print(server.select_df("SELECT col1 FROM test_name ORDER BY col2")) # RuntimeError

The same queries in relational algebra notation:

Calcite RA queries
query 1
query_src: SELECT * FROM test_name ORDER BY col2
query_ra: {
  "rels": [
    {
      "id": "0",
      "relOp": "LogicalTableScan",
      "fieldNames": [
        "col1",
        "col2",
        "rowid"
      ],
      "table": [
        "omnisci",
        "test_name"
      ],
      "inputs": []
    },
    {
      "id": "1",
      "relOp": "LogicalProject",
      "fields": [
        "col1",
        "col2"
      ],
      "exprs": [
        {
          "input": 0
        },
        {
          "input": 1
        }
      ]
    },
    {
      "id": "2",
      "relOp": "LogicalSort",
      "collation": [
        {
          "field": 1,
          "direction": "ASCENDING",
          "nulls": "LAST"
        }
      ]
    }
  ]
}
query 2
query_src: SELECT col1 FROM test_name ORDER BY col2
query_ra: {
  "rels": [
    {
      "id": "0",
      "relOp": "LogicalTableScan",
      "fieldNames": [
        "col1",
        "col2",
        "rowid"
      ],
      "table": [
        "omnisci",
        "test_name"
      ],
      "inputs": []
    },
    {
      "id": "1",
      "relOp": "LogicalProject",
      "fields": [
        "col1",
        "col2"
      ],
      "exprs": [
        {
          "input": 0
        },
        {
          "input": 1
        }
      ]
    },
    {
      "id": "2",
      "relOp": "LogicalSort",
      "collation": [
        {
          "field": 1,
          "direction": "ASCENDING",
          "nulls": "LAST"
        }
      ]
    },
    {
      "id": "3",
      "relOp": "LogicalProject",
      "fields": [
        "col1"
      ],
      "exprs": [
        {
          "input": 0
        }
      ]
    }
  ]
}

As the Calcite JSON shows, what triggers the exception is the projection of the sort result. Other cases in which the exception may be thrown have not been investigated yet.
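
To make the difference between the two plans easier to spot, here is a small sketch that scans a Calcite RA JSON string for a LogicalProject placed directly after a LogicalSort. The helper name `projects_over_sort` is hypothetical, and the code assumes that a node without an explicit "inputs" field takes the preceding node as its input, which matches the plans shown above. It only flags the pattern; it does not prove that every such plan fails.

```python
import json


def projects_over_sort(query_ra):
    """Return ids of LogicalProject nodes that directly follow a LogicalSort."""
    rels = json.loads(query_ra)["rels"]
    hits = []
    # Assumption: each node consumes the node listed immediately before it.
    for prev, cur in zip(rels, rels[1:]):
        if prev["relOp"] == "LogicalSort" and cur["relOp"] == "LogicalProject":
            hits.append(cur["id"])
    return hits


# Abbreviated version of query 2: only the fields the check needs.
example_plan = json.dumps(
    {
        "rels": [
            {"id": "0", "relOp": "LogicalTableScan"},
            {"id": "1", "relOp": "LogicalProject"},
            {"id": "2", "relOp": "LogicalSort"},
            {"id": "3", "relOp": "LogicalProject"},
        ]
    }
)
print(projects_over_sort(example_plan))
```

Applied to the two plans above, query 1 ends at the LogicalSort and yields no matches, while query 2 yields the trailing projection with id "3".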

dchigarev avatar Mar 25 '21 13:03 dchigarev

So far it looks like the problem is on the OmniSci side; they are looking into it.

Garra1980 avatar Apr 01 '21 11:04 Garra1980

So far it looks like the problem is on the OmniSci side; they are looking into it.

Any updates here?

anmyachev avatar Oct 29 '23 17:10 anmyachev

HDK engine is deprecated and will be removed in a future version.

YarShev avatar May 16 '24 11:05 YarShev