`DeltaTable.to_pyarrow_dataset()` fails for tables containing map types

Open Tom-Newton opened this issue 2 years ago • 4 comments

Environment

Delta-rs version: 0.5.8

Binding: Python

Environment:

  • Cloud provider: Azure
  • OS: Ubuntu 18.04
  • Other: Python 3.8

Bug

What happened: Calling `DeltaTable.to_pyarrow_dataset()` on a table containing map types crashes with:

~/.cache/bazel/_bazel_tomnewton/8716cf88e62bedfa3a3dfe6a436af176/execroot/WayveCode/bazel-out/k8-opt/bin/tools/jupyter_services.runfiles/pip-svc_deltalake/deltalake/schema.py in pyarrow_field_from_dict(field)
    281     return pyarrow.field(
    282         field["name"],
--> 283         pyarrow_datatype_from_dict(field),
    284         field["nullable"],
    285         field.get("metadata"),

~/.cache/bazel/_bazel_tomnewton/8716cf88e62bedfa3a3dfe6a436af176/execroot/WayveCode/bazel-out/k8-opt/bin/tools/jupyter_services.runfiles/pip-svc_deltalake/deltalake/schema.py in pyarrow_datatype_from_dict(json_dict)
    270             return pyarrow.float64()
    271     else:
--> 272         return pyarrow.type_for_alias(type_class)
    273 
    274 

~/.cache/bazel/_bazel_tomnewton/8716cf88e62bedfa3a3dfe6a436af176/execroot/WayveCode/bazel-out/k8-opt/bin/tools/jupyter_services.runfiles/pip-svc_pyarrow/pyarrow/types.pxi in pyarrow.lib.type_for_alias()

ValueError: No type alias for map

What you expected to happen: Open the table without error.

How to reproduce it: Use to_pyarrow_dataset() on any table containing map types. I created a test that catches this.

More details: I think there are 2 reasons why this doesn't work currently:

  1. There is a bug in the Python code that parses the schema JSON.
  2. The version of Rust arrow used (15) does not support map types. After fixing point 1 we get ArrowException: C Data interface error: The datatype "+m" is still not supported in Rust implementation from this line. I'm unsure whether this support is available in the latest version of Rust arrow.
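
For anyone following along, here is a minimal, dependency-free sketch of why point 1 fails. It is not the actual delta-rs code: `describe_type` and `PRIMITIVES` are hypothetical names, and the helper only builds a readable Arrow-style description instead of real PyArrow types. The idea is that in the Delta schema JSON a map is a nested object with `keyType` and `valueType`, so it has to be handled recursively and cannot be resolved through a flat alias lookup such as `pyarrow.type_for_alias("map")`.

```python
# Delta primitive type names mapped to Arrow-style names (a subset, for illustration).
PRIMITIVES = {
    "string": "string",
    "long": "int64",
    "integer": "int32",
    "short": "int16",
    "byte": "int8",
    "float": "float32",
    "double": "float64",
    "boolean": "bool",
    "binary": "binary",
}


def describe_type(delta_type):
    """Return an Arrow-style description of a Delta schema type (hypothetical helper)."""
    if isinstance(delta_type, str):
        # Primitive types are plain strings and can be looked up directly.
        return PRIMITIVES[delta_type]

    kind = delta_type["type"]
    if kind == "map":
        # Maps carry their own key and value types, which may themselves be nested.
        key = describe_type(delta_type["keyType"])
        value = describe_type(delta_type["valueType"])
        return f"map<{key}, {value}>"
    if kind == "array":
        return f"list<{describe_type(delta_type['elementType'])}>"
    if kind == "struct":
        fields = ", ".join(
            f"{field['name']}: {describe_type(field['type'])}"
            for field in delta_type["fields"]
        )
        return f"struct<{fields}>"
    raise ValueError(f"Unsupported Delta type: {kind}")


# A map column as it would appear in the Delta schema JSON.
map_column = {
    "type": "map",
    "keyType": "string",
    "valueType": {"type": "array", "elementType": "long", "containsNull": True},
    "valueContainsNull": True,
}
print(describe_type(map_column))
```

In a real fix the map branch would presumably build the type with `pyarrow.map_(key_type, item_type)` rather than a string.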

Tom-Newton avatar Jul 26 '22 12:07 Tom-Newton

I have a draft PR https://github.com/delta-io/delta-rs/pull/712 which fixes the first issue and I'm investigating the second issue.

Tom-Newton avatar Jul 26 '22 12:07 Tom-Newton

It looks like two in-progress PRs will fix this: https://github.com/delta-io/delta-rs/pull/703 and https://github.com/delta-io/delta-rs/pull/684

Tom-Newton avatar Jul 26 '22 16:07 Tom-Newton

OK, both of these PRs (#703 and #684) have merged, but map types are still not fully working.

  1. We need to upgrade arrow by one more version, 18.0.0 -> 19.0.0, to include the fix for https://github.com/apache/arrow-rs/issues/2037. #703 only upgraded to 18.0.0 because there is currently no release of datafusion that supports 19.0.0.
  2. Some nested types have PyArrow casting issues. Potentially this could be resolved within delta-rs, but it should definitely be resolved by https://issues.apache.org/jira/browse/ARROW-17349.

Tom-Newton avatar Aug 10 '22 10:08 Tom-Newton

Hmm, I just learned about a PyArrow option, use_compliant_nested_type, which might change some things. I'll look into this soon.

Docs: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html#pyarrow.parquet.ParquetWriter
Background: https://issues.apache.org/jira/browse/ARROW-11497

wjones127 avatar Aug 10 '22 17:08 wjones127

FYI, I fixed the upstream casting issue. It will be available in PyArrow 10.0.0, which will be released in the next couple of weeks.

wjones127 avatar Oct 17 '22 20:10 wjones127

I've tested it with PyArrow 10.0.0 and everything seems to be working. Thanks to everyone who contributed to fixing this, especially @wjones127.

Tom-Newton avatar Nov 09 '22 13:11 Tom-Newton