`DeltaTable.to_pyarrow_dataset()` fails for tables containing map types
Environment
Delta-rs version: 0.5.8
Binding: Python
Environment:
- Cloud provider: Azure
- OS: Ubuntu 18.04
- Other: Python 3.8
Bug
What happened:
When using `DeltaTable.to_pyarrow_dataset()` on a table containing map types, it crashes with:
~/.cache/bazel/_bazel_tomnewton/8716cf88e62bedfa3a3dfe6a436af176/execroot/WayveCode/bazel-out/k8-opt/bin/tools/jupyter_services.runfiles/pip-svc_deltalake/deltalake/schema.py in pyarrow_field_from_dict(field)
281 return pyarrow.field(
282 field["name"],
--> 283 pyarrow_datatype_from_dict(field),
284 field["nullable"],
285 field.get("metadata"),
~/.cache/bazel/_bazel_tomnewton/8716cf88e62bedfa3a3dfe6a436af176/execroot/WayveCode/bazel-out/k8-opt/bin/tools/jupyter_services.runfiles/pip-svc_deltalake/deltalake/schema.py in pyarrow_datatype_from_dict(json_dict)
270 return pyarrow.float64()
271 else:
--> 272 return pyarrow.type_for_alias(type_class)
273
274
~/.cache/bazel/_bazel_tomnewton/8716cf88e62bedfa3a3dfe6a436af176/execroot/WayveCode/bazel-out/k8-opt/bin/tools/jupyter_services.runfiles/pip-svc_pyarrow/pyarrow/types.pxi in pyarrow.lib.type_for_alias()
ValueError: No type alias for map
What you expected to happen: Open the table without error.
How to reproduce it:
Use to_pyarrow_dataset() on any table containing map types.
I created a test that catches this.
More details: I think there are 2 reasons why this doesn't work currently:
- There is a bug in the python code to parse the schema json.
- The version of rust arrow used (15) does not support map types. After fixing point 1 we get
`ArrowException: C Data interface error: The datatype "+m" is still not supported in Rust implementation` from this line. I'm unsure if this support is available in the latest version of rust arrow.
I have a draft PR https://github.com/delta-io/delta-rs/pull/712 which fixes the first issue and I'm investigating the second issue.
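To illustrate the first point: in the Delta schema JSON, a nested type such as a map is a JSON object (with `keyType` and `valueType`), not a plain type-name string, so it has to be handled recursively before falling back to a primitive alias lookup. Below is a rough, stdlib-only sketch of that dispatch. It is not the deltalake code and not the change in the draft PR; `describe_type` and `PRIMITIVES` are hypothetical names, the primitive table is only a subset, and it renders type strings rather than building real PyArrow types.

```python
import json

# Subset of Delta primitive type names mapped to Arrow-style names.
# Illustrative only; decimal types, for example, are not handled here.
PRIMITIVES = {
    "string": "string",
    "long": "int64",
    "integer": "int32",
    "short": "int16",
    "byte": "int8",
    "float": "float32",
    "double": "float64",
    "boolean": "bool",
    "binary": "binary",
    "date": "date32",
}


def describe_type(delta_type):
    """Render a Delta schema type (str or dict) as an Arrow-style type string."""
    if isinstance(delta_type, str):
        # Primitive types are plain strings in the schema JSON.
        if delta_type not in PRIMITIVES:
            raise ValueError(f"No type alias for {delta_type}")
        return PRIMITIVES[delta_type]

    # Nested types are objects tagged with a "type" key; recurse into them.
    kind = delta_type["type"]
    if kind == "map":
        key = describe_type(delta_type["keyType"])
        value = describe_type(delta_type["valueType"])
        return f"map<{key}, {value}>"
    if kind == "array":
        element = describe_type(delta_type["elementType"])
        return f"list<{element}>"
    if kind == "struct":
        fields = ", ".join(
            f"{field['name']}: {describe_type(field['type'])}"
            for field in delta_type["fields"]
        )
        return f"struct<{fields}>"
    raise ValueError(f"Unsupported nested type: {kind}")


schema_json = """
{"type": "struct", "fields": [
  {"name": "tags",
   "type": {"type": "map", "keyType": "string",
            "valueType": "long", "valueContainsNull": true},
   "nullable": true, "metadata": {}}
]}
"""
print(describe_type(json.loads(schema_json)))
```

Passing the bare string `"map"` to the primitive lookup raises an error similar to the one in the traceback above, which is what happens when the nested case is not dispatched first.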
It looks like 2 in progress PRs will fix this https://github.com/delta-io/delta-rs/pull/703 https://github.com/delta-io/delta-rs/pull/684
OK, both of these PRs have merged, but map types are still not fully working.
- We need to upgrade arrow by one more version (18.0.0 -> 19.0.0) to include the fix for https://github.com/apache/arrow-rs/issues/2037. #703 only upgraded to 18.0.0 because there is currently no release of datafusion that supports 19.0.0.
- Some nested types have PyArrow casting issues. Potentially this could be resolved within delta-rs but it should definitely be resolved by https://issues.apache.org/jira/browse/ARROW-17349
Hmm, I just learned about a PyArrow option, `use_compliant_nested_type`, which might change some things. I'll look into this soon.
Docs: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html#pyarrow.parquet.ParquetWriter
Background: https://issues.apache.org/jira/browse/ARROW-11497
FYI, I fixed the upstream casting issue. It will be available in PyArrow 10.0.0, which will be released in the next couple of weeks.
I've tested it with PyArrow 10.0.0 and everything seems to be working. Thanks to everyone who contributed to fixing this, especially @wjones127.