delta-rs
delta-rs copied to clipboard
Pyarrow dataset scan can't handle schema evolved nested structs
Environment
Delta-rs version: 0.10.1
Binding: Python
Environment:
- Cloud provider: Azure
- OS: Ubuntu 20.04
- Other: pyarrow 13.0.0
Bug
What happened: Exception is raised when trying to read a delta table that has undergone a schema evolution in a nested column. The problem is that the underlying parquet file is missing a column that now exists in the schema according to the delta transaction log.
Traceback (most recent call last):
File "minimal_reproduce.py", line 36, in <module>
pyarrow_table = deltalake.DeltaTable(temp_path).to_pyarrow_table()
File "/home/tomnewton/.local/lib/python3.8/site-packages/deltalake/table.py", line 529, in to_pyarrow_table
return self.to_pyarrow_dataset(
File "pyarrow/_dataset.pyx", line 556, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 3638, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 123, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<sub_column0: string> output fields: struct<sub_column0: string, sub_column1: string>
What you expected to happen: Read the table without error. The new column that is not actually present in the underlying parquet files should be filled with nulls.
How to reproduce it:
I wrote a python script that reproduces it. Essentially this just writes to a delta table, makes a second write with schema evolution in a sub column then tries to read the resulting table. (zipped it because github doesn't allow uploading .py files)
minimal_reproduce.zip
More details:
This might need to be fixed in pyarrow but this slightly unusual usecase is motivated by delta-rs.
This is effectively a schema evolution when column data type is changing, which is not currently supported by Delta, right?
Error is thrown in the pyarrow kernel, as it expects all the input struct fields to be present in the output which is not possible of some rows have missing struct fields.
https://github.com/apache/arrow/blob/fdecb6a0bee5fb482705de14c161853fe2ea2b41/cpp/src/arrow/compute/kernels/scalar_cast_nested.cc#L368
@r3stl355 I'm pretty sure this kind of schema evolution is supported by delta. If you try my reproduction script which writes the delta table using spark it allows this schema evolution and it can also reads it back correctly. Only when trying to read with delta-rs + pyarrow does it break.
You are right @Tom-Newton, my bad, I should've checked the other implementation before saying above.
I think the problem is still with the way pyarrow work, not necessarily delta-rs (unless I missing something else). There is this test with TODO - may be related: https://github.com/apache/arrow/blob/fdecb6a0bee5fb482705de14c161853fe2ea2b41/cpp/src/arrow/dataset/scanner_test.cc#L2552
CastStruct (which I presume is throwing the error) is processing input in batches (I assume batch == file in this case) so it reads the input struct fields from the batch and matches it against some output struct which, as I see it, is the same for all the batches, so you have 2 options to specifify that ouput struct - it's either a union of all the fields (as it seems to be the case right now) and throws an error or it could be an intersection of fields in which case you would end up with subset of nested columns (or no data in your case).
I agree the problem may need to be resolved inside pyarrow but I think there is a chance it could be worked around in delta-rs. Regardless, I chose to create the issue here because delta-rs motivates this currently unsupported use-case and I thought it was more likely to be solved if I raised it here.
pyarrow is fine with adding entirely new columns through schema evolution just not sub fields of struct columns. That TODO you mention seems relevant to me.
Actually it looks like that TODO references a completed ticket https://github.com/apache/arrow/issues/31101. The example on that issue actually goes the opposite direction to what I want to resolve this problem. Also it looks like a new issue has been opened asking to implement the same functionality for pyarrow https://github.com/apache/arrow/issues/35408. I'll need to investigate a bit more when I get a chance.
Yeah, those may not be exactly the same as your issue. In addition to this, looks like Rust version is also suffering from the same as it's relying on Arrow DataFusion:
export TABLE_URI="<you table path here>"
cargo run --example load_table --features="datafusion"
results in Error during planning: Cannot cast file schema field nested_column of type Struct([Field { name: \"sub_column0\.. }]) to table schema field of type Struct([Field { name: \"sub_column1\"....
Hey @r3stl355 and @Tom-Newton! I faced the same problem right now. Do you have any updates about that?
I'm not aware of any progress on this. It's quite a rare occurrence for us and we've been working around it by re-writing relevant parts of tables, when we have problems.
@Tom-Newton did you create an issue in the arrow repo?
@Tom-Newton did you create an issue in the arrow repo?
I did not. I never really explored whether it would be possible to fix in delta-rs.
@Tom-Newton did you create an issue in the arrow repo?
I did not. I never really explored whether it would be possible to fix in
delta-rs.
I don't believe there is much we can do since the issue lies in the pyarrow scan function.