Daft
Faulty reading of a Hudi table after it has been altered
Describe the bug
General Description
While reading data from a Hudi table with daft.read_hudi(), we get an error caused by a column mismatch.
Setup
- Our Hudi COW table is hosted on S3.
- After the initial setup, we modified the table by updating the Avro schema (.avsc file) and adding two new string columns in between our existing columns. For new data, these columns are populated correctly, while for existing rows they are null.
When running
dfd = daft.read_hudi('s3://path/to/hudi')
dfd.columns # this is Ok and returns correct column names
dfd.schema() # this is OK and returns correct schema
dfd.show() # we get an error
> ArrowTypeError: Expected XXXX, got a YYYY object
# Or on a different altered hudi table
> ArrowInvalid: Could not convert 'Florida' with type str: tried to convert to double
In both cases, the type it is trying to use is the one from before the table was altered.
When reading a Hudi table that has not been altered, there is no problem.
We are using:
getdaft==0.3.3
hoodie.table.version=5
Any suggestions?
Hello! This might be a pyhudi error -- cc @xushiyan from the Hudi team for any thoughts
We are currently awaiting the Hudi team's implementation of Hudi-rs which would give us more robust support for Hudi
Just adding additional context
It seems to be an Avro vs. Arrow issue.
When trying to use hudi-rs we also get an error:
ArrowInvalid: Schema at index X was different
This is an example of what we are running:
from hudi import HudiTable # pip install hudi
import pyarrow as pa
hudi_path = 's3://path/to/hudi/table'  # no f-string needed, there are no placeholders
hudi_table = HudiTable(hudi_path)
records = hudi_table.read_snapshot()
arrow_table = pa.Table.from_batches(records)
The schemas of the returned record batches are not the same.
Ok yeah this might be a Hudi issue in general then. Do you mind filing an issue against hudi-rs and linking that issue here please @sephib ?
Hi @sephib, reading a Hudi table that has gone through incompatible schema evolution is currently not supported. Support is being tracked in https://github.com/apache/hudi-rs/issues/77