Faulty reading of a Hudi table after it has been altered

Open sephib opened this issue 1 year ago • 2 comments

Describe the bug

General Description: While reading data from a Hudi table with daft.read_hudi(), we get an error caused by a mismatch of the columns.

Setup

  1. Our Hudi COW table is hosted on S3.
  2. After the initial setup, we modified the table by updating the Avro schema (.avsc file), adding two new string columns in between the existing columns. For new data, these columns are populated correctly, while for existing rows they are null.

When running

dfd = daft.read_hudi('s3://path/to/hudi')
dfd.columns   # this is Ok and returns correct column names
dfd.schema()  # this is OK and returns correct schema

dfd.show()  # we get an error
> ArrowTypeError: Expected XXXX, got a YYYY  object

# Or on a different altered hudi table
> ArrowInvalid: Could not convert 'Florida' with type str: tried to convert to double

In both cases, the type it is trying to use is the one from before the table was altered.

When reading a hudi table that has not been altered there is no problem.
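A minimal sketch of how this kind of mismatch could arise, assuming (this is a guess, not confirmed from the Daft or Hudi code) that columns are matched by position somewhere in the read path. The column names and types below are hypothetical and only mirror the shape of our change: a string column inserted in between existing columns.

```python
# Hypothetical schemas as (name, type) pairs; names and types are made up.
old_schema = [("id", "int64"), ("amount", "double")]
new_schema = [("id", "int64"), ("state", "string"), ("amount", "double")]


def align_by_position(data_schema, expected_schema):
    """Pair columns by index and flag whether their types agree."""
    return [
        (data_name, expected_name, data_type == expected_type)
        for (data_name, data_type), (expected_name, expected_type)
        in zip(data_schema, expected_schema)
    ]


def align_by_name(data_schema, expected_schema):
    """Look up each expected column by name; None means 'fill with nulls'."""
    data_types = dict(data_schema)
    return [
        (name, data_types.get(name), expected_type)
        for name, expected_type in expected_schema
    ]


# Data written with the new schema, read against the old one by position:
# the string column 'state' lands on the double column 'amount'.
for row in align_by_position(new_schema, old_schema):
    print("by position:", row)

# Data written with the old schema, read against the new one by name:
# 'state' is simply missing and can be null, with no type clash.
for row in align_by_name(old_schema, new_schema):
    print("by name:    ", row)
```

If matching were positional, a string value ending up in a slot typed as double would look a lot like the second error above, but we have not verified that this is what actually happens.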

We are using

getdaft==0.3.3
hoodie.table.version=5

Any suggestions?

sephib avatar Sep 26 '24 13:09 sephib

Hello! This might be a pyhudi error -- cc @xushiyan from the Hudi team for any thoughts

We are currently awaiting the Hudi team's implementation of Hudi-rs which would give us more robust support for Hudi

jaychia avatar Sep 26 '24 17:09 jaychia

Just adding additional context: it seems to be an Avro vs. Arrow issue. When trying to use hudi-rs we also get an error:

ArrowInvalid: Schema at index X was different

this is an example of what we are running

from hudi import HudiTable  # pip install hudi
import pyarrow as pa

hudi_path = 's3://path/to/hudi/table'
hudi_table = HudiTable(hudi_path)
records = hudi_table.read_snapshot()
arrow_table = pa.Table.from_batches(records)

The schemas of the returned record batches are not all the same.
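To narrow down which batches differ, the batches can be grouped by schema. The following is a rough sketch: the helper names `schema_signature` and `group_batches_by_schema` are made up for illustration, and the stand-in batches built with `SimpleNamespace` only mimic the `schema.names` and `schema.types` attributes of a pyarrow RecordBatch so that the snippet runs without access to the table.

```python
from collections import defaultdict
from types import SimpleNamespace


def schema_signature(batch):
    """Return a hashable tuple of (column name, type as string) pairs."""
    schema = batch.schema
    return tuple(zip(schema.names, (str(t) for t in schema.types)))


def group_batches_by_schema(batches):
    """Map each distinct schema signature to the indices of its batches."""
    groups = defaultdict(list)
    for index, batch in enumerate(batches):
        groups[schema_signature(batch)].append(index)
    return dict(groups)


def fake_batch(fields):
    """Stand-in for a pyarrow RecordBatch (illustration only)."""
    names = [name for name, _ in fields]
    types = [type_ for _, type_ in fields]
    return SimpleNamespace(schema=SimpleNamespace(names=names, types=types))


# Hypothetical columns: one batch written before the alteration, two after.
old_fields = [("id", "int64"), ("amount", "double")]
new_fields = [("id", "int64"), ("state", "string"), ("amount", "double")]
batches = [fake_batch(old_fields), fake_batch(new_fields), fake_batch(new_fields)]

groups = group_batches_by_schema(batches)
for signature, indices in groups.items():
    print(indices, signature)
```

With the real table, passing the `records` list from the snippet above to `group_batches_by_schema` should show one group per distinct schema, which makes it easier to see which batches were written before the alteration.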

sephib avatar Sep 29 '24 09:09 sephib

Ok yeah this might be a Hudi issue in general then. Do you mind filing an issue against hudi-rs and linking that issue here please @sephib ?

jaychia avatar Oct 07 '24 21:10 jaychia

Hi @sephib, reading a Hudi table currently does not support incompatible schema evolution. Support is being tracked in https://github.com/apache/hudi-rs/issues/77

xushiyan avatar Jun 29 '25 17:06 xushiyan