feat: Support metadata table "Entries"
Re #823. This adds support for the Manifest Entries table (docs), which lists the entries in the current snapshot's manifest files.
Relevant classes are (updated):

- `EntriesTable`, which builds the Arrow scan and returns the Iceberg schema.
- `DataFileStructBuilder`, which builds the `data_file` struct field where each element has the values of a data file.
- `ReadableMetricsStructBuilder`, which builds the `readable_metrics` struct containing metrics for primitive fields in the table.
This PR ended up being quite verbose because arrow-rs is strict about declaring the generic types of array builders at compile time. Unlike Python, which supports the entries table in ~100 lines, we can't insert a dict into a `StructBuilder`. Ideally, we could build a `StructArray` row by row with `Any` values and write logic to convert manifest entries to rows. (Let me know if I'm missing something.)
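To illustrate, here is a minimal sketch (not the PR's code; field names are made up) of why the builders are verbose: every struct field's concrete builder type has to be named at compile time, and rows are appended column by column rather than as loosely typed records:

```rust
use arrow_array::builder::{Int64Builder, StringBuilder, StructBuilder};
use arrow_array::Array;
use arrow_schema::{DataType, Field, Fields};

fn main() {
    // Each field's builder type is fixed at compile time; there is no
    // equivalent of pushing a Python dict into the struct.
    let fields = Fields::from(vec![
        Field::new("file_path", DataType::Utf8, false),
        Field::new("record_count", DataType::Int64, false),
    ]);
    let mut builder = StructBuilder::from_fields(fields, 1);

    // Appending one "row" means addressing every column builder by index
    // and downcasting to its concrete builder type.
    builder
        .field_builder::<StringBuilder>(0)
        .unwrap()
        .append_value("s3://bucket/data/file.parquet");
    builder
        .field_builder::<Int64Builder>(1)
        .unwrap()
        .append_value(42);
    builder.append(true);

    let array = builder.finish();
    assert_eq!(array.len(), 1);
}
```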
Reference implementations:
Thank you, @xuanwo. Rebased and ready for review.
I've rebased on #870 and #872 to address the following:

- The entries table now lives in a separate `entries.rs` file.
- Batches for manifest files are now computed asynchronously.

I haven't addressed @liurenjie1024's point about `schema()` returning an Iceberg schema instead of an Arrow one. We have issue #868 and PR #871, but I'm not sure that's what we want generally, and whether you want this to happen before this PR.

Otherwise this is ready for another review.
Thanks @rshkv for the contribution, let's continue the discussion about the schema in #868.
I'm still working on this and will continue to. I'm finding it quite difficult to get Arrow and Iceberg to agree on types because of field ids.
The Iceberg schema requires that all fields have a field id, which in the converted Arrow schema becomes field metadata. But when constructing a `StructArray` or `RecordBatch`, Arrow checks that the schema matches that of the data (e.g.). And they tend not to match, because the schema has field ids and the data does not.

E.g., with `MapBuilder`, we have `with_values_field` to pass in a value field with a field id in its metadata. However, there is no `with_keys_field` equivalent. Yet the key field in the Iceberg schema must have a field id.
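To make the mismatch concrete, here is a minimal, self-contained sketch (not the PR's code; it assumes arrow-rs before 54.2.0 and borrows the illustrative ids 117/118 from the error pasted further down). The value field can carry its id via `with_values_field`, but the key field built by `MapBuilder` cannot, so validating against the Iceberg-derived field fails:

```rust
use std::{collections::HashMap, sync::Arc};

use arrow_array::builder::{Int32Builder, Int64Builder, MapBuilder, MapFieldNames};
use arrow_array::{ArrayRef, StructArray};
use arrow_schema::{DataType, Field, Fields};

fn main() {
    // Iceberg field ids surface as "PARQUET:field_id" metadata in Arrow.
    let field_id = |id: i32| HashMap::from([("PARQUET:field_id".to_string(), id.to_string())]);

    // The value field can carry its field id, but before arrow-rs 54.2.0
    // there was no `with_keys_field` to do the same for the key field.
    let names = MapFieldNames {
        entry: "key_value".to_string(),
        key: "key".to_string(),
        value: "value".to_string(),
    };
    let mut builder = MapBuilder::new(Some(names), Int32Builder::new(), Int64Builder::new())
        .with_values_field(
            Field::new("value", DataType::Int64, false).with_metadata(field_id(118)),
        );
    builder.keys().append_value(117);
    builder.values().append_value(1024);
    builder.append(true).unwrap();
    let map_array = builder.finish();

    // The field derived from the Iceberg schema has an id on the key...
    let entries = Fields::from(vec![
        Field::new("key", DataType::Int32, false).with_metadata(field_id(117)),
        Field::new("value", DataType::Int64, false).with_metadata(field_id(118)),
    ]);
    let expected = Field::new(
        "column_sizes",
        DataType::Map(
            Arc::new(Field::new("key_value", DataType::Struct(entries), false)),
            false,
        ),
        false,
    );

    // ...so constructing the struct fails with "Incorrect datatype for
    // StructArray field ..." because the built key field has no metadata.
    let result = StructArray::try_new(
        Fields::from(vec![expected]),
        vec![Arc::new(map_array) as ArrayRef],
        None,
    );
    assert!(result.is_err());
}
```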
@liurenjie1024, I'd like to see this finished, and some clarity on designs you'd merge would help. Here are some questions:

- We can make `schema()` return an Iceberg schema with field ids, but how important is it that a returned `RecordBatch` has field ids in its metadata? Are we ok with not having field ids on `RecordBatch#schema` but only on `MetadataTable#schema`?
- If we do need record batch types to have those field ids, we might need changes in arrow-rs to express something like "a `RecordBatch` or `StructArray` may have fields with metadata, but the respective types of the underlying `ArrayData` don't need to match the metadata".
Let me know what you think or if I'm not being clear.
> But when constructing a `StructArray` or `RecordBatch`, Arrow checks that the schema matches that of the data. And they tend not to match, because the schema has field ids and the data does not.

A little confused about this point. What causes them to mismatch? IIUC, if we use the correct schema to construct the `Array`/`RecordBatch`, there should be no mismatch.
@xxchan, that's right. The problem is constructing the `Array`/`RecordBatch` with the right schema in the first place. `MapBuilder` doesn't let you configure the key field, though, so the keys array won't have the field id.
> `MapBuilder` doesn't let you configure the key field, though, so the keys array won't have the field id.
Not sure if I understand it correctly, but this sounds like just a limitation of arrow-rs that could be solved easily by adding some new APIs, rather than a fundamental difficulty.
BTW maybe it's easier to discuss if you can push some sample code.
> We can make `schema()` return an Iceberg schema with field ids, but how important is it that a returned `RecordBatch` has field ids in its metadata? Are we ok with not having field ids on `RecordBatch#schema` but only on `MetadataTable#schema`?
Yes, that's alright. The field-IDs are internal to Iceberg, and how fields are resolved when they go to the engine depends on the environment: SQL is position-based, while in dataframes you could also look up fields by name.
Just noticed that in the Java implementation, the schema has field ids, defined like this: https://github.com/apache/iceberg-rust/pull/863#pullrequestreview-2588447860
We don't have to expose the field-IDs internally, as long as we resolve the fields using the ID. The name might change over time (see https://github.com/apache/iceberg/pull/5338/), but if you read the data as an Iceberg table and we resolve the fields by ID, then everything will be correct.
> BTW maybe it's easier to discuss if you can push some sample code.
Pushed my work-in-progress. Currently failing here (scroll right):
```
Incorrect datatype for StructArray field "column_sizes", expected
Map(Field {
    name: "key_value",
    data_type: Struct([
        Field { name: "key", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {"PARQUET:field_id": "117"} },
        Field { name: "value", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {"PARQUET:field_id": "118"} }]),
    nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false)
got
Map(Field {
    name: "key_value",
    data_type: Struct([
        Field { name: "key", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} },
        Field { name: "value", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }]),
    nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false)
```
> Not sure if I understand it correctly, but this sounds like just a limitation of arrow-rs that could be solved easily by adding some new APIs, rather than a fundamental difficulty.
That's right. Going to open a PR with arrow-rs.
The arrow-rs change we needed (here) shipped in 54.2.0, which we already picked up here. That means the `MapBuilder` instances now have a key field that preserves our Iceberg field id in its metadata.
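For reference, a minimal sketch (assuming arrow-rs >= 54.2.0; the field names and ids mirror the `column_sizes` error above) of attaching field ids to both sides of the map via `with_keys_field` and `with_values_field`:

```rust
use std::collections::HashMap;

use arrow_array::builder::{Int32Builder, Int64Builder, MapBuilder};
use arrow_schema::{DataType, Field};

// Build a map whose key *and* value fields carry Iceberg field ids as
// "PARQUET:field_id" metadata, so the finished MapArray's data type
// matches the Arrow schema converted from the Iceberg schema.
fn column_sizes_builder() -> MapBuilder<Int32Builder, Int64Builder> {
    let field_id = |id: i32| HashMap::from([("PARQUET:field_id".to_string(), id.to_string())]);
    MapBuilder::new(None, Int32Builder::new(), Int64Builder::new())
        // `with_keys_field` is the arrow-rs 54.2.0 addition.
        .with_keys_field(Field::new("key", DataType::Int32, false).with_metadata(field_id(117)))
        .with_values_field(Field::new("value", DataType::Int64, false).with_metadata(field_id(118)))
}
```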
With field ids and serving an Iceberg schema (as opposed to an Arrow one) addressed, this is ready for another review.
Sorry for the long wait. I believe this PR is ready to go. Would you like to help resolve the conflicts?
Yay! I’ll resolve conflicts