
feat: Support metadata table "Entries"

Open rshkv opened this issue 11 months ago • 12 comments

Re #823. This adds support for the Manifest Entries metadata table (docs), which lists entries in the current snapshot's manifest files.

Relevant classes are (updated):

  • EntriesTable, which builds the Arrow scan and returns the Iceberg schema.
  • DataFileStructBuilder, which builds the data_file struct field, where each element holds the values of one data file.
  • ReadableMetricsStructBuilder, which builds the readable_metrics struct containing metrics for the table's primitive fields.

This PR ended up being quite verbose because arrow-rs is strict about declaring the generic types of array builders at compile time. Unlike in Python, where the entries table takes ~100 lines, we can't insert a dict into a StructBuilder. Ideally, we could build a StructArray row by row from Any values and write logic to convert manifest entries to rows. (Let me know if I'm missing something.)
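
For a sense of what that strictness looks like, here's a minimal sketch (the field names are a made-up two-field slice of the real data_file struct, not the PR's actual code): appending even one row means naming each field's concrete builder type at the call site.

```rust
use arrow_array::builder::{Int64Builder, StringBuilder, StructBuilder};
use arrow_array::Array;
use arrow_schema::{DataType, Field, Fields};

fn main() {
    // Two illustrative fields standing in for the much larger data_file struct.
    let fields = Fields::from(vec![
        Field::new("file_path", DataType::Utf8, false),
        Field::new("record_count", DataType::Int64, false),
    ]);
    let mut builder = StructBuilder::from_fields(fields, 1);

    // Each field's concrete builder type must be spelled out at compile time;
    // there is no dynamically typed "append a row of Any values" entry point.
    builder
        .field_builder::<StringBuilder>(0)
        .unwrap()
        .append_value("s3://bucket/data/a.parquet");
    builder
        .field_builder::<Int64Builder>(1)
        .unwrap()
        .append_value(42);
    builder.append(true); // mark the struct row itself as valid

    let array = builder.finish();
    assert_eq!(array.len(), 1);
}
```

Multiply this by every field of every nested struct in the entries schema and the line count adds up quickly.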

Reference implementations:

rshkv avatar Jan 01 '25 21:01 rshkv

Thank you, @xuanwo. Rebased and ready for review.

rshkv avatar Jan 02 '25 22:01 rshkv

I've rebased on #870 and #872 to address the following:

  • The entries table now lives in a separate entries.rs file.
  • Batches for manifest files are now computed asynchronously.

I haven't addressed @liurenjie1024's point about schema() returning an Iceberg schema instead of an Arrow one. We have issue #868 and PR #871, but I'm not sure that's what we want in general, or whether you want it to happen before this PR.

Otherwise this is ready for another review.

rshkv avatar Jan 08 '25 15:01 rshkv

Thanks @rshkv for the contribution, let's continue the discussion about the schema in #868.

liurenjie1024 avatar Jan 09 '25 10:01 liurenjie1024

I'm still working on this. I find it quite difficult to get Arrow and Iceberg to agree on types because of field ids.

The Iceberg schema requires that all fields have a field id, which in the converted Arrow schema becomes field metadata. But when constructing a StructArray or RecordBatch, Arrow checks that the schema matches that of the data (e.g.). The two tend not to match because the schema has field ids and the data does not.

E.g., with MapBuilder we have with_values_field to pass in a value field with a field id in its metadata, but there is no with_keys_field equivalent. Yet the key field in the Iceberg schema must have a field id.
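
To make the mismatch concrete, here's a minimal sketch, assuming arrow-rs before a with_keys_field addition (it reuses the column_sizes map and field ids 117/118 from the error output further down, but is illustrative rather than the PR's actual code):

```rust
use std::collections::HashMap;
use std::sync::Arc;

use arrow_array::builder::{Int32Builder, Int64Builder, MapBuilder, MapFieldNames};
use arrow_array::{ArrayRef, RecordBatch};
use arrow_schema::{DataType, Field, Schema};

fn main() {
    // Key/value fields as the Iceberg schema declares them, with field ids
    // carried as Arrow field metadata.
    let field_id = |id: &str| HashMap::from([("PARQUET:field_id".to_string(), id.to_string())]);
    let key_field = Field::new("key", DataType::Int32, false).with_metadata(field_id("117"));
    let value_field = Field::new("value", DataType::Int64, false).with_metadata(field_id("118"));

    // with_values_field lets us attach the value field's metadata, but without
    // a with_keys_field equivalent the built array's key field has no metadata.
    let names = MapFieldNames {
        entry: "key_value".to_string(),
        key: "key".to_string(),
        value: "value".to_string(),
    };
    let mut builder = MapBuilder::new(Some(names), Int32Builder::new(), Int64Builder::new())
        .with_values_field(value_field.clone());
    builder.keys().append_value(1);
    builder.values().append_value(100);
    builder.append(true).unwrap();
    let map_array = builder.finish();

    // The declared schema carries the key's field id ...
    let entries = Field::new(
        "key_value",
        DataType::Struct(vec![key_field, value_field].into()),
        false,
    );
    let schema = Schema::new(vec![Field::new(
        "column_sizes",
        DataType::Map(Arc::new(entries), false),
        false,
    )]);

    // ... the array's key field does not, so the nested data types compare
    // unequal and constructing the batch fails.
    let result = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(map_array) as ArrayRef]);
    assert!(result.is_err());
}
```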

@liurenjie1024, I'd like to see this finished and some clarity on designs you'd merge would help. Here are some questions:

  • We can make schema() return an Iceberg schema with field ids, but how important is it that a returned RecordBatch has field ids in metadata? Are we ok with having field ids only on MetadataTable#schema and not on RecordBatch#schema?
  • If we do need record batch types to carry those field ids, we might need changes in arrow-rs to express something like "a RecordBatch or StructArray may have fields with metadata, but the respective types of the underlying ArrayData don't need to match that metadata".

Let me know what you think or if I'm not being clear.

rshkv avatar Jan 24 '25 17:01 rshkv

But when constructing a StructArray or RecordBatch, Arrow checks that the schema matches that of the data (e.g.). The two tend not to match because the schema has field ids and the data does not.

A little confused about this point. What causes them to mismatch? IIUC, if we use the correct schema to construct the Array/RecordBatch, there should be no mismatch.

xxchan avatar Feb 02 '25 02:02 xxchan

@xxchan, that's right. The problem is constructing the Array/RecordBatch with the right schema in the first place: MapBuilder doesn't let you configure the key field, so the keys array won't have the field id.

rshkv avatar Feb 02 '25 12:02 rshkv

MapBuilder doesn't let you configure the key field, so the keys array won't have the field id.

Not sure if I understand correctly, but this sounds like a limitation of arrow-rs that can be solved easily by adding new APIs, rather than a fundamental difficulty.

BTW maybe it's easier to discuss if you can push some sample code.

xxchan avatar Feb 02 '25 13:02 xxchan

We can make schema() return an Iceberg schema with field ids, but how important is it that a returned RecordBatch has field ids in metadata? Are we ok with having field ids only on MetadataTable#schema and not on RecordBatch#schema?

Yes, that's alright. The field-IDs are internal to Iceberg; how fields are resolved downstream depends on the environment: SQL is position-based, while in dataframes you can also look fields up by name.

Fokko avatar Feb 03 '25 19:02 Fokko

Just noticed that in the Java implementation, the schema has field ids, defined like this: https://github.com/apache/iceberg-rust/pull/863#pullrequestreview-2588447860

We don't have to expose the field-IDs, as long as we resolve the fields using the ID. The name might change over time (see https://github.com/apache/iceberg/pull/5338/), but if you read the data as an Iceberg table and we resolve the fields by ID, everything will be correct.

Fokko avatar Feb 03 '25 19:02 Fokko

BTW maybe it's easier to discuss if you can push some sample code.

Pushed my work in progress. It's currently failing here:

Incorrect datatype for StructArray field "column_sizes", expected

Map(Field {
  name: "key_value",
  data_type: Struct([
    Field { name: "key", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {"PARQUET:field_id": "117"} },
    Field { name: "value", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {"PARQUET:field_id": "118"} }]),
  nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false)

got

Map(Field {
  name: "key_value",
  data_type: Struct([
    Field { name: "key", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} },
    Field { name: "value", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }]),
  nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false)

Not sure if I understand correctly, but this sounds like a limitation of arrow-rs that can be solved easily by adding new APIs, rather than a fundamental difficulty.

That's right. Going to open a PR with arrow-rs.

rshkv avatar Feb 09 '25 01:02 rshkv

The arrow-rs change we needed (here) shipped in 54.2.0, which we've already picked up here.

That means our MapBuilder instances now have a key field that preserves the Iceberg field id in metadata.
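
For reference, roughly what this enables (a sketch with illustrative field names and ids, assuming arrow-rs >= 54.2.0 with the new with_keys_field API):

```rust
use std::collections::HashMap;

use arrow_array::builder::{Int32Builder, Int64Builder, MapBuilder};
use arrow_schema::{DataType, Field};

fn main() {
    let field_id = |id: &str| HashMap::from([("PARQUET:field_id".to_string(), id.to_string())]);

    // The finished MapArray's type now carries both field ids in metadata,
    // so it lines up with the Iceberg-derived Arrow schema.
    let builder = MapBuilder::new(None, Int32Builder::new(), Int64Builder::new())
        .with_keys_field(Field::new("key", DataType::Int32, false).with_metadata(field_id("117")))
        .with_values_field(Field::new("value", DataType::Int64, false).with_metadata(field_id("118")));
    let _ = builder;
}
```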

rshkv avatar Mar 13 '25 01:03 rshkv

With field ids and serving an Iceberg schema (as opposed to an Arrow one) addressed, this is ready for another review.

rshkv avatar Mar 13 '25 02:03 rshkv

Sorry for the long wait. I believe this PR is ready to go. Would you like to help resolve the conflicts?

Xuanwo avatar Sep 09 '25 02:09 Xuanwo

Yay! I’ll resolve conflicts

rshkv avatar Sep 09 '25 14:09 rshkv