
Improve read times and reduce size of metadata.json by storing schemas in external files

Open bdilday opened this issue 1 year ago • 6 comments

Feature Request / Improvement

Improve read times of metadata.json by storing schemas in external files and referencing them by file pointer in metadata.json, instead of embedding every schema copy that accumulates with schema evolution.

Request

Instead of storing an inline copy of each table schema in metadata.json, store a pointer to a file (i.e., a file path) containing the schema. The metadata files would then look something like:

```json
"schemas": [
  {"type": "struct", "schema_file": "s3://some-bucket/some-schema.json", "schema-id": 0},
  {...},
  ...
]
```

And the file some-schema.json would contain the schema structure, e.g.:

```json
{
  "type": "struct",
  "fields": [
    {
      "id": 1,
      "name": "vendor_id",
      "type": "long",
      "required": false
    },
    ...
  ],
  "identifier-field-ids": []
}
```

The schemas would be read from the external files lazily, e.g., only the current schema would need to be read here https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L353
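A rough sketch of the lazy-loading idea (purely illustrative: `LazySchemaRegistry`, the file layout, and the use of plain `java.nio` instead of Iceberg's `FileIO` are assumptions, not existing Iceberg code):

```java
// Purely illustrative sketch of lazy schema loading; LazySchemaRegistry, the file
// layout, and the use of java.nio instead of Iceberg's FileIO are assumptions.
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LazySchemaRegistry {
  // schema-id -> location of the external schema JSON file
  private final Map<Integer, Path> schemaLocations;
  // schema-id -> schema JSON, populated only when a schema is actually requested
  private final Map<Integer, String> loadedSchemas = new ConcurrentHashMap<>();

  public LazySchemaRegistry(Map<Integer, Path> schemaLocations) {
    this.schemaLocations = schemaLocations;
  }

  /** Reads and caches the schema for the given id on first access only. */
  public String schemaJson(int schemaId) {
    return loadedSchemas.computeIfAbsent(schemaId, id -> {
      try {
        // A real implementation would go through FileIO and SchemaParser;
        // reading the raw JSON is enough to show the lazy lookup.
        return Files.readString(schemaLocations.get(id));
      } catch (IOException e) {
        throw new UncheckedIOException("Failed to read schema file for schema-id " + id, e);
      }
    });
  }
}
```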

An alternative solution, namely creating clean-up tools for schemas, has been mentioned in https://github.com/apache/iceberg/pull/3462; however, we're interested in making an OSS contribution that could prevent the metadata files from growing so large in the first place.

Motivation

We have a use case for a table with approximately 10k columns, and we expect to apply schema changes fairly often. The current JSON format is problematic for this use case because, when a schema change is applied, all previous schemas are reproduced in every subsequent metadata.json. This causes the metadata files to grow large, using up significant disk space and impacting the read time of the JSON. This issue of large metadata files has been mentioned previously in https://github.com/apache/iceberg/issues/5219, but it wasn't resolved.

We're interested in making the code changes and submitting a PR to implement the suggestion, but want to gauge community interest before pursuing it.

Query engine

None

bdilday avatar Feb 16 '24 14:02 bdilday

Thanks for raising this. As another way to improve loading speed, is it possible to compress the metadata file?
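For reference, gzip compression of metadata files can already be enabled through the existing `write.metadata.compression-codec` table property; a minimal sketch using the Java API (the catalog and table identifier here are placeholders):

```java
// Minimal sketch: turning on gzip compression for metadata.json files via the
// existing write.metadata.compression-codec table property (values: "none", "gzip").
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

public class EnableMetadataCompression {
  public static void enableGzip(Catalog catalog, TableIdentifier identifier) {
    Table table = catalog.loadTable(identifier);
    table.updateProperties()
        .set("write.metadata.compression-codec", "gzip") // subsequent metadata files are written gzip-compressed
        .commit();
  }
}
```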

liurenjie1024 avatar Feb 22 '24 09:02 liurenjie1024

@bdilday Definitely interested in this feature.

We had an issue where metadata.json grew so large with schema changes (in the GBs) that it was causing performance issues within the query engine we were using (Trino). For this particular table, we only retained a few days of snapshot history and ended up manually pruning schema versions that were not used by recent snapshots to cut down on file size.

However, the solution you proposed would be more robust and better suited to cases where you want to time travel on very wide tables.

moneymoroch avatar Apr 04 '24 17:04 moneymoroch

+1 to this idea as well.

I think this proposal will be a significant improvement for use cases where:

  1. There are many schema changes
  2. The schema is already large (think OBTs)

Moving the schemas to external files will reduce not only the number of bytes read, but also the number of bytes copied into each new metadata file on commit, since existing schema pointers can be reused (see the sketch below).
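A rough illustration of that pointer reuse (the `SchemaRef` type and helper below are hypothetical, not part of Iceberg):

```java
// Hypothetical sketch of schema-pointer reuse on commit; SchemaRef and this helper
// do not exist in Iceberg, they only illustrate the idea.
import java.util.ArrayList;
import java.util.List;

public class SchemaPointerReuse {
  // A schema entry that stores only an id and a file location instead of the full schema JSON.
  record SchemaRef(int schemaId, String schemaFileLocation) {}

  // The next metadata version carries prior pointers forward unchanged and only
  // writes out a file for the newly added schema.
  static List<SchemaRef> nextSchemas(List<SchemaRef> previous, int newSchemaId, String newSchemaFile) {
    List<SchemaRef> next = new ArrayList<>(previous); // copies pointers, not schema bodies
    next.add(new SchemaRef(newSchemaId, newSchemaFile));
    return next;
  }
}
```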

sungwy avatar Apr 09 '24 14:04 sungwy

This would be a really big change to the spec, which doesn't mean we shouldn't do it, but does mean it would require a lot of clients to update. It seems like this is the sort of thing that would be better fixed by a REST implementation which could make those kind of decisions behind the scenes.

RussellSpitzer avatar Apr 09 '24 15:04 RussellSpitzer

Yeah, I agree that it would definitely need a change to the spec and maybe this is something we could discuss for the V3 Spec milestone?

It seems like this is the sort of thing that would be better fixed by a REST implementation which could make those kind of decisions behind the scenes.

That's an interesting suggestion @RussellSpitzer - I can see how the REST Catalog could automatically trim the older schemas on commit, which would reduce the file sizes and therefore improve read and write performance. But this sounds like a feature that would be better supported through a proper API like the one you proposed before: https://github.com/apache/iceberg/pull/3462.
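A very rough sketch of what such commit-time trimming could look like (all types and names below are illustrative stand-ins for the real metadata model):

```java
// Illustrative sketch of trimming schemas not referenced by retained snapshots;
// SchemaEntry and SnapshotEntry are stand-ins for the real metadata model.
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class SchemaTrimming {
  record SchemaEntry(int schemaId, String schemaJson) {}
  record SnapshotEntry(long snapshotId, int schemaId) {}

  // Keeps the current schema plus any schema still referenced by a retained snapshot;
  // everything else is dropped from the next metadata version.
  static List<SchemaEntry> trim(List<SchemaEntry> schemas,
                                List<SnapshotEntry> retainedSnapshots,
                                int currentSchemaId) {
    Set<Integer> referenced = new HashSet<>();
    referenced.add(currentSchemaId);
    for (SnapshotEntry snapshot : retainedSnapshots) {
      referenced.add(snapshot.schemaId());
    }
    return schemas.stream()
        .filter(s -> referenced.contains(s.schemaId()))
        .collect(Collectors.toList());
  }
}
```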

Was there any other handling you had in mind that we could introduce into a REST implementation that wouldn't affect the existing Spec?

sungwy avatar Apr 09 '24 16:04 sungwy