Szehon Ho comments

Results: 64 comments of Szehon Ho

Core: Add schema_id to ContentFile/ManifestFile

Yea, there is some limited discussion in this related issue, but I guess no good conclusions: https://github.com/apache/iceberg/issues/2542. Maybe a test writer that creates metadata files with all optional columns as...

Core: Add schema_id to ContentFile/ManifestFile

@ConeyLiu that's a good question. I think (I may be wrong) rewriteDataFiles groups files by partition/partition spec and may not preserve the old schemas, i.e., all the data files are rewritten...

Core: Add schema_id to ContentFile/ManifestFile

Yea, you are right. It seems it will set the latest schema of each spec on the rewritten manifests, so the information is lost if you evolve schemas within a...

Make retry number and backoff policy configurable

Hi, I think this is solved by https://github.com/apache/iceberg/pull/3099 (you can pass the relevant configs now supported by RetryingHiveMetaStoreClient). Can you check?

Make retry number and backoff policy configurable

Yea, I think in current versions there is no good way that I could find, unfortunately.

Spark: Improve performance of expire snapshot by not double-scanning retained Snapshots

I'm going to try to do some simple benchmarks to validate that it improves the perf, but I'm putting the idea out here for any early feedback.

Spark: Improve performance of expire snapshot by not double-scanning retained Snapshots

I think this is actually promising; the performance gain is in line with expectations. Test: table with 1000 small snapshots, expire 1 at a time. The time and resources spent...

Spark: Improve performance of expire snapshot by not double-scanning retained Snapshots

Patch should be ready for more review. FYI @aokolnychyi, this is the snapshot-based metadata scan I mentioned; not sure if it will be useful elsewhere.

Spark: Improve performance of expire snapshot by not double-scanning retained Snapshots

I'm not sure if people think these changes are too hacky. Another option I've thought of is to implement IncrementalScan (https://github.com/apache/iceberg/pull/4580) for the all_files table (to be added in https://github.com/apache/iceberg/pull/4694), which will...

Spark: Improve performance of expire snapshot by not double-scanning retained Snapshots

Actually, I think I get why: the all_files table does not parallelize the planning (reading each snapshot in a Spark task), so maybe it's better to keep it this way (all_manifests table and ReadManifest...