Szehon Ho

Results 64 comments of Szehon Ho

Yea there is some limited discussion in this related issue, but I guess no good conclusions: https://github.com/apache/iceberg/issues/2542. Maybe a test writer that creates metadata files with all optional columns as...

@ConeyLiu that's a good question. I think (I may be wrong) rewriteDataFiles groups files by partition/partition spec, and may not preserve the old schemas. I.e., all the data files are rewritten...

Yea you are right, it seems it will set the latest schema of each spec on the rewritten manifests, so the information is lost if you evolve schemas within a...

Hi, I think this is solved by https://github.com/apache/iceberg/pull/3099 (you can pass the relevant configs now supported by RetryingHiveMetaStoreClient). Can you check?

Yea, I think in current versions there is no good way that I could find, unfortunately.

I'm going to try to do some simple benchmarks to validate that it improves the perf, but I'm putting the idea out here for any early feedback.

I think this is actually promising: the performance gain is in line with expectations. Test: table with 1000 small snapshots, expire 1 at a time. The time and resources spent...

Patch should be ready for more review. FYI @aokolnychyi, this is the snapshot-based metadata scan I mentioned; not sure if it will be useful elsewhere.

I'm not sure if people think these changes are too hacky. Another option I've thought of is to implement IncrementalScan (https://github.com/apache/iceberg/pull/4580) for the All_files table (to be added in https://github.com/apache/iceberg/pull/4694), which will...

Actually I think I get why: the all_files table does not parallelize the planning (reading each snapshot in a Spark task), so maybe it's better to keep it this way (all_manifests table and ReadManifest...