iceberg
iceberg copied to clipboard
Core: Avoid reading ManifestFile when create ManifestReader
This patch aims to reduce the time of Iceberg task planning. We notice the task plan could not benefit from ParallelIterator a lot when there are many manifest files to read. The problem is the ManifestReader needs to read the manifest file to get the PartitionSpec which is not needed for most cases (because the ManifestFile object has the PartitionSpec Id and the Table has the mapping from PartitionSpec Id to PartitionSpec). It needs to read all these manifest files in serial when the following code of ParallelIterator is initing. And this is very slow when the HDFS traffic is busy.
https://github.com/apache/iceberg/blob/dbb8a404f6632a55acb36e949f0e7b84b643cede/core/src/main/java/org/apache/iceberg/util/ParallelIterable.java#L62
https://github.com/apache/iceberg/blob/af1d405204c1d8d242d02e658f7ef3cea4217595/core/src/main/java/org/apache/iceberg/ManifestGroup.java#L207
After this change, we could get several times (depending on the number of driver threads and count of manifest files) time reduction.
Hi @rdblue @kbendick @szehon-ho could you help to review this when you are free? Thanks a lot.
I think this is an interesting find, it makes sense to me on the first reading, so partition-spec-id on the header of manifestFile is always the same as partition-spec-id of the manifest-file entry? Looks that way from the writer. Curious for @rdblue context why we read it from the header
so partition-spec-id on the header of manifestFile is always the same as partition-spec-id of the manifest-file entry
From here: https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/ManifestFile.java#L34. I think it should be. And the partition spec id stored in the Avro header is got from the given partition spec.
This mostly looks good, but there are some unnecessary behavior changes to fix. Thanks, @ConeyLiu!
@rdblue @szehon-ho can we get this merged? thanks
Note: will merge it soon if there's no further comments.
Merged, thanks @ConeyLiu and @rdblue for review.
Thanks @szehon-ho @rdblue @nastra @zinking