iceberg icon indicating copy to clipboard operation
iceberg copied to clipboard

Core: Avoid reading ManifestFile when create ManifestReader

Open ConeyLiu opened this issue 2 years ago • 5 comments

This patch aims to reduce the time of Iceberg task planning. We notice the task plan could not benefit from ParallelIterator a lot when there are many manifest files to read. The problem is the ManifestReader needs to read the manifest file to get the PartitionSpec which is not needed for most cases (because the ManifestFile object has the PartitionSpec Id and the Table has the mapping from PartitionSpec Id to PartitionSpec). It needs to read all these manifest files in serial when the following code of ParallelIterator is initing. And this is very slow when the HDFS traffic is busy.

https://github.com/apache/iceberg/blob/dbb8a404f6632a55acb36e949f0e7b84b643cede/core/src/main/java/org/apache/iceberg/util/ParallelIterable.java#L62

https://github.com/apache/iceberg/blob/af1d405204c1d8d242d02e658f7ef3cea4217595/core/src/main/java/org/apache/iceberg/ManifestGroup.java#L207

After this change, we could get several times (depending on the number of driver threads and count of manifest files) time reduction.

ConeyLiu avatar Aug 25 '22 12:08 ConeyLiu

Hi @rdblue @kbendick @szehon-ho could you help to review this when you are free? Thanks a lot.

ConeyLiu avatar Aug 25 '22 12:08 ConeyLiu

I think this is an interesting find, it makes sense to me on the first reading, so partition-spec-id on the header of manifestFile is always the same as partition-spec-id of the manifest-file entry? Looks that way from the writer. Curious for @rdblue context why we read it from the header

szehon-ho avatar Aug 26 '22 01:08 szehon-ho

so partition-spec-id on the header of manifestFile is always the same as partition-spec-id of the manifest-file entry

From here: https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/ManifestFile.java#L34. I think it should be. And the partition spec id stored in the Avro header is got from the given partition spec.

ConeyLiu avatar Aug 26 '22 06:08 ConeyLiu

This mostly looks good, but there are some unnecessary behavior changes to fix. Thanks, @ConeyLiu!

rdblue avatar Aug 28 '22 22:08 rdblue

@rdblue @szehon-ho can we get this merged? thanks

zinking avatar Sep 15 '22 13:09 zinking

Note: will merge it soon if there's no further comments.

szehon-ho avatar Nov 01 '22 22:11 szehon-ho

Merged, thanks @ConeyLiu and @rdblue for review.

szehon-ho avatar Nov 04 '22 00:11 szehon-ho

Thanks @szehon-ho @rdblue @nastra @zinking

ConeyLiu avatar Nov 04 '22 12:11 ConeyLiu