[SUPPORT] Spark planner chooses broadcast hash join for large HUDI data source
After applying HUDI-6941 to our internal HUDI version (based on 0.14.0), the execution plan frequently selects a broadcast hash join that broadcasts a large HUDI data source. I tried to investigate the cause of this issue.
In those cases, the HUDI source is read through `HadoopFsRelation`, and Spark's `JoinSelection` calls `HadoopFsRelation#sizeInBytes` to estimate the relation size and decide whether to use a broadcast join. `HadoopFsRelation#sizeInBytes` in turn calls `HoodieFileIndex#sizeInBytes`. But at that moment no partitions have been loaded, because Hudi's file-index implementation uses the lazy file-listing mode by default. So `FileIndex#cachedAllInputFileSlices` is an empty map, `HadoopFsRelation#sizeInBytes` returns 0, and that leads to the suboptimal join plan.
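For context, this is roughly the size check that `JoinSelection` applies (a simplified sketch paraphrased from Spark 3.x, not verbatim source): a reported size of 0 trivially falls under `spark.sql.autoBroadcastJoinThreshold` (10 MB by default), so the zero estimate alone is enough to make the planner broadcast the relation.

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.internal.SQLConf

// Simplified sketch of Spark's broadcast-by-size decision.
// plan.stats.sizeInBytes is ultimately fed by HadoopFsRelation#sizeInBytes,
// so a HoodieFileIndex that reports 0 makes any table look broadcastable.
def canBroadcastBySize(plan: LogicalPlan, conf: SQLConf): Boolean = {
  val size = plan.stats.sizeInBytes
  // 0 <= autoBroadcastJoinThreshold (default 10 MB), so size == 0 passes.
  size >= 0 && size <= conf.autoBroadcastJoinThreshold
}
```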
After applying HUDI-6941, more cases enable the lazy listing mode by default, so the issue has become more frequent.
@xuzifu666 @codope Please help me confirm whether my analysis of this issue is correct. Would it be better for `FileIndex#sizeInBytes` to return `Long.MaxValue` instead of 0 when the `FileIndex` has not done partition pruning yet?
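A minimal sketch of what that guard could look like, assuming a lazily populated slice cache; `SizeEstimatingFileIndex`, `FileSliceLike`, and `totalFileSizeBytes` are illustrative stand-ins, not the actual `HoodieFileIndex` internals:

```scala
trait SizeEstimatingFileIndex {
  // Hypothetical stand-in for Hudi's cached listing:
  // partition path -> file slices, empty until listing happens.
  case class FileSliceLike(totalFileSizeBytes: Long)
  def cachedAllInputFileSlices: Map[String, Seq[FileSliceLike]]

  def sizeInBytes: Long = {
    if (cachedAllInputFileSlices.isEmpty) {
      // Lazy listing mode: nothing listed yet, so a sum would be 0.
      // Overestimate instead, so JoinSelection rules out broadcast.
      Long.MaxValue
    } else {
      cachedAllInputFileSlices.values.flatten.map(_.totalFileSizeBytes).sum
    }
  }
}
```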
Already found the root cause: the query job does not set `HoodieSparkSessionExtension` in its Spark extensions, so the `HoodiePruneFileSourcePartitions` rule does not take effect.
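For anyone hitting the same thing, this is the usual way to attach the extension when building the session (per the standard Hudi setup; the app name is just an example):

```scala
import org.apache.spark.sql.SparkSession

// With the extension registered, HoodiePruneFileSourcePartitions can
// prune partitions before join planning, so sizeInBytes is meaningful.
val spark = SparkSession.builder()
  .appName("hudi-query")
  .config("spark.sql.extensions",
    "org.apache.hudi.HoodieSparkSessionExtension")
  .config("spark.serializer",
    "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
```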
BTW, should we return an overestimated size rather than 0 from `HoodieFileIndex#sizeInBytes` for query jobs that forget to set `HoodieSparkSessionExtension`, to avoid broadcasting a very large HUDI table, like this patch commit#be9cf?