
[HUDI-7241] Avoid always broadcast HUDI relation if not using HoodieSparkSessionExtension

Open beyond1920 opened this issue 1 year ago • 7 comments

Change Logs

After applying HUDI-6941 to our internal HUDI version (based on 0.14.0), the execution plan frequently selects a broadcast hash join and broadcasts a large HUDI data source.

I investigated the cause. The old query jobs do not set spark.sql.extensions to HoodieSparkSessionExtension, because the users do not know that the source table has been migrated from a Hive table to a HUDI table. As a result, HoodiePruneFileSourcePartitions does not take effect. When JoinSelection then calls HoodieFsRelation#Relation#sizeInBytes, which routes to FileIndex#sizeInBytes, it returns 0 because FileIndex uses lazy listing mode by default. A size of 0 is below the broadcast threshold, so the HUDI source is broadcast. After HUDI-6941, lazy listing mode is enabled by default in more cases, so the issue has become more frequent.
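To make the failure mode concrete, here is a minimal Python sketch of a size-based broadcast decision. It is a simplified model, not Spark's or HUDI's actual code: `can_broadcast` and `lazy_index_size` are hypothetical names, and the only real value assumed is Spark's default `spark.sql.autoBroadcastJoinThreshold` of 10 MB.

```python
# Simplified model of a size-based broadcast decision; not Spark's actual code.
AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024  # Spark's default, 10 MB


def can_broadcast(size_in_bytes: int,
                  threshold: int = AUTO_BROADCAST_JOIN_THRESHOLD) -> bool:
    """Return True when the estimated size permits a broadcast join."""
    return 0 <= size_in_bytes <= threshold


def lazy_index_size(listed_file_sizes) -> int:
    """Hypothetical lazy file index: sums only the files listed so far."""
    return sum(listed_file_sizes)


# Lazy listing without partition pruning: no files have been listed yet.
unlisted_estimate = lazy_index_size([])
# The same table once its files are listed (three files of 1 GiB each).
listed_estimate = lazy_index_size([1 << 30] * 3)

print(can_broadcast(unlisted_estimate))  # True: an estimate of 0 looks tiny
print(can_broadcast(listed_estimate))    # False: 3 GiB exceeds the threshold
```

The point of the sketch is that an estimate of 0 is indistinguishable from a genuinely small table, so the planner has no reason to reject the broadcast.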

This PR aims to fix issue #10343.

Impact

NA

Risk level (write none, low medium or high below)

NA

Documentation Update

NA

Contributor's checklist

  • [ ] Read through contributor's guide
  • [ ] Change Logs and Impact were stated clearly
  • [ ] Adequate tests were added if applicable
  • [ ] CI passed

beyond1920 avatar Dec 20 '23 04:12 beyond1920

Hi @beyond1920, could you please let me know why you closed this? I can assist or take over fixing this. Please let me know if you still think this is an issue.

jonvex avatar Dec 20 '23 16:12 jonvex

cc @bvaradar as well

vinothchandar avatar Dec 20 '23 20:12 vinothchandar

@jonvex @vinothchandar Thanks a lot for your attention. I closed the issue because I found that the root cause of broadcasting a large HUDI relation is that those query jobs do not set extensions to HoodieSparkSessionExtension, so HoodiePruneFileSourcePartitions does not take effect. When JoinSelection then calls HoodieFsRelation#Relation#sizeInBytes, which routes to FileIndex#sizeInBytes, it returns 0 because FileIndex uses lazy listing mode by default. That causes the HUDI source to be broadcast.

beyond1920 avatar Dec 21 '23 03:12 beyond1920

I currently solve the problem by setting extensions to HoodieSparkSessionExtension for jobs that read from a HUDI table, not only for jobs that write to one. Otherwise, query jobs that join with a HUDI table might choose to broadcast the HUDI relation.
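As an illustration of that workaround, the sketch below builds `spark-submit` arguments that register the extension for a read-only job. The helper names `with_hudi_extension` and `to_submit_args` are hypothetical; the sketch assumes the extension's fully qualified class name is `org.apache.spark.sql.hudi.HoodieSparkSessionExtension` and that `spark.sql.extensions` accepts a comma-separated list.

```python
# Hypothetical helpers: add the Hudi extension to a job's Spark configuration.
HUDI_EXTENSION = "org.apache.spark.sql.hudi.HoodieSparkSessionExtension"


def with_hudi_extension(conf: dict) -> dict:
    """Return a copy of conf whose spark.sql.extensions includes Hudi."""
    merged = dict(conf)
    current = merged.get("spark.sql.extensions", "")
    extensions = [e for e in current.split(",") if e]
    if HUDI_EXTENSION not in extensions:
        extensions.append(HUDI_EXTENSION)  # keep any extensions already set
    merged["spark.sql.extensions"] = ",".join(extensions)
    return merged


def to_submit_args(conf: dict) -> list:
    """Render a configuration dict as spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(with_hudi_extension({}))))
```

Appending to the existing value instead of overwriting it keeps any other session extensions the job already relies on.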

beyond1920 avatar Dec 21 '23 03:12 beyond1920

@jonvex @vinothchandar BTW, should HoodieFileIndex#sizeInBytes return an overestimated size instead of 0 for query jobs that forget to set HoodieSparkSessionExtension, to avoid broadcasting a very large HUDI table, as in this patch commit#be9cf?
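One way to read that proposal is sketched below. This is not the referenced patch: `estimated_size_in_bytes` and its `files_listed` flag are hypothetical, and the fallback value is borrowed from Spark's default for `spark.sql.defaultSizeInBytes` (Long.MaxValue), which is only an assumption about what a suitable overestimate would be.

```python
# Hypothetical sketch of an overestimating size fallback; not the actual patch.
LONG_MAX = 2**63 - 1  # Spark's default for spark.sql.defaultSizeInBytes
AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024  # Spark's default, 10 MB


def estimated_size_in_bytes(listed_file_sizes, files_listed: bool,
                            fallback: int = LONG_MAX) -> int:
    """Return the real size once files are listed, else a large fallback."""
    if not files_listed:
        # Size is unknown under lazy listing, so overestimate instead of 0.
        return fallback
    return sum(listed_file_sizes)


unknown = estimated_size_in_bytes([], files_listed=False)
known = estimated_size_in_bytes([4096, 8192], files_listed=True)

print(unknown > AUTO_BROADCAST_JOIN_THRESHOLD)  # True: never auto-broadcast
print(known)                                    # 12288
```

With an overestimate, a job that forgot the extension can end up with a slower join on a small table, but it never broadcasts a very large one by accident.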

beyond1920 avatar Dec 21 '23 03:12 beyond1920

cc @jonvex to take another look~

danny0405 avatar Dec 26 '23 01:12 danny0405

@bvaradar Thanks for the suggestion. I updated the PR.

beyond1920 avatar Jan 10 '24 07:01 beyond1920

CI report:

  • 21de94929f716f94ca3af46d376aa2ccfb32d791 UNKNOWN
  • 7cb7026f4564244fc96ef5518072889ada82000f Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

  • @hudi-bot run azure: re-run the last Azure build

hudi-bot avatar Jan 10 '24 11:01 hudi-bot