[HUDI-9451] Avoid broadcasting unnecessary `FileSlice` when reading

Open TheR1sing3un opened this issue 6 months ago • 2 comments

Currently, we wrap all the files under each partition into a single PartitionDirectory. So that each task knows which files it actually needs to read, we also put the collection of all FileSlice objects under the partition into the PartitionValue. When a task later executes, it can conveniently look up the file slice it has to read in that file slice mapping.

However, I found that as the number of files in a partition grows, for example to tens of thousands of files in one partition, the file slices in the PartitionValue can exceed 100 MB in size. When Spark creates read tasks, it has to pass this FileSlice mapping to every task, so under our default configuration the job fails. Moreover, each task only cares about the FileSlice it needs to read, so there is no need to pass it every FileSlice under the partition.

I therefore optimized this logic so that each read task is passed only the FileSlice object it needs to read, which reduces the unnecessary broadcast overhead at task creation.
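To make the size difference concrete, here is a minimal, self-contained sketch of the two strategies. It is not the actual Hudi or Spark code: `SimpleFileSlice`, `ReadTask`, and the planning methods are hypothetical stand-ins, and the "payload" simply models whatever would be serialized along with each task.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FileSliceShipping {

    /** Hypothetical, simplified stand-in for a Hudi FileSlice. */
    static final class SimpleFileSlice {
        final String fileId;
        final String baseFilePath;

        SimpleFileSlice(String fileId, String baseFilePath) {
            this.fileId = fileId;
            this.baseFilePath = baseFilePath;
        }
    }

    /** Hypothetical read task; payload models the data serialized with the task. */
    static final class ReadTask {
        final String fileId;
        final Map<String, SimpleFileSlice> payload;

        ReadTask(String fileId, Map<String, SimpleFileSlice> payload) {
            this.fileId = fileId;
            this.payload = payload;
        }
    }

    /** Builds a fake partition listing with numFiles file slices (paths are made up). */
    static Map<String, SimpleFileSlice> listPartition(int numFiles) {
        Map<String, SimpleFileSlice> slices = new LinkedHashMap<>();
        for (int i = 0; i < numFiles; i++) {
            String fileId = "file-" + i;
            slices.put(fileId, new SimpleFileSlice(fileId, "/tbl/p=1/" + fileId + ".parquet"));
        }
        return slices;
    }

    /** Before: every task carries the full fileId -> slice mapping of the partition. */
    static List<ReadTask> planWithFullMapping(Map<String, SimpleFileSlice> slices) {
        List<ReadTask> tasks = new ArrayList<>();
        for (String fileId : slices.keySet()) {
            tasks.add(new ReadTask(fileId, slices));
        }
        return tasks;
    }

    /** After: every task carries only the one slice it is going to read. */
    static List<ReadTask> planWithOwnSliceOnly(Map<String, SimpleFileSlice> slices) {
        List<ReadTask> tasks = new ArrayList<>();
        for (Map.Entry<String, SimpleFileSlice> e : slices.entrySet()) {
            tasks.add(new ReadTask(e.getKey(), Collections.singletonMap(e.getKey(), e.getValue())));
        }
        return tasks;
    }

    /** Sums the mapping entries that would be serialized if each task were shipped separately. */
    static long totalEntriesShipped(List<ReadTask> tasks) {
        long total = 0;
        for (ReadTask task : tasks) {
            total += task.payload.size();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, SimpleFileSlice> slices = listPartition(10_000);
        System.out.println("full mapping:   " + totalEntriesShipped(planWithFullMapping(slices)));
        System.out.println("own slice only: " + totalEntriesShipped(planWithOwnSliceOnly(slices)));
    }
}
```

With 10,000 files in the partition and one task per file, the first plan ships 10,000 × 10,000 = 100,000,000 mapping entries in total, while the second ships 10,000. The sketch counts entries rather than bytes, so it only illustrates how the cost scales, not the 100 MB+ figure reported above.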

Change Logs

  1. Avoid broadcasting unnecessary FileSlice when reading

Impact

Improves query stability when a partition contains a large number of files.

Risk level (write none, low, medium or high below)

low

Documentation Update

none

Contributor's checklist

  • [x] Read through contributor's guide
  • [x] Change Logs and Impact were stated clearly
  • [x] Adequate tests were added if applicable
  • [x] CI passed

TheR1sing3un avatar May 24 '25 16:05 TheR1sing3un

@danny0405 All checks have passed! Could you review it again? Thank you!

TheR1sing3un avatar May 26 '25 13:05 TheR1sing3un

@danny0405 Can we move this PR forward?

TheR1sing3un avatar Jun 27 '25 04:06 TheR1sing3un

CI report:

  • e934098d795e4582788281159b4b4a5901fd0e50 UNKNOWN
  • ecf8b9b13596f89cd266fe84a2aa08baf7e0b7ed Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:
  • `@hudi-bot run azure`: re-run the last Azure build

hudi-bot avatar Jul 10 '25 05:07 hudi-bot

@danny0405 Hi Danny, the Azure build passed: https://dev.azure.com/apachehudi/hudi-oss-ci/_build/results?buildId=6668&view=results

TheR1sing3un avatar Jul 10 '25 06:07 TheR1sing3un

@danny0405 All checks have passed. Let's land it! Thanks!

TheR1sing3un avatar Jul 15 '25 06:07 TheR1sing3un