hudi
[HUDI-9451] Avoid broadcasting unnecessary `FileSlice` when reading
Our current logic wraps all the files under each partition together as a single PartitionDirectory. So that each task knows which files it actually needs to read, we also put the collection of all FileSlice objects under the partition into the PartitionValue.
This makes it convenient for each task, when it runs, to look up the file slice it should read in the file slice mapping held in the PartitionValue.
However, I found that as the number of files in a partition grows, for example to tens of thousands of files in one partition, the file slices in the PartitionValue reach 100MB+ in size. When Spark creates read tasks, it has to pass this FileSlice mapping to every task, so under our default configuration the job fails. Moreover, each task only cares about the FileSlice it needs to read; there is no need to pass it every FileSlice under the partition.
Therefore, I optimized the above logic: only the FileSlice object that each read task needs is passed to it, which reduces the unnecessary broadcast overhead at task creation.
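To illustrate the idea, here is a minimal sketch, not the code in this PR. The `Slice` class and both `payloadFor...` helpers are hypothetical stand-ins and are not Hudi's actual `FileSlice` or file index API; the sketch only contrasts shipping the whole per-partition mapping with shipping the single entry a task reads.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

class FileSliceScoping {
    // Hypothetical, simplified stand-in for a file slice (assumption, not Hudi's class).
    static final class Slice {
        final String fileId;
        final String baseFilePath;

        Slice(String fileId, String baseFilePath) {
            this.fileId = fileId;
            this.baseFilePath = baseFilePath;
        }
    }

    // Old behaviour (sketch): every task receives the full mapping for the partition,
    // so the payload grows with the number of files in the partition.
    static Map<String, Slice> payloadForTaskBefore(Map<String, Slice> partitionSlices, String fileId) {
        return partitionSlices;
    }

    // New behaviour (sketch): a task receives only the slice it will actually read,
    // so the payload size no longer depends on the partition's file count.
    static Map<String, Slice> payloadForTaskAfter(Map<String, Slice> partitionSlices, String fileId) {
        Slice slice = partitionSlices.get(fileId);
        if (slice == null) {
            throw new IllegalArgumentException("No file slice for file id: " + fileId);
        }
        return Collections.singletonMap(fileId, slice);
    }

    public static void main(String[] args) {
        // Build a small partition with three hypothetical file slices.
        Map<String, Slice> partitionSlices = new LinkedHashMap<>();
        for (int i = 0; i < 3; i++) {
            String id = "file-" + i;
            partitionSlices.put(id, new Slice(id, "/tmp/table/p1/" + id + ".parquet"));
        }
        System.out.println("before: " + payloadForTaskBefore(partitionSlices, "file-1").size());
        System.out.println("after: " + payloadForTaskAfter(partitionSlices, "file-1").size());
    }
}
```

With tens of thousands of files in one partition, the first helper would hand every task tens of thousands of entries, while the second always hands over exactly one.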
Change Logs
- Avoid broadcasting unnecessary `FileSlice` when reading
Impact
Improves query stability.
Risk level (write none, low medium or high below)
low
Documentation Update
none
Contributor's checklist
- [x] Read through contributor's guide
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [x] CI passed
@danny0405 All checks have passed! Could you review it again? Thank you!
@danny0405 Can we continue to move this PR forward?
CI report:
- e934098d795e4582788281159b4b4a5901fd0e50 UNKNOWN
- ecf8b9b13596f89cd266fe84a2aa08baf7e0b7ed Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
@danny0405 Hi Danny, Azure passed: https://dev.azure.com/apachehudi/hudi-oss-ci/_build/results?buildId=6668&view=results
@danny0405 All checks have passed. Let's land it! Thanks!