[HUDI-8371] Fixing column stats index with MDT for few scenarios
Change Logs
- Support bootstrapping of col stats for MOR table.
- Fix new partition instantiation w/ MDT when there are pending operations in data table.
- Fix the clean operation with col stats. The stats were nullified, but the records themselves were not deleted from the col stats partition.
Impact
Col stats can now be enabled for a MOR table in any state. I ran into other issues along the way that I had to fix to get the patch ready:
- DirectoryInfo was not accounting for files fetched from MDT. When a new MDT partition is initialized, we poll MDT for file info rather than doing an FS-based listing. This path had a bug, which is fixed here.
- MDT writes lost data when initialization ran during a rollback or while any instant was pending in the data table. Fixed in this patch.
- When a clean from the data table was applied to MDT, we nullified the stats or marked them as deleted, but the record itself was not deleted from the col stats partition and lingered. Fixed in this patch.
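The difference between nullifying stats and deleting the record can be sketched with a toy in-memory model. This is only an illustration of the behaviour described in the last bullet, not Hudi code: the class `ColStatsCleanSketch`, the `Stats` record shape, the `"<file>|<column>"` key encoding, and both clean methods are hypothetical names made up for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a col stats partition; not Hudi's API.
final class ColStatsCleanSketch {

    // Hypothetical stats record: min/max for one column of one file.
    static final class Stats {
        final String min;
        final String max;
        final boolean deleted;

        Stats(String min, String max, boolean deleted) {
            this.min = min;
            this.max = max;
            this.deleted = deleted;
        }
    }

    // Key is "<fileName>|<columnName>" (an assumed encoding for this sketch).
    final Map<String, Stats> records = new LinkedHashMap<>();

    void put(String file, String column, String min, String max) {
        records.put(file + "|" + column, new Stats(min, max, false));
    }

    // Behaviour described as buggy: stats are nullified and flagged,
    // but the key stays in the partition.
    void cleanByNullifying(String file) {
        records.replaceAll((key, stats) ->
            key.startsWith(file + "|") ? new Stats(null, null, true) : stats);
    }

    // Behaviour described as the fix: the record is removed outright.
    void cleanByDeleting(String file) {
        records.keySet().removeIf(key -> key.startsWith(file + "|"));
    }

    public static void main(String[] args) {
        ColStatsCleanSketch nullified = new ColStatsCleanSketch();
        nullified.put("f1.parquet", "price", "1", "9");
        nullified.put("f2.parquet", "price", "3", "7");
        nullified.cleanByNullifying("f1.parquet");
        // prints "nullify: 2 records remain"
        System.out.println("nullify: " + nullified.records.size() + " records remain");

        ColStatsCleanSketch deleted = new ColStatsCleanSketch();
        deleted.put("f1.parquet", "price", "1", "9");
        deleted.put("f2.parquet", "price", "3", "7");
        deleted.cleanByDeleting("f1.parquet");
        // prints "delete: 1 records remain"
        System.out.println("delete: " + deleted.records.size() + " records remain");
    }
}
```

In the first case a reader of the partition still sees an entry for the cleaned file and has to filter it out. In the second case the entry is gone, which is the property the new clean tests check.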
Tests covered:
- Bootstrapping of both COW and MOR tables.
- Both partitioned and non-partitioned tables.
- Log files with delete blocks, partially failed log blocks, and rollback blocks are accounted for in tests.
- Added tests to validate that clean removes the entry from col stats, for both table types and for both partitioned and non-partitioned tables.
Risk level (write none, low medium or high below)
low.
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".
- The config description must be updated if new configs are added or the default value of the configs are changed
- Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.
Contributor's checklist
- [ ] Read through contributor's guide
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
@danny0405 : addressed all feedback from you
@danny0405 : addressed all comments.
@hudi-bot run azure
@hudi-bot run azure
@yihua : addressed and pushed an update.
CI report:
- 8025ba8536ac83ca97dacecc60a0b785cbd3da1e UNKNOWN
- fcb52e4b844c33ac74807a2617d0f5421fa24861 Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:
- @hudi-bot run azure: re-run the last Azure build
@nsivabalan We also need to fire a fix for the files partition under drop partition operation.
@danny0405 : yes, we have a follow-up here: https://issues.apache.org/jira/browse/HUDI-8449. Since it is enabled out of the box, I did not want to fix it in this patch, and the scope might be large.