[HUDI-8371] Fixing column stats index with MDT for few scenarios

Open nsivabalan opened this issue 1 year ago • 2 comments

Change Logs

  • Support bootstrapping of col stats for MOR tables.
  • Fix new partition instantiation with MDT when there are pending operations in the data table.
  • Fix the clean operation with col stats. Even though the stats were nullified, the records themselves apparently were not deleted from the col stats partition.

Impact

With this patch, we can enable col stats for a MOR table at any given state. I ran into other issues along the way, which I had to fix to get the patch ready.

  • DirectoryInfo was not accounting for files fetched from MDT. When a new MDT partition is initialized, we poll MDT to fetch file info rather than doing FS-based listing. This path had a bug, which is fixed here.
  • MDT writes had data loss while trying to initialize during a rollback or with any pending instant in the data table. Fixed the same in this patch.
  • When a clean from the data table was applied to MDT, we were nullifying the stats or marking them as deleted, but the record itself was not deleted from the col stats partition and lingered. Fixed the same in this patch.
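
To illustrate the clean issue in the last bullet, here is a minimal sketch that uses a plain in-memory map as a stand-in for the col stats partition. It is not Hudi code: `StatsRecord`, `key`, `cleanByNullifying`, and `cleanByDeleting` are hypothetical names, and the key layout (column name plus file name) is an assumption made only for the example. It contrasts marking a record as deleted, which leaves the entry in place, with actually removing it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ColStatsCleanSketch {

    /** Hypothetical stand-in for one col stats record; not Hudi's real payload class. */
    static final class StatsRecord {
        final Long min;
        final Long max;
        final boolean deleted;

        StatsRecord(Long min, Long max, boolean deleted) {
            this.min = min;
            this.max = max;
            this.deleted = deleted;
        }
    }

    /** Hypothetical key layout (an assumption): column name plus file name. */
    static String key(String column, String fileName) {
        return column + "|" + fileName;
    }

    /** A toy "col stats partition" holding stats for one column across two files. */
    static Map<String, StatsRecord> samplePartition() {
        Map<String, StatsRecord> partition = new HashMap<>();
        partition.put(key("price", "file-1.parquet"), new StatsRecord(1L, 90L, false));
        partition.put(key("price", "file-2.parquet"), new StatsRecord(5L, 70L, false));
        return partition;
    }

    /** Behaviour described as buggy: stats are nullified and flagged, but the entry stays. */
    static void cleanByNullifying(Map<String, StatsRecord> partition, List<String> cleanedKeys) {
        for (String k : cleanedKeys) {
            partition.computeIfPresent(k, (ignored, old) -> new StatsRecord(null, null, true));
        }
    }

    /** Behaviour after the fix, as described: the entry for a cleaned file is removed. */
    static void cleanByDeleting(Map<String, StatsRecord> partition, List<String> cleanedKeys) {
        for (String k : cleanedKeys) {
            partition.remove(k);
        }
    }

    public static void main(String[] args) {
        List<String> cleaned = List.of(key("price", "file-1.parquet"));

        Map<String, StatsRecord> nullified = samplePartition();
        cleanByNullifying(nullified, cleaned);
        System.out.println("nullify only: " + nullified.size() + " entries remain");

        Map<String, StatsRecord> deleted = samplePartition();
        cleanByDeleting(deleted, cleaned);
        System.out.println("delete: " + deleted.size() + " entries remain");
    }
}
```

In the sketch, the nullify-only path keeps one entry per cleaned file around indefinitely, which is the "lingering" behaviour described above; the delete path shrinks the map as files are cleaned.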

Tests covered:

  • Bootstrapping of both COW and MOR tables.
  • Covered both partitioned and non-partitioned tables.
  • Ensured that log files with delete blocks, partially failed log blocks, and rollback blocks are accounted for in tests.
  • Added tests to validate that clean does remove the entry from col stats, for both table types and for both partitioned and non-partitioned tables.

Risk level (write none, low, medium or high below)

low.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.

Contributor's checklist

  • [ ] Read through contributor's guide
  • [ ] Change Logs and Impact were stated clearly
  • [ ] Adequate tests were added if applicable
  • [ ] CI passed

nsivabalan avatar Oct 15 '24 17:10 nsivabalan

@danny0405 : addressed all feedback from you

nsivabalan avatar Oct 20 '24 16:10 nsivabalan

@danny0405 : addressed all comments.

nsivabalan avatar Oct 22 '24 06:10 nsivabalan

@hudi-bot run azure

nsivabalan avatar Oct 22 '24 14:10 nsivabalan

@hudi-bot run azure

nsivabalan avatar Oct 22 '24 19:10 nsivabalan

@yihua : addressed and pushed an update.

nsivabalan avatar Oct 27 '24 03:10 nsivabalan

CI report:

  • 8025ba8536ac83ca97dacecc60a0b785cbd3da1e UNKNOWN
  • fcb52e4b844c33ac74807a2617d0f5421fa24861 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

  • @hudi-bot run azure: re-run the last Azure build

hudi-bot avatar Oct 27 '24 05:10 hudi-bot

@nsivabalan We also need to fire a fix for the files partition under drop partition operation.

danny0405 avatar Oct 27 '24 11:10 danny0405

@danny0405 : Yes, we have a follow-up here: https://issues.apache.org/jira/browse/HUDI-8449. Since it's enabled out of the box, I did not want to fix it in this patch, and the scope might be large.

nsivabalan avatar Oct 27 '24 14:10 nsivabalan