[HUDI-8382] Improve MOR-Snapshot-Query performance for COW like table

Open TheR1sing3un opened this issue 1 year ago • 4 comments

In some cases, every file slice in a MOR table's latest view (or in the view at a time-travel instant) contains only a base file and no log files. When a Snapshot-Query runs against such a table, we can treat it as a MOR-ReadOptimized-Query and provide a HadoopFsRelation to Spark.

Change Logs

  1. Treat a MOR snapshot query on a table whose file slices are all base-file-only as a MOR read-optimized query.

Impact

none

Risk level (write none, low medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.

Contributor's checklist

  • [x] Read through contributor's guide
  • [x] Change Logs and Impact were stated clearly
  • [x] Adequate tests were added if applicable
  • [x] CI passed

TheR1sing3un avatar Oct 16 '24 10:10 TheR1sing3un

Is this change related: https://github.com/apache/hudi/pull/12080 ?

danny0405 avatar Oct 17 '24 00:10 danny0405

> Is this change related: #12080 ?

#12080 optimizes filter pushdown for HoodieBaseRelation by pruning unnecessary columns. My change instead treats a [MergeOnReadRelation whose file slices are all base-file-only] as a BaseFileOnlyRelation, so that it can fall back to HadoopFsRelation. Spark has many optimizations for HadoopFsRelation, which can improve our query performance.
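
The check described above could look roughly like the sketch below. It is only an illustration of the idea, not the code in this PR: `FileSliceModel`, `ReadPath`, `allBaseFileOnly`, and `chooseReadPath` are hypothetical names, and the file slice is modeled as just an optional base-file path plus a list of log-file paths rather than Hudi's real classes.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified stand-ins for illustration only; these are NOT Hudi's real classes.
class SnapshotFallbackSketch {

    /** Simplified model of a file slice: an optional base file plus zero or more log files. */
    static final class FileSliceModel {
        final Optional<String> baseFilePath;
        final List<String> logFilePaths;

        FileSliceModel(Optional<String> baseFilePath, List<String> logFilePaths) {
            this.baseFilePath = baseFilePath;
            this.logFilePaths = logFilePaths;
        }
    }

    /** Which read path a snapshot query would take in this sketch. */
    enum ReadPath { BASE_FILE_ONLY, MERGE_ON_READ }

    /** True when every slice has a base file and no log files (vacuously true for an empty list). */
    static boolean allBaseFileOnly(List<FileSliceModel> slices) {
        return slices.stream()
                .allMatch(s -> s.baseFilePath.isPresent() && s.logFilePaths.isEmpty());
    }

    /** Fall back to the base-file-only path only when no slice needs log merging. */
    static ReadPath chooseReadPath(List<FileSliceModel> slices) {
        return allBaseFileOnly(slices) ? ReadPath.BASE_FILE_ONLY : ReadPath.MERGE_ON_READ;
    }
}
```

A single slice with a log file is enough to keep the query on the merge-on-read path, so the fallback never skips pending log records.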

TheR1sing3un avatar Oct 18 '24 04:10 TheR1sing3un

@hudi-bot run azure

TheR1sing3un avatar Oct 21 '24 02:10 TheR1sing3un

CI report:

  • 083aff297b487dac772c8d22ca90d28191288e8b UNKNOWN
  • 0802e2bd04ed640cbf77ce10b9769b43d2995087 Azure: FAILURE Azure: SUCCESS
Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

hudi-bot avatar Oct 21 '24 03:10 hudi-bot

We have this all optimized with the new filegroup reader already so I'm not sure how much longer the relation implementations will be around anyways

jonvex avatar Oct 21 '24 18:10 jonvex

> We have this all optimized with the new filegroup reader already so I'm not sure how much longer the relation implementations will be around anyways

Do you mean using NewHoodieParquetFileFormat? It still looks like an experimental feature. In many real scenarios in our production environment, we still query through the relation implementations. IMO, we could introduce [MOR-SNAPSHOT-QUERY-FALLBACK-TO-HadoopFsRelation] within the relation implementations rather than switching directly to NewHoodieParquetFileFormat. Looking forward to your reply!

TheR1sing3un avatar Oct 22 '24 02:10 TheR1sing3un