hudi [HUDI-8382] Improve MOR-Snapshot-Query performance for COW like table

In some cases, the latest file slices of a MOR table (or the file slices visible at a time-travel instant) all contain only a base file and no log files. When running a snapshot query against such a table, we can treat it as a MOR read-optimized query and provide a HadoopFsRelation to Spark.
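A minimal sketch of the check described above, written as a plain Python model rather than actual Hudi code. `FileSlice`, `is_base_file_only`, and `choose_relation` are hypothetical names introduced for illustration, and the returned strings are only labels for the two read paths named in this PR.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FileSlice:
    """Hypothetical stand-in for a file slice: one base file plus zero or more log files."""
    base_file: Optional[str]
    log_files: List[str] = field(default_factory=list)


def is_base_file_only(slices: List[FileSlice]) -> bool:
    """Return True when every slice has a base file and no log files.

    Note: an empty list of slices also returns True, since all() of an
    empty sequence is True.
    """
    return all(s.base_file is not None and not s.log_files for s in slices)


def choose_relation(slices: List[FileSlice]) -> str:
    """Pick a read path for a MOR snapshot query (labels are illustrative only)."""
    if is_base_file_only(slices):
        # Nothing to merge, so the query can be served like a read-optimized query.
        return "HadoopFsRelation"
    # At least one slice has log files, so base and log records must be merged.
    return "MergeOnReadRelation"
```

With this model, a table whose slices are all `FileSlice("f1.parquet")` takes the first branch, while a single slice carrying a log file is enough to keep the merge-on-read path.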

Change Logs

Treat a MOR snapshot query on a table whose file slices are all base-file-only as a MOR read-optimized query.

Impact

none

Risk level (write none, low medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

The config description must be updated if new configs are added or the default value of the configs are changed
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.

Contributor's checklist

[x] Read through contributor's guide
[x] Change Logs and Impact were stated clearly
[x] Adequate tests were added if applicable
[x] CI passed

Oct 16 '24 10:10 TheR1sing3un

Is this change related: https://github.com/apache/hudi/pull/12080 ?

Oct 17 '24 00:10 danny0405

Is this change related: #12080 ?

#12080 optimizes filter pushdown for HoodieBaseRelation by reducing unnecessary columns. My change focuses on treating a [MergeOnReadRelation with all base-file-only file-slices] as a BaseFileOnlyRelation so that we can fall back to HadoopFsRelation. Spark has many optimizations for HadoopFsRelation that can improve our query performance.

Oct 18 '24 04:10 TheR1sing3un

@hudi-bot run azure

Oct 21 '24 02:10 TheR1sing3un

CI report:

083aff297b487dac772c8d22ca90d28191288e8b UNKNOWN
0802e2bd04ed640cbf77ce10b9769b43d2995087 Azure: FAILURE Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

Oct 21 '24 03:10 hudi-bot

We have this all optimized with the new filegroup reader already so I'm not sure how much longer the relation implementations will be around anyways

Oct 21 '24 18:10 jonvex

We have this all optimized with the new filegroup reader already so I'm not sure how much longer the relation implementations will be around anyways

Do you mean using NewHoodieParquetFileFormat? It still looks like an experimental feature. In many real scenarios in our production environment, we still query with the relation implementations. IMO, maybe we could introduce [MOR-SNAPSHOT-QUERY-FALLBACK-TO-HadoopFsRelation] in the relation implementations rather than switching directly to NewHoodieParquetFileFormat. Looking forward to your reply!

Oct 22 '24 02:10 TheR1sing3un
