hudi
[HUDI-8382] Improve MOR-Snapshot-Query performance for COW like table
In some cases, the latest file slices of a MOR table (or the file slices visible at a time-travel instant) all contain only a base file and no log files. When a snapshot query runs against such a table, we can treat it as a MOR read-optimized query and provide a HadoopFsRelation to Spark.
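The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FileSlice`, `all_base_file_only`, and `choose_relation` are hypothetical names, and the returned strings are only labels for the relation types mentioned in this description.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FileSlice:
    """Hypothetical stand-in for a file slice: one base file plus its log files."""
    base_file: Optional[str]
    log_files: List[str] = field(default_factory=list)


def all_base_file_only(file_slices: List[FileSlice]) -> bool:
    """Return True when every slice has a base file and no log files.

    An empty list also returns True (vacuous truth); a real implementation
    may want to handle an empty table separately.
    """
    return all(fs.base_file is not None and not fs.log_files for fs in file_slices)


def choose_relation(file_slices: List[FileSlice]) -> str:
    """Pick the read path for a MOR snapshot query (labels only, not real classes)."""
    if all_base_file_only(file_slices):
        # No log files to merge, so reading only the base files yields the
        # same rows as a snapshot query.
        return "HadoopFsRelation"
    # At least one slice has log files that must be merged at read time.
    return "MergeOnReadRelation"
```

For example, `choose_relation([FileSlice("a.parquet"), FileSlice("b.parquet")])` picks the fallback path, while a single slice with a non-empty `log_files` list keeps the merge-on-read path.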
Change Logs
Describe context and summary for this change. Highlight if any code was copied.
- Treat a MOR snapshot query on a table whose file slices are all base-file-only as a MOR read-optimized query.
Impact
Describe any public API or user-facing feature change or any performance impact.
none
Risk level (write none, low medium or high below)
If medium or high, explain what verification was done to mitigate the risks.
low
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".
- The config description must be updated if new configs are added or the default value of the configs are changed
- Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.
Contributor's checklist
- [x] Read through contributor's guide
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [x] CI passed
Is this change related to https://github.com/apache/hudi/pull/12080?
#12080 optimizes filter pushdown for HoodieBaseRelation by dropping unnecessary columns. My change instead treats a [MergeOnReadRelation whose file slices are all base-file-only] as a BaseFileOnlyRelation, so that it can fall back to HadoopFsRelation. Spark has many optimizations for HadoopFsRelation that can improve our query performance.
@hudi-bot run azure
CI report:
- 083aff297b487dac772c8d22ca90d28191288e8b UNKNOWN
- 0802e2bd04ed640cbf77ce10b9769b43d2995087 Azure: FAILURE Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:
- @hudi-bot run azure: re-run the last Azure build
We have this all optimized with the new filegroup reader already so I'm not sure how much longer the relation implementations will be around anyways
> We have this all optimized with the new filegroup reader already so I'm not sure how much longer the relation implementations will be around anyways
Do you mean to use NewHoodieParquetFileFormat?
It still looks like an experimental feature. In many real scenarios in our production environment, we still query through the relation implementations.
IMO, we could introduce [MOR-SNAPSHOT-QUERY-FALLBACK-TO-HadoopFsRelation] in the relation implementations rather than switching directly to NewHoodieParquetFileFormat.
Looking forward to your reply!