[HUDI-7378] Fix Spark SQL DML with custom key generator
Change Logs
Describe context and summary for this change. Highlight if any code was copied.
Impact
Describe any public API or user-facing feature change or any performance impact.
Risk level (write none, low, medium or high below)
If medium or high, explain what verification was done to mitigate the risks.
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change
- The config description must be updated if new configs are added or the default value of the configs are changed
- Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.
Contributor's checklist
- [ ] Read through contributor's guide
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
- Is there any change to partitions in hoodie.properties? Do we now write it as field1:type,field2:type2 when using CustomKeyGenerator?
There is no change to the table configs in hoodie.properties, i.e., hoodie.table.partition.fields still contains the comma-separated list of partition field names like "segment,ts" (no type for the custom key generator). This PR makes it possible to override hoodie.datasource.write.partitionpath.field with SET TBLPROPERTIES at the table level in the Spark catalog, so that SQL DML can derive the correct write config of the partition fields (e.g., "segment:simple,ts:timestamp" instead of "segment,ts").
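To make the difference between the two values concrete, here is a minimal sketch in Python. It only builds strings: the helper names and the table name `hudi_tbl` are hypothetical and are not part of Hudi, and it assumes the override is applied through a regular Spark SQL `ALTER TABLE ... SET TBLPROPERTIES` statement.

```python
# Illustrative sketch only: these helpers and the table name are hypothetical,
# not Hudi APIs. They show how the typed write config differs from the table config.

ALLOWED_TYPES = {"simple", "timestamp"}  # partition types used in the example above


def partition_path_spec(fields):
    """Build a typed spec such as 'segment:simple,ts:timestamp'."""
    parts = []
    for name, kind in fields:
        kind = kind.lower()
        if kind not in ALLOWED_TYPES:
            raise ValueError(f"unsupported partition type: {kind}")
        parts.append(f"{name}:{kind}")
    return ",".join(parts)


def alter_table_statement(table, fields):
    """Render the SQL that sets the write config as a table property."""
    spec = partition_path_spec(fields)
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'hoodie.datasource.write.partitionpath.field' = '{spec}')"
    )


fields = [("segment", "simple"), ("ts", "timestamp")]

# Value kept in hoodie.table.partition.fields: names only, no types.
table_config_value = ",".join(name for name, _ in fields)

# Statement carrying the typed override for SQL DML.
stmt = alter_table_statement("hudi_tbl", fields)

print(table_config_value)
print(stmt)
```

The names-only value is what stays in hoodie.properties, while the typed value lives only in the catalog table properties, which is why existing tables are not affected.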
- Thanks for adding extensive tests. Can you please look into the failures? They seem related to the patch.
Failures for Spark 3.2 and above are fixed. I'm looking into failures for older Spark versions.
I like that this has the benefit of not breaking tables with their existing hoodie.table.recordkey.fields, but I am curious about any other approaches you thought about. From your test code, it looks like we can't use partitioned by (dt:int,idk:string) when creating the table. I don't think that should block this PR from landing, but I think we should add an example to the SQL documentation: https://hudi.apache.org/docs/sql_ddl#create-partitioned-table
Good point. I tried the partitioned by statement, but it did not work either, due to the same issue with the write config of the partition fields. You're right that adding a new table config indicating the partition field types should solve the problem fundamentally. We should update the SQL docs to cover any gaps here.
Also, I think this change will help us fix partition pruning, which currently does not work with the timestamp keygen: https://issues.apache.org/jira/browse/HUDI-6614
Right.
CI report:
- 805ba35b65afbb1daccbcf00291fd520a69c5584 Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:
- @hudi-bot run azure: re-run the last Azure build