hudi [HUDI-7378] Fix Spark SQL DML with custom key generator

Change Logs

Describe context and summary for this change. Highlight if any code was copied.

Impact

Describe any public API or user-facing feature change or any performance impact.

Risk level (write none, low medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

The config description must be updated if new configs are added or the default value of the configs are changed
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.

Contributor's checklist

[ ] Read through contributor's guide
[ ] Change Logs and Impact were stated clearly
[ ] Adequate tests were added if applicable
[ ] CI passed

Feb 04 '24 00:02 yihua

Is there any change to partitions in hoodie.properties? Do we now write it as field1:type,field2:type2 when using CustomKeyGenerator?

There is no change to the table configs in hoodie.properties, i.e., the hoodie.table.partition.fields contains the comma-separated list of partition field names like "segment,ts" (no type for custom key generator). This PR opens the opportunity to override the hoodie.datasource.write.partitionpath.field with SET TBLPROPERTIES at the table level in the Spark catalog, so that SQL DML can derive the correct write config of the partition fields (e.g., "segment:simple,ts:timestamp" instead of "segment,ts").
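To make the distinction concrete, here is a minimal Python sketch of how the two values relate. The helper names and the table name `hudi_tbl` are hypothetical and not part of this PR. The generated statement assumes the standard Spark SQL `ALTER TABLE ... SET TBLPROPERTIES` form, and the exact syntax accepted may differ by Spark or Hudi version.

```python
# Hypothetical helpers for illustration only; not code from this PR.

WRITE_CONFIG_KEY = "hoodie.datasource.write.partitionpath.field"


def strip_key_types(write_config: str) -> str:
    """Drop the ':type' suffix from each field, giving the plain
    comma-separated names as stored in hoodie.table.partition.fields."""
    return ",".join(
        part.split(":", 1)[0].strip() for part in write_config.split(",")
    )


def set_partition_path_sql(table: str, write_config: str) -> str:
    """Build a statement that overrides the write config at the table level.
    Assumes the standard Spark SQL ALTER TABLE ... SET TBLPROPERTIES form."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'{WRITE_CONFIG_KEY}' = '{write_config}')"
    )


typed = "segment:simple,ts:timestamp"
print(strip_key_types(typed))                      # value kept in hoodie.properties
print(set_partition_path_sql("hudi_tbl", typed))   # override used by SQL DML
```

The table config keeps only the field names, while the table-level override carries the typed form that the custom key generator needs.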

Thanks for adding extensive tests. Can you please look into the failures? They seem related to the patch.

Failures for Spark 3.2 and above are fixed. I'm looking into failures for older Spark versions.

Apr 11 '24 16:04 yihua

I like that this has the benefit of not breaking tables with their existing hoodie.table.recordkey.fields, but I am curious about any other approaches you thought about. From your test code, it looks like we can't use partitioned by (dt:int,idk:string) when creating the table. I don't think that should block this PR from landing, but I think we should add an example to the SQL documentation: https://hudi.apache.org/docs/sql_ddl#create-partitioned-table

Good point. I tried the partitioned by statement but it did not work either, due to the same issue with the write config of the partition fields. But you're right that adding a new table config indicating the partition field types should solve the problem fundamentally. We should update the SQL docs on any gaps here.

Also, I think this change will help us fix partition pruning, which currently does not work with the timestamp keygen: https://issues.apache.org/jira/browse/HUDI-6614

Right.

Apr 12 '24 18:04 yihua

CI report:

805ba35b65afbb1daccbcf00291fd520a69c5584 Azure: SUCCESS

Apr 13 '24 00:04 hudi-bot
