[SUPPORT] Order rows with same key before precombine
I have a use case where I would like to use Hudi. I have to process several inserts, updates, and deletes listed in a file. The file can contain many rows for the same key, and I have to combine them in order. I have developed my own Payload, but I cannot process the rows in order. I set hoodie.datasource.write.precombine.field to indicate the precombine field and hoodie.payload.ordering.field to order by the same field, but it didn't work. I also tried the repartition and sort functions in the Spark code before saving to Hudi, but that didn't work either.
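To make the intended semantics concrete, here is a minimal sketch in plain Python of combining all rows for a key in order. It is not Hudi code and does not use any Hudi API. The field names `id`, `seq`, and `op`, the operation codes `I`/`U`/`D`, and the function `fold_events` are all hypothetical and chosen only for illustration.

```python
from itertools import groupby
from operator import itemgetter


def fold_events(rows, key_field="id", order_field="seq", op_field="op"):
    """Collapse all events for each key into one final record.

    Events are applied in ascending order of `order_field`.
    A value of None means the key ends up deleted.
    All field names and operation codes are illustrative assumptions.
    """
    final = {}
    # Sort by key first, then by the ordering field, so groupby sees
    # each key's events contiguously and in sequence.
    ordered = sorted(rows, key=itemgetter(key_field, order_field))
    for key, events in groupby(ordered, key=itemgetter(key_field)):
        state = None
        for event in events:
            if event[op_field] == "D":
                # A delete discards whatever state was accumulated so far.
                state = None
            else:
                # Insert or update: overlay the event's columns on the state.
                payload = {k: v for k, v in event.items() if k != op_field}
                state = payload if state is None else {**state, **payload}
        final[key] = state
    return final


rows = [
    {"id": 1, "seq": 2, "op": "U", "name": "b"},
    {"id": 1, "seq": 1, "op": "I", "name": "a"},
    {"id": 2, "seq": 1, "op": "I", "name": "x"},
    {"id": 2, "seq": 2, "op": "D"},
    {"id": 3, "seq": 1, "op": "I", "name": "p", "city": "q"},
    {"id": 3, "seq": 2, "op": "U", "city": "r"},
]
result = fold_events(rows)
```

In this sketch, key 1 ends with the updated name, key 2 ends as deleted, and key 3 keeps `name` from the insert while taking `city` from the later update. This is the per-key, in-order behaviour I am trying to get when writing to Hudi.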
Currently, only the internal HFile format sorts payloads within a file. For Parquet files in the data table, the merge would break the ordering anyway. See https://github.com/apache/hudi/blob/6c6bddcef3ec383b08eb10f10ab0400f4edc41f4/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleFactory.java#L54 for reference.
I guess you might want to make the sorting configurable.