
[SUPPORT] Order rows with same key before precombine

Open ghost opened this issue 1 year ago • 1 comments

I have a use case where I would like to use Hudi. I have to process several inserts, updates and deletes listed in a file. The file can have lots of rows for the same key, and I have to combine them in order using a file. I have developed my own Payload, but I can't process the rows in order. I set hoodie.datasource.write.precombine.field to indicate the precombine field and hoodie.payload.ordering.field to order by the same field, but it didn't work. I also tried the repartition and sort functions in the Spark code before saving to Hudi, but that didn't work either.
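For reference, the two options named above would typically be passed as write options along these lines. This is only a sketch: the table name, the column names `id` and `ts`, and the payload class `com.example.MyOrderedPayload` are hypothetical placeholders, not values taken from this issue.

```python
# Hypothetical column carrying the event order; not named in the issue.
ORDERING_COLUMN = "ts"

hudi_options = {
    "hoodie.table.name": "my_table",                      # placeholder
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "id",      # placeholder
    # The two options mentioned in the post, both pointing at the same column.
    "hoodie.datasource.write.precombine.field": ORDERING_COLUMN,
    "hoodie.payload.ordering.field": ORDERING_COLUMN,
    # Custom payload class; the class name here is made up for illustration.
    "hoodie.datasource.write.payload.class": "com.example.MyOrderedPayload",
}

# With a Spark DataFrame `df`, the write would look roughly like:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```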

ghost avatar Apr 17 '24 14:04 ghost

Currently, only the internal HFile format keeps payloads sorted within a file. For the Parquet files in the table, the merge breaks the sequence anyway. See https://github.com/apache/hudi/blob/6c6bddcef3ec383b08eb10f10ab0400f4edc41f4/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleFactory.java#L54 for reference.

I guess you might want to make the sorting configurable.
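Until something like that exists, one possible workaround (an assumption, not something confirmed in this thread) is to collapse all rows for a key into a single row, in order, before handing the batch to Hudi, so the write never sees more than one row per key. A minimal plain-Python sketch of that fold follows; the field names `id`, `seq` and `op`, and the `"D"` delete marker, are hypothetical.

```python
from collections import defaultdict


def collapse_in_order(rows, key_field="id", order_field="seq", op_field="op"):
    """Fold every row of a key, in order, into one final state per key.

    Returns {key: final_row}, where final_row is None if the last
    operation applied for that key was a delete.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key_field]].append(row)

    result = {}
    for key, events in grouped.items():
        state = None
        # sorted() is stable, so rows with equal order values keep input order.
        for event in sorted(events, key=lambda r: r[order_field]):
            if event[op_field] == "D":
                state = None  # a delete discards whatever came before
            else:
                payload = {k: v for k, v in event.items() if k != op_field}
                # Inserts and updates overlay their fields on the current state.
                state = {**(state or {}), **payload}
        result[key] = state
    return result


rows = [
    {"id": 1, "seq": 2, "op": "U", "name": "b"},
    {"id": 1, "seq": 1, "op": "I", "name": "a", "city": "x"},
    {"id": 2, "seq": 1, "op": "I", "name": "c"},
    {"id": 2, "seq": 2, "op": "D"},
]
final_rows = collapse_in_order(rows)
```

In Spark the same fold could run per key group before the write; the surviving rows would go out as upserts and the `None` entries as deletes.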

danny0405 avatar Apr 18 '24 00:04 danny0405