Duplicate records after copying one partition's data files from another S3 bucket

Open njalan opened this issue 9 months ago • 4 comments

If I copy the whole table from another S3 bucket folder and then continue the streaming upsert process, there are no duplicate records and everything works fine. But if I remove all the files from one partition, copy only that partition's files from the same table in another S3 bucket, and then continue the streaming upsert process, some duplicate records appear. How can I fix this? Is there any way to update the metadata? I am using Hudi 0.9.0.

njalan avatar May 14 '24 03:05 njalan

@njalan What do you mean by "copied one partition file from same table"? Are you referring to copying the parquet files?

ad1happy2go avatar May 14 '24 06:05 ad1happy2go

@ad1happy2go I copied all the files from that partition folder, not only the parquet files.

njalan avatar May 14 '24 07:05 njalan

But the partition directory only contains the parquet files and log files (in the case of MOR), right?

If you just copy partition files, how are you updating the .hoodie timeline?

ad1happy2go avatar May 14 '24 07:05 ad1happy2go

@ad1happy2go Yes, it only has the parquet files. Is there any way to manually update the metadata?

njalan avatar May 14 '24 07:05 njalan

@njalan No, there is no way to do that, and we don't recommend it either. Instead of moving files, the best approach is to use Spark to write code that creates another Hudi table with the partitions you need.

ad1happy2go avatar May 15 '24 08:05 ad1happy2go
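
A minimal PySpark sketch of the approach suggested above: read the source table through Hudi, keep only the partitions you need, and write them through the Hudi writer so that the target table's .hoodie timeline is updated by Hudi itself rather than by a file copy. This is only an illustration, not a tested recipe for 0.9.0. The function names, paths, and field names (`id`, `dt`, `ts`) are hypothetical, and it assumes the Hudi Spark bundle is already on the classpath of the `spark` session you pass in.

```python
def hudi_write_options(table_name, record_key, partition_field, precombine_field):
    """Build Hudi datasource write options (helper name is hypothetical)."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # upsert matches incoming records to existing ones by record key
        "hoodie.datasource.write.operation": "upsert",
    }


def copy_partitions(spark, source_path, target_path, partition_field, partitions, options):
    """Rewrite selected partitions of one Hudi table into another Hudi table."""
    df = spark.read.format("hudi").load(source_path)
    # Drop Hudi's own metadata columns; the writer regenerates them.
    meta_cols = [c for c in df.columns if c.startswith("_hoodie_")]
    subset = df.filter(df[partition_field].isin(list(partitions))).drop(*meta_cols)
    # Writing through Hudi creates the commit on the target timeline.
    subset.write.format("hudi").options(**options).mode("append").save(target_path)


# Example usage (all values are placeholders):
# opts = hudi_write_options("my_table", "id", "dt", "ts")
# copy_partitions(spark, "s3://source-bucket/my_table", "s3://target-bucket/my_table",
#                 "dt", ["2024-05-01"], opts)
```

Because the data goes through the Hudi writer, the index and timeline of the target table stay consistent with the files on storage, which a raw S3 copy of a partition folder does not give you.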

@ad1happy2go Got it, thanks.

njalan avatar May 17 '24 02:05 njalan