
[SUPPORT] Data loss due to incorrect selection of log file during compaction

[Open] Ytimetravel opened this issue 1 year ago • 14 comments

Dear community:

I found an issue while using Hudi: there may be data loss when compacting incremental data. Through investigation, I found that the problem was introduced by the following bugfix: [HUDI-6758] Detecting and skipping spurious log blocks with MOR reads (#9545). After testing, it appears that choosing among the files generated by speculative execution has a probability of causing data loss.

Expected behavior

While retaining this optimization, it is also necessary to ensure that the selected files are always correct, so that the data remains accurate.

Environment Description

  • Hudi version : 0.14.0

Additional context

The main logic that led to this issue is shown in the attached screenshot. I temporarily avoided this issue by retaining all log files with the same number of blocks. I would like to know if there is a better way to solve this problem.
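For illustration, the workaround described above can be sketched roughly like this (a minimal Python sketch under my own assumptions, not Hudi's actual reader code; `prune_log_files` and its tuple format are hypothetical):

```python
# Hypothetical sketch of the workaround: when several candidate log files
# (e.g. produced by speculative task attempts) carry the same number of
# blocks, keep them all instead of arbitrarily picking one, so that no
# complete sequence of records can be dropped.

def prune_log_files(log_files):
    """log_files: list of (file_name, num_blocks) tuples."""
    if not log_files:
        return []
    max_blocks = max(n for _, n in log_files)
    # Keep every file that ties for the maximum block count.
    return [(f, n) for f, n in log_files if n == max_blocks]

# Two speculative attempts wrote the same number of blocks: both are kept.
print(prune_log_files([("log.1_0-24780", 1), ("log.1_1-25536", 1)]))
```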

Ytimetravel avatar Mar 03 '24 09:03 Ytimetravel

cc @nsivabalan for taking care of this issue~

danny0405 avatar Mar 04 '24 00:03 danny0405

More detailed information about this case: the committed log file contains 28,872 records (screenshot). The number of merged records is 20,400 (screenshot). From the compaction plan file, it can be seen that this write operation created 2 log files in this bucket (screenshot). After deleting the speculated log file, the number of merged records is correct (screenshots).

Ytimetravel avatar Mar 05 '24 05:03 Ytimetravel

The speculative task was killed (screenshot).

Ytimetravel avatar Mar 05 '24 06:03 Ytimetravel

Thanks, will take a look. Sorry about the data loss.

nsivabalan avatar Mar 05 '24 07:03 nsivabalan

Hey @Ytimetravel: Do you have pipelines where Spark speculative execution is disabled? I'm trying to gauge the blast radius of the impact. Do we see this even without speculative execution, or only when speculative execution is enabled?

nsivabalan avatar Mar 05 '24 16:03 nsivabalan

And do you happen to have the data intact? We can enable debug logging and see exactly what's going wrong. If you have the data, I can sync up with you directly: https://join.slack.com/t/apache-hudi/shared_invite/zt-20r833rxh-627NWYDUyR8jRtMa2mZ~gg. My handle is "shivNarayan". Feel free to ping me directly. I would like to get to the bottom of this ASAP. Since it's due to Spark speculative execution, it might be tough for us to reproduce ourselves.

nsivabalan avatar Mar 05 '24 16:03 nsivabalan

One suspicion I have: when there are two sequences of log blocks, both with a similar number of blocks, we have to choose one of them. But we can't deterministically say which sequence is the good one, so we choose one arbitrarily, and there is a chance we chose the wrong sequence. Still, I am curious how that could end up in data loss.

nsivabalan avatar Mar 05 '24 16:03 nsivabalan

So to summarize the issue: for the file group of interest, without Spark speculative execution, only one log file (the one ending with 24780) should have been added. But with speculative execution, we found two log files (one ending with 24780 and one ending with 25536) were added.

The compaction plan shows two log files.

From the reader side, Hudi checks for log block sequences, and since both sequences have just 1 log file, Hudi chooses one of them; there is no determinism here.

But I am curious how the 2nd log file (the one ending with 25536) was fully formed without any corruption.

Let me also illustrate with a simple example.

FG1 base file: has 10k records.

Commit2: adds 5k new inserts. In the happy path, this commit would have added a new log file with 5k inserts, and the next compaction would produce a parquet file with 15k records.

Now let's simulate the issue reported in this ticket. Due to Spark speculative execution, two log files were added: one contains 5k records while the other (the 2nd log file) contains 2k records. From the screenshots shared, both log files have the same log version number. (So it can't be a scenario where a single append handle wrote two log files.)

Let's call them logfile1 and logfile2. Compaction will include both log files in the plan.

And with our fix in https://github.com/apache/hudi/pull/9545, the log record reader does prune some log blocks. In this case, if both log blocks are deemed valid data blocks (not corrupted), there is a chance our log record reader chose to process just logfile2 and dropped logfile1. (But if logfile2 was deemed corrupted, Hudi would have chosen logfile1.)

Assuming the 2nd log file is properly formed, this would end up producing a parquet file with 12k records, causing the data loss.

But as I called out, this is possible only if logfile2 is not corrupted. And if logfile2 is not corrupted, it should contain all 5k records, just like logfile1. If it only contains a subset of the records in logfile1, we are not sure how we could have ended up in that state.
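To make the arithmetic in this example concrete, here is a tiny Python simulation of the scenario (purely illustrative; the counts follow the example above, and the selection step is a simplified stand-in for the reader's pruning logic):

```python
# Simulate the example: a base file with 10k records, and two log files
# written for the same commit by speculative execution. If the reader
# prunes one sequence and happens to keep the shorter one, records are lost.

base_records = 10_000
logfile1 = 5_000   # the complete log file (all inserts of commit2)
logfile2 = 2_000   # the partial file left behind by the killed speculative task

# Correct outcome: exactly one complete copy of the commit is merged.
expected_after_compaction = base_records + logfile1   # 15,000

# Buggy outcome: the reader non-deterministically keeps logfile2 instead.
actual_after_compaction = base_records + logfile2     # 12,000

print(expected_after_compaction - actual_after_compaction)  # records lost
```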

nsivabalan avatar Mar 05 '24 17:03 nsivabalan

Sorry for all the follow-up questions. Can you tell us what storage scheme you are using? Partial write failures should not happen with S3 or other cloud stores; if it had been HDFS, we could see partial write failures. We are trying to gauge whether the 2nd log file was properly formed, or whether it was corrupted by a partial write failure.

If you have a backup of the data, let us know. We can share a tool that can spit out info about the log files (valid log blocks, number of valid records, etc.), which might help with our triaging.

nsivabalan avatar Mar 06 '24 00:03 nsivabalan

Hey, I wrote a tool that can help us spit out some meta info about our log blocks and records. Here is the branch: https://github.com/nsivabalan/hudi/tree/printAllVersionsOfRecordTool

Can you help us run the tool and share the output with us?

It's a spark-submit command, and it is going to log some info about the log files we are interested in.

Sample command:

```shell
./bin/spark-submit \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --class org.apache.hudi.utilities.PrintRecordsTool \
  PATH_TO_BUNDLE/hudi-utilities-bundle_2.12-0.15.0-SNAPSHOT.jar \
  --props /tmp/props.in \
  --base-path /tmp/hudi_trips_mor/ \
  --partition-path asia/india/chennai \
  --file-id c3ef010f-61ae-4aa3-a033-25b278da17c6-0 \
  --base-instant-time 20240302002723362 \
  --print-log-blocks-info
```

where `/tmp/props.in` contains:

```properties
hoodie.datasource.write.recordkey.field=uuid
hoodie.datasource.write.partitionpath.field=partitionpath
hoodie.datasource.write.precombine.field=ts
```
Ensure you set the right values for the partition path, file ID, and base instant time. This should help with our triaging.

nsivabalan avatar Mar 06 '24 01:03 nsivabalan

Thank you very much for following up on the issue and providing feedback. I will use the tool you provided to obtain some meta info about our log blocks and records, and will get back to you later~

Ytimetravel avatar Mar 06 '24 02:03 Ytimetravel

@nsivabalan @Ytimetravel Here is another data loss case, caused by a whole-stage retry.

There are 4 cases in which a task retries:

  • The task is slow, and a speculative copy of the task is launched
  • The task failed and is retried
  • The stage failed and is retried
  • The executor failed, and its tasks are retried

For the third case, where the stage is retried, the task attempt number might be reset to its original value. Using the attempt number alone to identify a block is not enough to handle this case; comparing block sizes per attempt number might lead to wrong results. We might need to use both stageAttemptNumber and attemptNumber to identify it, or find another solution. WDYT?
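A quick sketch of the identification problem described above (illustrative Python, not Hudi's actual log block header fields): keying blocks by the task attempt number alone collides across a stage retry, while a (stageAttemptNumber, attemptNumber) pair stays unique:

```python
# With the task attempt number alone, a stage retry (attempt numbers reset
# to 0) makes two distinct writes look identical to the reader:
writes = [
    {"stage_attempt": 0, "task_attempt": 0, "blocks": 1},  # original stage
    {"stage_attempt": 1, "task_attempt": 0, "blocks": 1},  # stage retried
]

by_attempt = {w["task_attempt"] for w in writes}
assert len(by_attempt) == 1   # collision: both writes map to attempt 0

by_stage_and_attempt = {(w["stage_attempt"], w["task_attempt"]) for w in writes}
assert len(by_stage_and_attempt) == 2   # unique across stage retries
```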

beyond1920 avatar Mar 24 '24 04:03 beyond1920

@Ytimetravel Did you get a chance to run that tool and collect meta info about the log blocks and records?

ad1happy2go avatar Apr 01 '24 12:04 ad1happy2go

@ad1happy2go I'm sorry, I only just saw your message. I didn't succeed in using that tool.

Ytimetravel avatar Apr 15 '24 06:04 Ytimetravel

https://github.com/apache/hudi/pull/10922

nsivabalan avatar Sep 13 '24 00:09 nsivabalan

Can we close out the issue?

nsivabalan avatar Sep 13 '24 00:09 nsivabalan