
[SUPPORT] Data loss due to incorrect selection of log file during compaction

[Open] Ytimetravel opened this issue 1 year ago • 14 comments

Dear community:

I found an issue while using Hudi: there may be data loss when compacting incremental data. Through investigation, I found that the problem was introduced by the following bugfix: [HUDI-6758] Detecting and skipping spurious log blocks with MOR reads (#9545). After testing, it appears that choosing among the files generated by speculative execution has a probability of causing data loss.

Expected behavior

While retaining this optimization, it is also necessary to ensure that the selected files are always correct, so that the data remains accurate.

Environment Description

  • Hudi version : 0.14.0

Additional context

The main logic that led to this issue is shown in the attached screenshot. I temporarily avoided this issue by retaining all log files with the same number of blocks. I would like to know if there is a better way to solve this problem.
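For illustration, the workaround described above can be sketched roughly like this (a minimal Python sketch under my own assumptions, not Hudi's actual reader code; `prune_log_files` and its tuple format are hypothetical):

```python
# Hypothetical sketch of the workaround: when several candidate log files
# (e.g. produced by speculative task attempts) carry the same number of
# blocks, keep them all instead of arbitrarily picking one, so that no
# complete sequence of records can be dropped.

def prune_log_files(log_files):
    """log_files: list of (file_name, num_blocks) tuples."""
    if not log_files:
        return []
    max_blocks = max(n for _, n in log_files)
    # Keep every file that ties for the maximum block count.
    return [(f, n) for f, n in log_files if n == max_blocks]

# Two speculative attempts wrote the same number of blocks: both are kept.
print(prune_log_files([("log.1_0-24780", 1), ("log.1_1-25536", 1)]))
```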

Ytimetravel avatar Mar 03 '24 09:03 Ytimetravel

cc @nsivabalan for taking care of this issue~

danny0405 avatar Mar 04 '24 00:03 danny0405

More detailed information about this case: the committed log file contains 28,872 records (screenshot). The number of merged records is 20,400 (screenshot). From the compaction plan file, it can be seen that this write operation created 2 log files in this bucket (screenshot). After deleting the speculated log file, the number of merged records is correct (screenshots).

Ytimetravel avatar Mar 05 '24 05:03 Ytimetravel

The speculative task was killed (screenshot).

Ytimetravel avatar Mar 05 '24 06:03 Ytimetravel

Thanks, will take a look. Sorry about the data loss.

nsivabalan avatar Mar 05 '24 07:03 nsivabalan

Hey @Ytimetravel: Do you have pipelines where Spark speculative execution is disabled? I'm trying to gauge the blast radius of the impact. Do we see this even without speculative execution, or only when speculative execution is enabled?

nsivabalan avatar Mar 05 '24 16:03 nsivabalan

And do you happen to have the data intact? We can enable debug logging and see exactly what's going wrong. If you have the data, I can sync up with you directly: https://join.slack.com/t/apache-hudi/shared_invite/zt-20r833rxh-627NWYDUyR8jRtMa2mZ~gg. My handle is "shivNarayan". Feel free to ping me directly. I would like to get to the bottom of this ASAP. Since it's due to Spark speculative execution, it might be tough for us to reproduce ourselves.

nsivabalan avatar Mar 05 '24 16:03 nsivabalan

One suspicion I have: when there are two sequences of log blocks, both with a similar number of blocks, we have to choose one of them. But we can't deterministically say which sequence is the good one, so we choose one arbitrarily, and there is a chance we chose the wrong sequence. Still, I am curious how that could end up in data loss.

nsivabalan avatar Mar 05 '24 16:03 nsivabalan

So to summarize the issue: for the file group of interest, without Spark speculative execution, only one log file (the one ending with 24780) should have been added. But with speculative execution, we found two log files (one ending with 24780 and one ending with 25536) were added.

The compaction plan shows two log files.

From the reader side, Hudi checks for log block sequences, and since both sequences have just 1 log file, Hudi chooses one of them; there is no determinism here.

But I am curious how the 2nd log file (the one ending with 25536) was fully formed without any corruption.

Let me also illustrate with a simple example.

FG1 base file: has 10k records.

Commit2: adds 5k new inserts. In the happy path, this commit would have added a new log file with 5k inserts, and the next compaction would produce a parquet file with 15k records.

Now let's simulate the issue reported in this ticket. Due to Spark speculative execution, two log files were added: one contains 5k records while the other (the 2nd log file) contains 2k records. From the screenshots shared, both log files have the same log version number. (So it can't be a scenario where a single append handle wrote two log files.)

Let's call them logfile1 and logfile2. Compaction will include both log files in the plan.

And with our fix in https://github.com/apache/hudi/pull/9545, the log record reader does prune some log blocks. In this case, if both log blocks are deemed valid data blocks (not corrupted), there is a chance our log record reader chose to process just logfile2 and dropped logfile1. (But if logfile2 was deemed corrupted, Hudi would have chosen logfile1.)

Assuming the 2nd log file is properly formed, this would end up producing a parquet file with 12k records, causing the data loss.

But as I called out, this is possible only if logfile2 is not corrupted. And if logfile2 is not corrupted, it should contain all 5k records, just like logfile1. If it only contains a subset of the records in logfile1, we are not sure how we could have ended up in that state.
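To make the arithmetic in this example concrete, here is a tiny Python simulation of the scenario (purely illustrative; the counts follow the example above, and the selection step is a simplified stand-in for the reader's pruning logic):

```python
# Simulate the example: a base file with 10k records, and two log files
# written for the same commit by speculative execution. If the reader
# prunes one sequence and happens to keep the shorter one, records are lost.

base_records = 10_000
logfile1 = 5_000   # the complete log file (all inserts of commit2)
logfile2 = 2_000   # the partial file left behind by the killed speculative task

# Correct outcome: exactly one complete copy of the commit is merged.
expected_after_compaction = base_records + logfile1   # 15,000

# Buggy outcome: the reader non-deterministically keeps logfile2 instead.
actual_after_compaction = base_records + logfile2     # 12,000

print(expected_after_compaction - actual_after_compaction)  # records lost
```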

nsivabalan avatar Mar 05 '24 17:03 nsivabalan

Sorry for all the follow-up questions. Can you tell us what storage scheme you are using? Partial write failures should not happen with S3 or other cloud stores; if it had been HDFS, we could see partial write failures. We are trying to gauge whether the 2nd log file was properly formed, or whether it was corrupted by a partial write failure.

If you have a backup of the data, let us know. We can share a tool that can spit out info about the log files (valid log blocks, number of valid records, etc.), which might help with our triaging.

nsivabalan avatar Mar 06 '24 00:03 nsivabalan

Hey, I wrote a tool that can help us spit out some meta info about our log blocks and records. Here is the branch: https://github.com/nsivabalan/hudi/tree/printAllVersionsOfRecordTool

Can you help us run the tool and share the output with us?

It's a spark-submit command, and it is going to log some info about the log files we are interested in.

Sample command:

```shell
./bin/spark-submit \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --class org.apache.hudi.utilities.PrintRecordsTool \
  PATH_TO_BUNDLE/hudi-utilities-bundle_2.12-0.15.0-SNAPSHOT.jar \
  --props /tmp/props.in \
  --base-path /tmp/hudi_trips_mor/ \
  --partition-path asia/india/chennai \
  --file-id c3ef010f-61ae-4aa3-a033-25b278da17c6-0 \
  --base-instant-time 20240302002723362 \
  --print-log-blocks-info
```

where `/tmp/props.in` contains:

```properties
hoodie.datasource.write.recordkey.field=uuid
hoodie.datasource.write.partitionpath.field=partitionpath
hoodie.datasource.write.precombine.field=ts
```
Ensure you set the right values for the partition path, file ID, and base instant time. This should help with our triaging.

nsivabalan avatar Mar 06 '24 01:03 nsivabalan

Thank you very much for following up on the issue and providing feedback. I will use the tool you provided to obtain some meta info about our log blocks and records, and will get back to you later~

Ytimetravel avatar Mar 06 '24 02:03 Ytimetravel

@nsivabalan @Ytimetravel Here is another data loss case, caused by a whole-stage retry.

There are 4 cases in which a task retries:

  • The task is slow, and a speculative copy of the task is launched
  • The task failed and is retried
  • The stage failed and is retried
  • The executor failed, and its tasks are retried

For the third case, where the stage is retried, the task attempt number might be reset to its original value. Using the attempt number alone to identify a block is not enough to handle this case; comparing block sizes per attempt number might lead to wrong results. We might need to use both stageAttemptNumber and attemptNumber to identify it, or find another solution. WDYT?
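A quick sketch of the identification problem described above (illustrative Python, not Hudi's actual log block header fields): keying blocks by the task attempt number alone collides across a stage retry, while a (stageAttemptNumber, attemptNumber) pair stays unique:

```python
# With the task attempt number alone, a stage retry (attempt numbers reset
# to 0) makes two distinct writes look identical to the reader:
writes = [
    {"stage_attempt": 0, "task_attempt": 0, "blocks": 1},  # original stage
    {"stage_attempt": 1, "task_attempt": 0, "blocks": 1},  # stage retried
]

by_attempt = {w["task_attempt"] for w in writes}
assert len(by_attempt) == 1   # collision: both writes map to attempt 0

by_stage_and_attempt = {(w["stage_attempt"], w["task_attempt"]) for w in writes}
assert len(by_stage_and_attempt) == 2   # unique across stage retries
```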

beyond1920 avatar Mar 24 '24 04:03 beyond1920

@Ytimetravel Did you get a chance to run that tool and collect meta info about the log blocks and records?

ad1happy2go avatar Apr 01 '24 12:04 ad1happy2go

@ad1happy2go I'm sorry, I only just saw your message. I didn't succeed in using that tool.

Ytimetravel avatar Apr 15 '24 06:04 Ytimetravel

https://github.com/apache/hudi/pull/10922

nsivabalan avatar Sep 13 '24 00:09 nsivabalan

Can we close out the issue?

nsivabalan avatar Sep 13 '24 00:09 nsivabalan