[SUPPORT] /hoodie/temp Folder and contents not getting deleted

Open desaismi opened this issue 3 years ago • 6 comments

Describe the problem you faced

When writing to tables in S3 using Hudi, Hudi creates .hoodie/.temp/<commit_instant> artifacts in the table's metadata folder. After the write completes, the temp artifacts are deleted along with the .temp/ folder. For a couple of our tables, we have noticed that the temp artifacts were never deleted. We want to figure out why this occurred, and whether it is safe to manually delete the artifacts remaining from past writes.

  • hoodie.datasource.write.operation is upsert for these operations
  • Hudi 0.8.0

To Reproduce

Steps to reproduce the behavior:

Not sure. We write to the same table every 10 minutes, and we have seen this occur once for a couple of tables.

Expected behavior

We expect that the temp artifacts are deleted after each write to a table

Environment Description

  • Hudi version : 0.8.0

  • Spark version : Spark 2.4.7

  • Hive version : Hive 2.3.7

  • Hadoop version : Amazon 2.10.1

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : No

desaismi avatar Jul 08 '22 20:07 desaismi

Hudi stores marker files in the temp folder to track uncommitted data files. My question is: did those commit_instant writes complete? You can check by querying for records where _hoodie_commit_time = commit_instant, or by looking for the instant in the .hoodie/ folder.
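The timeline half of that check can be sketched in a few lines. This is a minimal illustration, not Hudi code: `instant_state` is a hypothetical helper, and it assumes the usual timeline naming in which pending files end in `.requested` or `.inflight` and any other file carrying the instant prefix marks a completed action. The input is a plain list of file names found directly under `.hoodie/`, obtained however you normally list that folder.

```python
# Hypothetical helper (not part of Hudi): classify an instant from a listing
# of the file names directly under the table's .hoodie/ folder.
# Assumption: pending timeline files end in ".requested" or ".inflight";
# any other file with the same instant prefix marks a completed action.
PENDING_SUFFIXES = (".inflight", ".requested")


def instant_state(timeline_files, instant):
    """Return 'absent', 'inflight', or 'completed' for the given instant."""
    # The instant time is the part of the file name before the first dot.
    matching = [name for name in timeline_files if name.split(".", 1)[0] == instant]
    if not matching:
        return "absent"
    if all(name.endswith(PENDING_SUFFIXES) for name in matching):
        return "inflight"
    return "completed"


# Made-up listing for illustration only.
listing = [
    "hoodie.properties",
    "20220502175145.commit",
    "20220502175145.commit.requested",
    "20220502175145.inflight",
]
print(instant_state(listing, "20220502180145"))
```

An instant reported as `absent` has no trace in the active timeline, which is the case described in the next comment.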

fengjian428 avatar Jul 09 '22 18:07 fengjian428

The instant 20220502180145 is in the .hoodie/.temp/ folder, but not in the .hoodie/ folder. I also see no records when querying _hoodie_commit_time = 20220502180145

desaismi avatar Jul 11 '22 16:07 desaismi

@fengjian428 : can you follow up here please.

nsivabalan avatar Aug 09 '22 21:08 nsivabalan

@desaismi could you try the latest version to check whether this issue still exists?

fengjian428 avatar Aug 10 '22 02:08 fengjian428

Hello @fengjian428, we have some dependencies in our data pipeline that make upgrading to the latest version non-trivial. Is this a known issue for 0.8.0?

If it's a rare intermittent issue, would it be safe to manually remove this marker file from the temp folder?

desaismi avatar Aug 11 '22 00:08 desaismi

Yeah, it should be safe to remove the marker files if there is no relevant inflight instant in the timeline.

fengjian428 avatar Aug 11 '22 03:08 fengjian428

We had a bug where compaction did not clean up the marker files, which was fixed in 0.10.0: https://github.com/apache/hudi/pull/3576. So yes, we do know of some situations where marker files were not cleaned up.

nsivabalan avatar Aug 12 '22 02:08 nsivabalan

Yes, if there are no matching commit files in /.hoodie/, you can remove the directories from the /.hoodie/.temp folder.
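As a dry run before deleting anything, the rule above can be expressed as a set difference. The sketch below is only an illustration: `removable_temp_dirs` is a hypothetical helper, and both inputs are plain name listings of `.hoodie/.temp/` and `.hoodie/` that you supply yourself. It is deliberately conservative and skips any instant that has any file in `.hoodie/`, whether completed or still inflight.

```python
# Hypothetical dry-run helper (not part of Hudi): report which directories
# under .hoodie/.temp/ have no file for the same instant in .hoodie/.
# It only computes names; it never deletes anything.
def removable_temp_dirs(temp_dirs, timeline_files):
    """Return sorted .temp/ directory names with no matching timeline file."""
    # The instant time is the part of a timeline file name before the first dot.
    timeline_instants = {name.split(".", 1)[0] for name in timeline_files}
    # Normalize directory names such as "20220502180145/".
    candidates = {d.strip("/") for d in temp_dirs}
    return sorted(candidates - timeline_instants)


# Made-up listings for illustration only.
temp_listing = ["20220502180145/", "20220502181145/"]
timeline_listing = [
    "20220502181145.commit.requested",
    "20220502181145.inflight",
]
print(removable_temp_dirs(temp_listing, timeline_listing))
```

Reviewing the returned names by hand before removing the directories keeps the cleanup a manual, reversible decision.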

nsivabalan avatar Aug 12 '22 02:08 nsivabalan

feel free to close out the issue, if you don't have any follow ups.

nsivabalan avatar Aug 12 '22 02:08 nsivabalan