
[SUPPORT] Clean action failure triggers an exception while trying to check whether metadata is a table

Open shubhamn21 opened this issue 1 year ago • 7 comments

Describe the problem you faced

The Hudi job runs fine for about an hour, then crashes: a warning about a clean action failure is followed by the exception `org.apache.hudi.exception.HoodieIOException: Could not check if s3a://xyz-bucket/spark-warehouse/xyz_table_name/.hoodie/metadata is a valid table`.

Expected behavior

The job should keep running; the clean action should either succeed or fail gracefully without crashing the Spark application.

Environment Description

  • Hudi version : 0.13.1

  • Spark version : 3.3.0

  • Java version : 1.8

  • Storage (HDFS/S3/GCS..) : S3A bucket

  • Running on Docker? (yes/no) : Yes, Spark on kubernetes

Hudi options:

```python
hudi_options = {
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.write.operation": "insert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.update.partial.fields": "true",
    "hoodie.upsert.shuffle.parallelism": "2",
    "hoodie.insert.shuffle.parallelism": "2",
    "hoodie.index.bloom.num_entries": "60000",
    "hoodie.index.bloom.fpp": "0.000000001",
    "hoodie.clean.automatic": "true",
    "hoodie.clean.async": "true",
    "hoodie.compaction.lazy.block.read": "false",
    "hoodie.enable.data.skipping": "true",
    "hoodie.logfile.max.size": "1073741824",
    "hoodie.parquet.small.file.limit": "104857600",
    "hoodie.parquet.max.file.size": "125829120",
    "hoodie.parquet.block.size": "125829120",
    "hoodie.datasource.hive_sync.enable": "false",
    "hoodie.datasource.hive_sync.database": "xyz_staging",
    "hoodie.datasource.hive_sync.use_jdbc": "true",
    "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://test.net:10000",
    "hoodie.datasource.hive_sync.metastore.uris": "thrift://test.net:9083",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
}
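For context, a minimal sketch of how a dict like this is typically passed to a Hudi write in PySpark; `df`, the table name, and the target path below are illustrative placeholders, not taken from the issue:

```python
# Hypothetical write call using the options above; "df", the table
# name, and the S3 path are placeholders for illustration only.
(
    df.write.format("hudi")
    .options(**hudi_options)
    .option("hoodie.table.name", "xyz_table_name")
    .mode("append")
    .save("s3a://xyz-bucket/spark-warehouse/xyz_table_name")
)
```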

Stacktrace

```
23/11/17 04:00:55 WARN CleanActionExecutor: Failed to perform previous clean operation, instant: [==>20231117040045263__clean__REQUESTED]
org.apache.hudi.exception.HoodieIOException: Could not check if s3a://xyz-bucket/spark-warehouse/xyz_table_name/.hoodie/metadata is a valid table
	at org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:59)
	at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:137)
	at org.apache.hudi.common.table.HoodieTableMetaClient.newMetaClient(HoodieTableMetaClient.java:689)
	at org.apache.hudi.common.table.HoodieTableMetaClient.access$000(HoodieTableMetaClient.java:81)
	at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:770)
	at org.apache.hudi.common.table.HoodieTableMetaClient.reload(HoodieTableMetaClient.java:174)
	at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commit(SparkHoodieBackedTableMetadataWriter.java:153)
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.processAndCommit(HoodieBackedTableMetadataWriter.java:838)
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.update(HoodieBackedTableMetadataWriter.java:918)
	at org.apache.hudi.table.action.BaseActionExecutor.lambda$writeTableMetadata$1(BaseActionExecutor.java:68)
	at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
	at org.apache.hudi.table.action.BaseActionExecutor.writeTableMetadata(BaseActionExecutor.java:68)
	at org.apache.hudi.table.action.clean.CleanActionExecutor.runClean(CleanActionExecutor.java:221)
	at org.apache.hudi.table.action.clean.CleanActionExecutor.runPendingClean(CleanActionExecutor.java:187)
	at org.apache.hudi.table.action.clean.CleanActionExecutor.lambda$execute$8(CleanActionExecutor.java:256)
	at java.util.ArrayList.forEach(ArrayList.java:1259)
	at org.apache.hudi.table.action.clean.CleanActionExecutor.execute(CleanActionExecutor.java:250)
	at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.clean(HoodieSparkCopyOnWriteTable.java:263)
	at org.apache.hudi.client.BaseHoodieTableServiceClient.clean(BaseHoodieTableServiceClient.java:557)
	at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:759)
	at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:731)
	at org.apache.hudi.async.AsyncCleanerService.lambda$startService$0(AsyncCleanerService.java:55)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.InterruptedIOException: getFileStatus on s3a://xyz-bucket/spark-warehouse/xyz_table_name/.hoodie/metadata/.hoodie: com.amazonaws.AbortedException: 
	at org.apache.hadoop.fs.s3a.S3AUtils.translateInterruptedException(S3AUtils.java:395)
	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:201)
	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:175)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3799)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3688)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getFileStatus$24(S3AFileSystem.java:3556)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2337)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2356)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:3554)
	at org.apache.hudi.common.fs.HoodieWrapperFileSystem.lambda$getFileStatus$17(HoodieWrapperFileSystem.java:410)
	at org.apache.hudi.common.fs.HoodieWrapperFileSystem.executeFuncWithTimeMetrics(HoodieWrapperFileSystem.java:114)
	at org.apache.hudi.common.fs.HoodieWrapperFileSystem.getFileStatus(HoodieWrapperFileSystem.java:404)
	at org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:51)
	... 25 more
Caused by: com.amazonaws.AbortedException: 
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleInterruptedException(AmazonHttpClient.java:868)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:746)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5437)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5384)
	at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1367)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getObjectMetadata$10(S3AFileSystem.java:2545)
	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:414)
	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:377)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:2533)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:2513)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3776)
	... 36 more
Caused by: com.amazonaws.http.timers.client.SdkInterruptedException
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.checkInterrupted(AmazonHttpClient.java:923)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.checkInterrupted(AmazonHttpClient.java:909)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1103)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
	... 49 more
23/11/17 04:00:55 INFO SparkUI: Stopped Spark web UI at 
23/11/17 04:00:55 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
23/11/17 04:00:55 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
23/11/17 04:00:55 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
23/11/17 04:00:55 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
23/11/17 04:00:55 INFO MemoryStore: MemoryStore cleared
23/11/17 04:00:55 INFO BlockManager: BlockManager stopped
23/11/17 04:00:56 INFO BlockManagerMaster: BlockManagerMaster stopped
23/11/17 04:00:56 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
23/11/17 04:00:56 INFO SparkContext: Successfully stopped SparkContext
```

shubhamn21 avatar Nov 17 '23 04:11 shubhamn21

@shubhamn21 Did you try setting `fs.s3a.connection.maximum` to a higher value?
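For anyone landing here, a sketch of how those S3A settings can be raised when building the SparkSession; the values below are illustrative, not tuned recommendations:

```python
from pyspark.sql import SparkSession

# Illustrative values only: a larger S3A connection pool and thread
# count can help when many concurrent S3 calls are in flight.
spark = (
    SparkSession.builder.appName("hudi-writer")
    .config("spark.hadoop.fs.s3a.connection.maximum", "1000")
    .config("spark.hadoop.fs.s3a.threads.max", "100")
    .getOrCreate()
)
```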

ad1happy2go avatar Nov 17 '23 05:11 ad1happy2go

@shubhamn21 Were you able to resolve this issue?

ad1happy2go avatar Nov 22 '23 06:11 ad1happy2go

@ad1happy2go I did try increasing `fs.s3a.connection.maximum` to 300 and `fs.s3a.threads.max` to 30, but it still failed with the same error.

shubhamn21 avatar Nov 30 '23 19:11 shubhamn21

I also see subsequent rollback failures causing the job to fail. The job often halts after the clean-operation failure warning. I tried different S3 buckets and both single and multiple executors, but the job still cannot keep running for more than a few hours.

shubhamn21 avatar Dec 04 '23 08:12 shubhamn21

```
23/12/04 08:00:23 WARN CleanActionExecutor: Failed to perform previous clean operation, instant: [==>20231204075005981__clean__INFLIGHT]
java.lang.IllegalArgumentException
	at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:31)
```

Hi @nsivabalan, tagging you here since I saw you as an assignee on a similar issue. I am seeing the clean action warning above, which triggers subsequent failures in the job. Could this be related to S3 performance?

shubhamn21 avatar Dec 04 '23 17:12 shubhamn21

Disabling automatic cleaning by setting `hoodie.clean.automatic` to false has helped for now. I'll create a daemon/cron job that cleans the cold metadata in parallel without touching the hot metadata. Please let me know if there is a fix for this or if the root cause is known.
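As a sketch, the writer-side change amounts to the following; Hudi also ships a standalone cleaner (`org.apache.hudi.utilities.HoodieCleaner` in hudi-utilities) that an external cron job could invoke via spark-submit against the table base path:

```python
# Turn off inline/async cleaning in the writer so a separate,
# externally scheduled job can run the clean action instead.
hudi_options["hoodie.clean.automatic"] = "false"
hudi_options["hoodie.clean.async"] = "false"
```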

shubhamn21 avatar Dec 12 '23 21:12 shubhamn21

@shubhamn21 Sorry for the delay here. Did you face the same exception even after increasing the value of `fs.s3a.connection.maximum`? We also want to understand the root cause of this.

By any chance, do you have multiple writers trying to write to this table at the same time?

ad1happy2go avatar Dec 19 '23 10:12 ad1happy2go

Hi @ad1happy2go, thanks for responding. Yes, I did have multiple executors writing to the table at one point. But I recently limited my deployments to one executor, since I realized I need to set up a metastore (Hive) as a lock provider.

Can multiple writers also cause clean-action failures if we do not have Hive/DynamoDB set up?

shubhamn21 avatar Jan 02 '24 22:01 shubhamn21

@shubhamn21 Thanks for the update. Yes, having multiple writers without a lock provider can cause inconsistent behavior and create these kinds of issues.
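For completeness, a sketch of the multi-writer setup Hudi documents for optimistic concurrency control, using the Hive metastore lock provider the reporter plans to adopt; the lock database and table values below are placeholders:

```python
# Documented Hudi multi-writer settings: optimistic concurrency
# control plus a Hive metastore based lock provider. The lock
# database/table names here are placeholders.
hudi_options.update({
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider": "org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider",
    "hoodie.write.lock.hivemetastore.database": "xyz_staging",
    "hoodie.write.lock.hivemetastore.table": "xyz_table_locks",
})
```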

ad1happy2go avatar Jan 03 '24 05:01 ad1happy2go

Thank you! Closing this issue. I'll set up a Hive metastore lock provider to avoid these issues.

shubhamn21 avatar Jan 05 '24 19:01 shubhamn21