[SUPPORT] Clean action failure triggers an exception while trying to check whether metadata is a table
Describe the problem you faced
The Hudi job runs fine for about an hour, then crashes after a warning about a clean action failure, subsequently raising the exception:
org.apache.hudi.exception.HoodieIOException: Could not check if s3a://xyz-bucket/spark-warehouse/xyz-table-name/.hoodie/metadata is a valid table
Expected behavior
The job keeps running, with the async clean action completing (or failing gracefully) without crashing the Spark application.
Environment Description

- Hudi version : 0.13.1
- Spark version : 3.3.0
- Java version : 1.8
- Storage (HDFS/S3/GCS..) : S3A bucket
- Running on Docker? (yes/no) : Yes, Spark on Kubernetes
Hudi options:

```python
hudi_options = {
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.write.operation": "insert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.update.partial.fields": "true",
    "hoodie.upsert.shuffle.parallelism": "2",
    "hoodie.insert.shuffle.parallelism": "2",
    "hoodie.index.bloom.num_entries": "60000",
    "hoodie.index.bloom.fpp": "0.000000001",
    "hoodie.clean.automatic": "true",
    "hoodie.clean.async": "true",
    "hoodie.compaction.lazy.block.read": "false",
    "hoodie.enable.data.skipping": "true",
    "hoodie.logfile.max.size": "1073741824",
    "hoodie.parquet.small.file.limit": "104857600",
    "hoodie.parquet.max.file.size": "125829120",
    "hoodie.parquet.block.size": "125829120",
    "hoodie.datasource.hive_sync.enable": "false",
    "hoodie.datasource.hive_sync.database": "xyz_staging",
    "hoodie.datasource.hive_sync.use_jdbc": "true",
    "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://test.net:10000",
    "hoodie.datasource.hive_sync.metastore.uris": "thrift://test.net:9083",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
}
```
Stacktrace
```
23/11/17 04:00:55 WARN CleanActionExecutor: Failed to perform previous clean operation, instant: [==>20231117040045263__clean__REQUESTED]
org.apache.hudi.exception.HoodieIOException: Could not check if s3a://xyz-bucket/spark-warehouse/xyz_table_name/.hoodie/metadata is a valid table
at org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:59)
at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:137)
at org.apache.hudi.common.table.HoodieTableMetaClient.newMetaClient(HoodieTableMetaClient.java:689)
at org.apache.hudi.common.table.HoodieTableMetaClient.access$000(HoodieTableMetaClient.java:81)
at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:770)
at org.apache.hudi.common.table.HoodieTableMetaClient.reload(HoodieTableMetaClient.java:174)
at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commit(SparkHoodieBackedTableMetadataWriter.java:153)
at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.processAndCommit(HoodieBackedTableMetadataWriter.java:838)
at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.update(HoodieBackedTableMetadataWriter.java:918)
at org.apache.hudi.table.action.BaseActionExecutor.lambda$writeTableMetadata$1(BaseActionExecutor.java:68)
at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
at org.apache.hudi.table.action.BaseActionExecutor.writeTableMetadata(BaseActionExecutor.java:68)
at org.apache.hudi.table.action.clean.CleanActionExecutor.runClean(CleanActionExecutor.java:221)
at org.apache.hudi.table.action.clean.CleanActionExecutor.runPendingClean(CleanActionExecutor.java:187)
at org.apache.hudi.table.action.clean.CleanActionExecutor.lambda$execute$8(CleanActionExecutor.java:256)
at java.util.ArrayList.forEach(ArrayList.java:1259)
at org.apache.hudi.table.action.clean.CleanActionExecutor.execute(CleanActionExecutor.java:250)
at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.clean(HoodieSparkCopyOnWriteTable.java:263)
at org.apache.hudi.client.BaseHoodieTableServiceClient.clean(BaseHoodieTableServiceClient.java:557)
at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:759)
at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:731)
at org.apache.hudi.async.AsyncCleanerService.lambda$startService$0(AsyncCleanerService.java:55)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.InterruptedIOException: getFileStatus on s3a://xyz-bucket/spark-warehouse/xyz_table_name/.hoodie/metadata/.hoodie: com.amazonaws.AbortedException:
at org.apache.hadoop.fs.s3a.S3AUtils.translateInterruptedException(S3AUtils.java:395)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:201)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:175)
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3799)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3688)
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getFileStatus$24(S3AFileSystem.java:3556)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444)
at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2337)
at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2356)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:3554)
at org.apache.hudi.common.fs.HoodieWrapperFileSystem.lambda$getFileStatus$17(HoodieWrapperFileSystem.java:410)
at org.apache.hudi.common.fs.HoodieWrapperFileSystem.executeFuncWithTimeMetrics(HoodieWrapperFileSystem.java:114)
at org.apache.hudi.common.fs.HoodieWrapperFileSystem.getFileStatus(HoodieWrapperFileSystem.java:404)
at org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:51)
... 25 more
Caused by: com.amazonaws.AbortedException:
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleInterruptedException(AmazonHttpClient.java:868)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:746)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5437)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5384)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1367)
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getObjectMetadata$10(S3AFileSystem.java:2545)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:414)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:377)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:2533)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:2513)
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3776)
... 36 more
Caused by: com.amazonaws.http.timers.client.SdkInterruptedException
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.checkInterrupted(AmazonHttpClient.java:923)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.checkInterrupted(AmazonHttpClient.java:909)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1103)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
... 49 more
23/11/17 04:00:55 INFO SparkUI: Stopped Spark web UI at
23/11/17 04:00:55 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
23/11/17 04:00:55 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
23/11/17 04:00:55 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
23/11/17 04:00:55 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
23/11/17 04:00:55 INFO MemoryStore: MemoryStore cleared
23/11/17 04:00:55 INFO BlockManager: BlockManager stopped
23/11/17 04:00:56 INFO BlockManagerMaster: BlockManagerMaster stopped
23/11/17 04:00:56 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
23/11/17 04:00:56 INFO SparkContext: Successfully stopped SparkContext
```
@shubhamn21 Did you try setting `fs.s3a.connection.maximum` to a higher value?
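For reference, a minimal sketch of raising these S3A client limits through the Spark session config (values are illustrative, not from this issue; `spark.hadoop.*` keys are passed through to the Hadoop `fs.s3a.*` settings):

```python
from pyspark.sql import SparkSession

# Illustrative values; tune to your workload.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.connection.maximum", "200")  # max pooled S3 connections
    .config("spark.hadoop.fs.s3a.threads.max", "64")          # max S3A worker threads
    .getOrCreate()
)
```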
@shubhamn21 Were you able to resolve this issue?
@ad1happy2go I did try increasing `fs.s3a.connection.maximum` to 300 and `fs.s3a.threads.max` to 30, but it still failed with the same error.
I also see subsequent rollback failures causing the job to fail. The job often halts after the clean operation failure warning. I tried different S3 buckets and both single and multiple executors, but I am still not able to keep the job running for more than a few hours.
```
23/12/04 08:00:23 WARN CleanActionExecutor: Failed to perform previous clean operation, instant: [==>20231204075005981__clean__INFLIGHT]
java.lang.IllegalArgumentException
	at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:31)
```
Hi @nsivabalan, tagging you here as I saw you as an assignee on a similar issue. I am seeing the above clean action warning, which prompts subsequent failures in the job. Could this have something to do with S3 performance?
Disabling the clean action by setting `hoodie.clean.automatic` to `false` has helped for now (see the sketch below). I'll be creating a daemon/cron job that cleans the cold metadata in parallel but does not touch the hot metadata.
Please let me know if there is a fix for this or the root cause is known.
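A minimal sketch of the writer-side change (assuming the `hudi_options` dict shown above; disabling the async path as well is an assumption, since both were enabled in the original options):

```python
# Sketch: stop the writer from running the cleaner inline or async;
# cleaning is instead handled by a separate scheduled job.
hudi_options.update({
    "hoodie.clean.automatic": "false",
    "hoodie.clean.async": "false",
})
```

The out-of-band job could then invoke Hudi's standalone cleaner utility (`org.apache.hudi.utilities.HoodieCleaner`) against the table base path on its own schedule.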
@shubhamn21 Sorry for the delay here. After increasing the value of `fs.s3a.connection.maximum`, did you face the same exception? We also want to understand the root cause for this.
By any chance, do you have a situation where multiple writers are trying to write to this table?
Hi @ad1happy2go, thanks for responding. Yes, I did have multiple executors writing to the table at one point. But I recently limited my deployments to one executor, since I realized I need to set up a metastore (Hive) as a lock provider.
Can multiple writers cause clean action failures too if we do not have Hive/DynamoDB set up as a lock provider?
@shubhamn21 Thanks for the update. Yes, having multiple writers without a lock provider can cause inconsistent behavior and create these kinds of issues. A typical multi-writer lock configuration is sketched below.
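For reference, a hedged sketch of a multi-writer setup with a Hive Metastore lock provider (key names should be verified against the Hudi 0.13.x docs; the database/table values are the placeholders from this issue):

```python
# Sketch: enable optimistic concurrency control with a Hive Metastore-based
# lock provider. Values below are illustrative.
hudi_options.update({
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",  # lazy-clean failed writes under OCC
    "hoodie.write.lock.provider":
        "org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider",
    "hoodie.write.lock.hivemetastore.database": "xyz_staging",
    "hoodie.write.lock.hivemetastore.table": "xyz_table_name",
})
```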
Thank you! Closing this issue - I'll set up Hive Metastore to avoid these issues.