
[SUPPORT] Properties file corruption caused by write failure

Open Ytimetravel opened this issue 1 year ago • 3 comments

Describe the problem you faced

Dear community, I recently discovered a case where a write failure can leave the hoodie.properties file corrupted. Problem state: (screenshot in the original issue). This then causes other write tasks to fail. The process in which this situation occurs is as follows:

  1. Executing the commit triggers the maybeDeleteMetadataTable process (if needed). (screenshot)

  2. An exception occurs during the following process, causing the hoodie.properties write to fail. (screenshots)

File status: properties corrupted (len=0); properties_backup intact.

  3. The failure then triggers a rollback. (screenshots)

  4. Since the table version cannot be correctly read at this point, an upgrade from version 0 to 6 is triggered. (screenshots)

File status: properties corrupted (len=0); properties_backup removed.

  5. The upgrade then attempts to create a new properties_backup file. (screenshots)

I think that we should not only check whether the hoodie.properties file exists when performing recoverIfNeeded; we need more information to ensure that the hoodie.properties file is correct, rather than directly skipping file processing and deleting the backup file. Any suggestions? A rough sketch of this idea is shown below.
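For illustration only, a minimal sketch of what a stricter recoverIfNeeded could look like. The helper names are hypothetical and this is not the actual HoodieTableConfig code; I am assuming hoodie.table.name and hoodie.table.checksum are mandatory entries that a complete file must contain.

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class PropertiesRecoverySketch {

  // Assumed mandatory keys; a truncated write (len=0) would lose both.
  private static final String TABLE_NAME_KEY = "hoodie.table.name";
  private static final String TABLE_CHECKSUM_KEY = "hoodie.table.checksum";

  // Hypothetical validity check: existence alone is not enough.
  static boolean isPropertiesFileValid(FileSystem fs, Path propsPath) throws IOException {
    if (!fs.exists(propsPath) || fs.getFileStatus(propsPath).getLen() == 0) {
      return false; // the zero-length file observed in this issue
    }
    Properties props = new Properties();
    try (InputStream in = fs.open(propsPath)) {
      props.load(in);
    }
    return props.containsKey(TABLE_NAME_KEY) && props.containsKey(TABLE_CHECKSUM_KEY);
  }

  // Sketch of a safer recovery flow: only drop the backup once the original is verified.
  static void recoverIfNeeded(FileSystem fs, Path propsPath, Path backupPath) throws IOException {
    if (isPropertiesFileValid(fs, propsPath)) {
      fs.delete(backupPath, false);   // original is intact, backup no longer needed
    } else if (fs.exists(backupPath)) {
      fs.delete(propsPath, false);    // discard the corrupt/empty original
      // Restore from the backup; copy (not rename) so the backup survives a crash here.
      FileUtil.copy(fs, backupPath, fs, propsPath, false, fs.getConf());
    }
  }
}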

Environment Description

  • Hudi version : 0.14.0

  • Spark version : 2.4

  • Hadoop version : 2.6

  • Storage (HDFS/S3/GCS..) : HDFS

Stacktrace

Caused by: org.apache.hudi.exception.HoodieException: Error updating table configs.
    at org.apache.hudi.internal.DataSourceInternalWriterHelper.commit(DataSourceInternalWriterHelper.java:91)
    at org.apache.hudi.internal.HoodieDataSourceInternalWriter.commit(HoodieDataSourceInternalWriter.java:91)
    at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:76)
    ... 69 more
    Suppressed: java.lang.IllegalArgumentException: hoodie.table.name property needs to be specified
        at org.apache.hudi.common.table.HoodieTableConfig.generateChecksum(HoodieTableConfig.java:523)
        at org.apache.hudi.common.table.HoodieTableConfig.getOrderedPropertiesWithTableChecksum(HoodieTableConfig.java:321)
        at org.apache.hudi.common.table.HoodieTableConfig.storeProperties(HoodieTableConfig.java:339)
        at org.apache.hudi.common.table.HoodieTableConfig.modify(HoodieTableConfig.java:438)
        at org.apache.hudi.common.table.HoodieTableConfig.delete(HoodieTableConfig.java:481)
        at org.apache.hudi.table.upgrade.UpgradeDowngrade.run(UpgradeDowngrade.java:151)
        at org.apache.hudi.client.BaseHoodieWriteClient.tryUpgrade(BaseHoodieWriteClient.java:1399)
        at org.apache.hudi.client.BaseHoodieWriteClient.doInitTable(BaseHoodieWriteClient.java:1255)
        at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1296)
        at org.apache.hudi.client.BaseHoodieWriteClient.rollback(BaseHoodieWriteClient.java:769)
        at org.apache.hudi.internal.DataSourceInternalWriterHelper.abort(DataSourceInternalWriterHelper.java:99)
        at org.apache.hudi.internal.HoodieDataSourceInternalWriter.abort(HoodieDataSourceInternalWriter.java:96)
        at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:82)
        ... 69 more
Caused by: org.apache.hudi.exception.HoodieIOException: Error updating table configs.
    at org.apache.hudi.common.table.HoodieTableConfig.modify(HoodieTableConfig.java:466)
    at org.apache.hudi.common.table.HoodieTableConfig.update(HoodieTableConfig.java:475)
    at org.apache.hudi.common.table.HoodieTableConfig.setMetadataPartitionState(HoodieTableConfig.java:816)
    at org.apache.hudi.common.table.HoodieTableConfig.clearMetadataPartitions(HoodieTableConfig.java:847)
    at org.apache.hudi.metadata.HoodieTableMetadataUtil.deleteMetadataTable(HoodieTableMetadataUtil.java:1396)
    at org.apache.hudi.metadata.HoodieTableMetadataUtil.deleteMetadataTable(HoodieTableMetadataUtil.java:275)
    at org.apache.hudi.table.HoodieTable.maybeDeleteMetadataTable(HoodieTable.java:995)
    at org.apache.hudi.table.HoodieSparkTable.getMetadataWriter(HoodieSparkTable.java:116)
    at org.apache.hudi.table.HoodieTable.getMetadataWriter(HoodieTable.java:947)
    at org.apache.hudi.client.BaseHoodieWriteClient.writeTableMetadata(BaseHoodieWriteClient.java:359)
    at org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:285)
    at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:236)
    at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:211)
    at org.apache.hudi.internal.DataSourceInternalWriterHelper.commit(DataSourceInternalWriterHelper.java:88)
    ... 71 more
Caused by: java.io.InterruptedIOException: Interrupted while waiting for data to be acknowledged by pipeline
    at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:3520)
    at org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:3498)
    at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:3690)
    at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:3625)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:80)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:115)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:80)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:115)
    at org.apache.hudi.common.fs.SizeAwareFSDataOutputStream.close(SizeAwareFSDataOutputStream.java:75)
    at org.apache.hudi.common.table.HoodieTableConfig.modify(HoodieTableConfig.java:449)
    ... 84 more

Ytimetravel avatar Aug 27 '24 09:08 Ytimetravel

The update to the properties file should be atomic, and we already do that for HoodieTableConfig.modify, but it just throws for the writer if any exception happens; the reader would still work by reading the backup file.
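For reference, a simplified sketch of that backup-then-rewrite pattern, not the actual HoodieTableConfig.modify implementation; file paths and the helper method are illustrative only.

import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class BackupThenRewriteSketch {

  static void updateProperties(FileSystem fs, Path propsPath, Path backupPath,
                               Properties updated) throws IOException {
    // 1. Snapshot the current file as a backup before touching the original.
    FileUtil.copy(fs, propsPath, fs, backupPath, false, fs.getConf());
    // 2. Rewrite the original. If this step is interrupted (e.g. the
    //    InterruptedIOException in the stack trace above), the original can be
    //    left truncated/empty, which is exactly the state reported in this issue.
    try (FSDataOutputStream out = fs.create(propsPath, true)) {
      updated.store(out, "updated table config");
    }
    // 3. Only after a successful rewrite is the backup removed, so a reader can
    //    always fall back to the backup while the writer is between steps.
    fs.delete(backupPath, false);
  }
}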

we need more information to ensure that the hoodie.properties file is correct, rather than directly skipping file processing and deleting the backup file.

+1 for this, we need to strengthen the handling of the properties file exception for the invoker.

danny0405 avatar Aug 28 '24 01:08 danny0405

@danny0405 My current understanding is as follows:

  1. The properties_backup is a copy of the original properties.
  2. The expected outcome is that the original properties should be the same as properties_backup. Can we check whether the original properties file is error-free by comparing file sizes?

Ytimetravel avatar Aug 29 '24 06:08 Ytimetravel

Can we check whether the original properties file is error-free by comparing file sizes?

We have a checksum in the properties file.
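For example, something along these lines. This is a rough sketch only: I am assuming the checksum is stored under hoodie.table.checksum and derived from the table name (the stack trace above does show generateChecksum requires hoodie.table.name), but the exact key and algorithm in HoodieTableConfig may differ.

import java.nio.charset.StandardCharsets;
import java.util.Properties;
import java.util.zip.CRC32;

public class ChecksumValidationSketch {

  // Assumption: the checksum is derived from the table name, which would explain
  // why generateChecksum in the stack trace fails when hoodie.table.name is missing.
  static long generateChecksum(Properties props) {
    String tableName = props.getProperty("hoodie.table.name");
    if (tableName == null) {
      throw new IllegalArgumentException("hoodie.table.name property needs to be specified");
    }
    CRC32 crc = new CRC32();
    crc.update(tableName.getBytes(StandardCharsets.UTF_8));
    return crc.getValue();
  }

  // A file that is present but truncated fails this check, unlike a bare existence test
  // or a file-size comparison.
  static boolean isValid(Properties props) {
    String stored = props.getProperty("hoodie.table.checksum"); // assumed key name
    if (stored == null) {
      return false;
    }
    try {
      return Long.parseLong(stored) == generateChecksum(props);
    } catch (NumberFormatException | IllegalArgumentException e) {
      return false;
    }
  }
}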

danny0405 avatar Aug 29 '24 08:08 danny0405

@danny0405 Sounds good. Can I optimize the decision-making process here?

Ytimetravel avatar Aug 30 '24 03:08 Ytimetravel

Sure, would be glad to review your fix.

danny0405 avatar Aug 31 '24 00:08 danny0405

@Ytimetravel Did you get a chance to work on this? Do we have a JIRA for it?

ad1happy2go avatar Oct 22 '24 15:10 ad1happy2go

Sorry, I am not sure I fully understand how exactly we got into the corrupt state.

From what I see, createMetaClient(true) fails. But if we chase the chain of calls, it ends up at https://github.com/apache/hudi/blob/3a57591152065ddb317c5fe67bab8163730f1e73/hudi-common/src/main/java/org/apache/hudi/common/util/ConfigUtils.java#L541

which actually accounts for reading from either the backup or the original properties file.
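Roughly, that fallback amounts to something like the following. This is a simplified sketch, not the actual ConfigUtils code, and the empty-file check at the end is my own addition to illustrate why a main file that merely exists may still be unusable.

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FallbackReadSketch {

  // Try the main properties file first; fall back to the backup if it is unusable.
  static Properties readTableConfig(FileSystem fs, Path propsPath, Path backupPath) throws IOException {
    try {
      return load(fs, propsPath);
    } catch (IOException e) {
      // e.g. the main file was deleted or truncated by a failed writer
      return load(fs, backupPath);
    }
  }

  private static Properties load(FileSystem fs, Path path) throws IOException {
    Properties props = new Properties();
    try (InputStream in = fs.open(path)) {
      props.load(in);
    }
    if (props.isEmpty()) {
      // A zero-length file loads "successfully" as an empty Properties object,
      // so without an explicit check the fallback would never be taken.
      throw new IOException("Empty properties file: " + path);
    }
    return props;
  }
}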

Can you help me understand a bit more?

nsivabalan avatar Nov 08 '24 23:11 nsivabalan

Hi @Ytimetravel

Is there any update on this issue?

rangareddy avatar Feb 10 '25 05:02 rangareddy