[SUPPORT] Properties file corruption caused by write failure
Describe the problem you faced
Dear community,
Recently I discovered a case where a write failure can cause the hoodie.properties file to become corrupted.
State at the failure site: the corrupted file causes other write tasks to fail.
The process in which this situation occurs is as follows:
- Executing the commit triggers the maybeDeleteMetadataTable process (if needed).
- An exception occurs during that process, causing the properties file write to fail.
  File status: properties corrupted (len=0), properties_backup intact.
- A rollback is then triggered.
- Since the table version cannot be read correctly at this point, an upgrade from version 0 to version 6 is triggered.
  File status: properties corrupted (len=0), properties_backup removed.
- An attempt is made to create a new properties_backup file.
I think that when performing recoverIfNeeded we should not only check whether the hoodie.properties file exists; we need more information to confirm that the file is actually valid, rather than skipping the file handling and deleting the backup file outright. Any suggestions?
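To illustrate the idea, here is a minimal sketch of the kind of validity check I have in mind, written against plain Hadoop FileSystem and java.util.Properties. The class, method names, and required-key choice are hypothetical; this is not Hudi's actual recoverIfNeeded code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical sketch: before trusting hoodie.properties and deleting the
 * backup, verify the file is non-empty and parses into a config that carries
 * the keys recovery depends on.
 */
public class PropertiesRecoverySketch {

  /** Returns true only if the properties file looks structurally valid. */
  static boolean isPropertiesFileValid(FileSystem fs, Path propsPath) throws IOException {
    if (!fs.exists(propsPath) || fs.getFileStatus(propsPath).getLen() == 0) {
      return false; // the zero-length file seen in this issue fails here
    }
    Properties props = new Properties();
    try (InputStream in = fs.open(propsPath)) {
      props.load(in);
    }
    // The upgrade path failed on a missing hoodie.table.name, so require the
    // keys that downstream code cannot live without before declaring success.
    return props.containsKey("hoodie.table.name")
        && props.containsKey("hoodie.table.version");
  }

  /** Recover from the backup unless the primary file is provably intact. */
  static void recoverIfNeeded(FileSystem fs, Path propsPath, Path backupPath) throws IOException {
    if (isPropertiesFileValid(fs, propsPath)) {
      // Only now is it safe to discard the backup.
      fs.delete(backupPath, false);
    } else if (fs.exists(backupPath)) {
      // Restore the primary file from the backup instead of deleting the backup.
      fs.delete(propsPath, false);
      FileUtil.copy(fs, backupPath, fs, propsPath, false, fs.getConf());
    }
  }
}
```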
Environment Description
- Hudi version : 0.14.0
- Spark version : 2.4
- Hadoop version : 2.6
- Storage (HDFS/S3/GCS..) : HDFS
Stacktrace

```
Caused by: org.apache.hudi.exception.HoodieException: Error updating table configs.
  at org.apache.hudi.internal.DataSourceInternalWriterHelper.commit(DataSourceInternalWriterHelper.java:91)
  at org.apache.hudi.internal.HoodieDataSourceInternalWriter.commit(HoodieDataSourceInternalWriter.java:91)
  at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:76)
  ... 69 more
  Suppressed: java.lang.IllegalArgumentException: hoodie.table.name property needs to be specified
    at org.apache.hudi.common.table.HoodieTableConfig.generateChecksum(HoodieTableConfig.java:523)
    at org.apache.hudi.common.table.HoodieTableConfig.getOrderedPropertiesWithTableChecksum(HoodieTableConfig.java:321)
    at org.apache.hudi.common.table.HoodieTableConfig.storeProperties(HoodieTableConfig.java:339)
    at org.apache.hudi.common.table.HoodieTableConfig.modify(HoodieTableConfig.java:438)
    at org.apache.hudi.common.table.HoodieTableConfig.delete(HoodieTableConfig.java:481)
    at org.apache.hudi.table.upgrade.UpgradeDowngrade.run(UpgradeDowngrade.java:151)
    at org.apache.hudi.client.BaseHoodieWriteClient.tryUpgrade(BaseHoodieWriteClient.java:1399)
    at org.apache.hudi.client.BaseHoodieWriteClient.doInitTable(BaseHoodieWriteClient.java:1255)
    at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1296)
    at org.apache.hudi.client.BaseHoodieWriteClient.rollback(BaseHoodieWriteClient.java:769)
    at org.apache.hudi.internal.DataSourceInternalWriterHelper.abort(DataSourceInternalWriterHelper.java:99)
    at org.apache.hudi.internal.HoodieDataSourceInternalWriter.abort(HoodieDataSourceInternalWriter.java:96)
    at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:82)
    ... 69 more
Caused by: org.apache.hudi.exception.HoodieIOException: Error updating table configs.
  at org.apache.hudi.common.table.HoodieTableConfig.modify(HoodieTableConfig.java:466)
  at org.apache.hudi.common.table.HoodieTableConfig.update(HoodieTableConfig.java:475)
  at org.apache.hudi.common.table.HoodieTableConfig.setMetadataPartitionState(HoodieTableConfig.java:816)
  at org.apache.hudi.common.table.HoodieTableConfig.clearMetadataPartitions(HoodieTableConfig.java:847)
  at org.apache.hudi.metadata.HoodieTableMetadataUtil.deleteMetadataTable(HoodieTableMetadataUtil.java:1396)
  at org.apache.hudi.metadata.HoodieTableMetadataUtil.deleteMetadataTable(HoodieTableMetadataUtil.java:275)
  at org.apache.hudi.table.HoodieTable.maybeDeleteMetadataTable(HoodieTable.java:995)
  at org.apache.hudi.table.HoodieSparkTable.getMetadataWriter(HoodieSparkTable.java:116)
  at org.apache.hudi.table.HoodieTable.getMetadataWriter(HoodieTable.java:947)
  at org.apache.hudi.client.BaseHoodieWriteClient.writeTableMetadata(BaseHoodieWriteClient.java:359)
  at org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:285)
  at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:236)
  at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:211)
  at org.apache.hudi.internal.DataSourceInternalWriterHelper.commit(DataSourceInternalWriterHelper.java:88)
  ... 71 more
Caused by: java.io.InterruptedIOException: Interrupted while waiting for data to be acknowledged by pipeline
  at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:3520)
  at org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:3498)
  at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:3690)
  at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:3625)
  at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:80)
  at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:115)
  at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:80)
  at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:115)
  at org.apache.hudi.common.fs.SizeAwareFSDataOutputStream.close(SizeAwareFSDataOutputStream.java:75)
  at org.apache.hudi.common.table.HoodieTableConfig.modify(HoodieTableConfig.java:449)
  ... 84 more
```
The update to the properties file should be atomic, and we already do that for HoodieTableConfig.modify, but it just throws for the writer if any exception happens; the reader would still work by reading the backup file.
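To make the backup-then-rewrite flow concrete, a simplified sketch is below. It is not the real HoodieTableConfig.modify; the class and method names are illustrative, and it only shows why an interrupted rewrite leaves the primary file truncated while the backup is still intact.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.Properties;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

/**
 * Simplified sketch of a backup-then-rewrite update: the backup is written
 * first, so even if the rewrite of the primary file is interrupted mid-flush
 * (as in this issue), a reader can still fall back to the backup copy.
 */
public class BackupThenRewriteSketch {

  static void modify(FileSystem fs, Path props, Path backup, Properties updated) throws IOException {
    // 1. Snapshot the current primary file into the backup location.
    FileUtil.copy(fs, props, fs, backup, false, fs.getConf());

    // 2. Rewrite the primary file. If the stream is interrupted here, the
    //    primary may be left truncated (len=0) while the backup stays intact.
    try (OutputStream out = fs.create(props, true)) {
      updated.store(out, "updated table config");
    }

    // 3. Only after a successful rewrite is the backup dropped. The bug report
    //    is about a later code path dropping it without step 2 having succeeded.
    fs.delete(backup, false);
  }
}
```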
> we need more information to confirm that the file is actually valid, rather than skipping the file handling and deleting the backup file outright.
+1 for this, we need to strengthen the handling of properties file exceptions for the invoker.
@danny0405 My current understanding is as follows:
- The properties_backup is a copy of the original properties file.
- The expected outcome is that the original properties file is identical to properties_backup. Can we check whether the original properties file is error-free by comparing file sizes?
> Can we check whether the original properties file is error-free by comparing file sizes?
We have a checksum in the properties file.
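For illustration, a validity check based on that checksum could look roughly like the sketch below. The property keys and the CRC32 formula used here are assumptions made for the example; the real logic is in HoodieTableConfig.generateChecksum, which shows up in the stack trace above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import java.util.zip.CRC32;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Illustrative sketch of checksum-based validation; the checksum formula is
 * assumed for the example, not copied from Hudi.
 */
public class ChecksumValidationSketch {

  /** Returns true only if the stored checksum matches the recomputed one. */
  static boolean isChecksumValid(FileSystem fs, Path propsPath) throws IOException {
    Properties props = new Properties();
    try (InputStream in = fs.open(propsPath)) {
      props.load(in);
    }
    String stored = props.getProperty("hoodie.table.checksum");
    String tableName = props.getProperty("hoodie.table.name");
    if (stored == null || tableName == null) {
      // A zero-length or truncated file fails here; no size comparison needed.
      return false;
    }
    // Assumed formula for illustration: CRC32 over the table name.
    CRC32 crc = new CRC32();
    crc.update(tableName.getBytes(StandardCharsets.UTF_8));
    return String.valueOf(crc.getValue()).equals(stored);
  }
}
```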
@danny0405 Sounds good. Can I optimize the decision-making process here?
Sure, would be glad to review your fix.
@Ytimetravel Did you get a chance to work on this? Do we have a JIRA for it?
Sorry, I am not sure I fully understand how exactly we get into the corrupt state.
From what I see, createMetaClient(true) fails. But if we chase the chain of calls, it ends up at https://github.com/apache/hudi/blob/3a57591152065ddb317c5fe67bab8163730f1e73/hudi-common/src/main/java/org/apache/hudi/common/util/ConfigUtils.java#L541
which actually accounts for reading from either the backup or the original properties file.
Can you help me understand a bit more?
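For reference, the read-with-fallback behavior referenced above works roughly like this simplified sketch (not the actual ConfigUtils code; the names and the validation step are illustrative):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Rough sketch of read-with-fallback: try the primary hoodie.properties first
 * and fall back to the backup if the primary is missing or unreadable.
 */
public class FallbackReadSketch {

  static Properties readTableConfig(FileSystem fs, Path props, Path backup) throws IOException {
    try {
      return load(fs, props);
    } catch (IOException primaryFailure) {
      // Reader-side fallback: a corrupted primary does not break reads as long
      // as the backup still exists, which is why deleting the backup while the
      // primary is zero-length (the scenario in this issue) is so damaging.
      return load(fs, backup);
    }
  }

  private static Properties load(FileSystem fs, Path path) throws IOException {
    Properties p = new Properties();
    try (InputStream in = fs.open(path)) {
      p.load(in);
    }
    if (p.isEmpty()) {
      throw new IOException("Empty table config at " + path);
    }
    return p;
  }
}
```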
Hi @Ytimetravel
Is there any update on this issue?