
ParquetFileWriter in invalid state after network outage in end(metadata)

Open fpetersen-gl opened this issue 5 months ago • 4 comments

Describe the bug, including details regarding any error messages, version, and platform.

Related to apache/iceberg#13508, possibly to #1971.

We're using parquet-java 1.15.2 as part of iceberg, uploading data to S3. The data is flushed to storage in configurable intervals.

Description

If a short network interruption occurs while a file is being written and uploaded, the ParquetFileWriter has already transitioned into the state ENDED even though the file has not been written successfully. A subsequent attempt to close the (Iceberg) writer then fails with an exception because the ParquetFileWriter is in an invalid state.

Stacktrace:

java.io.UncheckedIOException: Failed to flush row group
	at org.apache.iceberg.parquet.ParquetWriter.flushRowGroup(ParquetWriter.java:225)
	at org.apache.iceberg.parquet.ParquetWriter.close(ParquetWriter.java:257)
	at org.apache.iceberg.io.DataWriter.close(DataWriter.java:82)
	at org.apache.iceberg.io.RollingFileWriter.closeCurrentWriter(RollingFileWriter.java:126)
	at org.apache.iceberg.io.RollingFileWriter.close(RollingFileWriter.java:156)
	at org.apache.iceberg.io.RollingDataWriter.close(RollingDataWriter.java:32)
	at org.apache.iceberg.io.FanoutWriter.closeWriters(FanoutWriter.java:82)
	at org.apache.iceberg.io.FanoutWriter.close(FanoutWriter.java:74)
	at org.apache.iceberg.io.FanoutDataWriter.close(FanoutDataWriter.java:31)
	at org.apache.iceberg.parquet.TestParquetWriter.testParquetWriterWithFailingIO(TestParquetWriter.java:113)
[... Junit/JDK classes ...]
Caused by: java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: ENDED
	at org.apache.parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:250)
	at org.apache.parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:224)
	at org.apache.parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:586)
	at org.apache.iceberg.parquet.ParquetWriter.flushRowGroup(ParquetWriter.java:215)
	... 100 more
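To illustrate the failure mode, here is a minimal sketch (not the actual parquet-java code; class name, fields, and the simulated outage are invented for illustration) of a writer whose state transition happens before the I/O that can fail. After the first failed `end()`, the writer is stuck in ENDED and a retried close is rejected, just as in the stack trace:

```java
import java.io.IOException;

// Sketch only: models the state machine that produces the
// "Current state: ENDED" error. The transition to ENDED happens
// at the top of end(), before the footer I/O that can fail.
public class WriterStateSketch {

    enum State { STARTED, BLOCK, ENDED }

    private State state = State.STARTED;
    private boolean outagePending;    // simulates a transient network outage

    WriterStateSketch(boolean outagePending) {
        this.outagePending = outagePending;
    }

    void startBlock() throws IOException {
        if (state != State.STARTED) {
            throw new IOException(
                "The file being written is in an invalid state. Current state: " + state);
        }
        state = State.BLOCK;
    }

    void endBlock() {
        state = State.STARTED;
    }

    void end() throws IOException {
        state = State.ENDED;                     // transition happens first ...
        if (outagePending) {
            outagePending = false;
            throw new IOException("network outage while writing footer"); // ... then I/O fails
        }
    }
}
```

A first `end()` throws because of the simulated outage, but the state is already ENDED; a retrying caller that goes through `startBlock()` again (as Iceberg's `flushRowGroup` does) then hits the invalid-state exception.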

Possible Solution

The internal state field is updated at the very beginning of each method, which is too early: if the rest of the method throws an exception, not all of its logic has executed, yet the writer looks as if the transition succeeded. Simply moving the transition to the end of the method is not enough either, because a retry mechanism may call the method multiple times and the method body would then execute more than once, which must also be avoided. One possibility would be to introduce additional internal states. This would allow tracking the writer's progress in more detail, which in turn makes it more resilient to retries.
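The "additional internal states" idea could be sketched roughly like this (a hypothetical design, not the parquet-java implementation; the class, the ENDING state, and `writeFooter` are all invented for illustration): an intermediate ENDING state is entered before the footer I/O and ENDED only after it succeeds, so a retry can distinguish "not yet done" from "already done":

```java
import java.io.IOException;

// Sketch of a retry-tolerant end(): ENDING marks the footer write as
// in progress, ENDED is reached only after the I/O succeeds, and a
// repeated end() on an ENDED writer is a no-op instead of an error.
public class ResilientEndSketch {

    enum State { STARTED, ENDING, ENDED }

    private State state = State.STARTED;
    private int remainingFailures;    // simulates transient I/O failures

    ResilientEndSketch(int remainingFailures) {
        this.remainingFailures = remainingFailures;
    }

    private void writeFooter() throws IOException {
        if (remainingFailures > 0) {
            remainingFailures--;
            throw new IOException("network outage while writing footer");
        }
    }

    void end() throws IOException {
        if (state == State.ENDED) {
            return;                          // idempotent: footer already durable
        }
        if (state != State.STARTED && state != State.ENDING) {
            throw new IOException("Invalid state: " + state);
        }
        state = State.ENDING;                // in progress, footer not yet written
        writeFooter();                       // may throw; state stays ENDING on failure
        state = State.ENDED;                 // only reached after successful I/O
    }

    State state() {
        return state;
    }
}
```

With this shape, a caller that retries `end()` after a transient failure finds the writer in ENDING and can safely attempt the footer write again, while a duplicate call after success does nothing. Whether re-running the footer write is actually safe depends on the underlying output stream, which is the delicate part the real fix has to address.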

Component(s)

Core

fpetersen-gl avatar Jul 17 '25 08:07 fpetersen-gl

Thanks for reporting this! Do you want to create a PR to fix this?

wgtmac avatar Jul 21 '25 05:07 wgtmac

Hi @wgtmac ! I'm afraid that I don't know too many details of parquet's internals. Especially handling the internal state of the writer seems to be a delicate endeavor to me. If there was someone with more knowledge of the internals, I'd rather leave this to that person.

fpetersen-gl avatar Jul 21 '25 06:07 fpetersen-gl

Have raised a patch here https://github.com/apache/parquet-java/pull/3269

ArnavBalyan avatar Aug 17 '25 08:08 ArnavBalyan

Shouldn't Iceberg recover in this case? How would we safely recover from this exception?

Fokko avatar Aug 27 '25 13:08 Fokko