hadoop-connectors
hadoop-connectors copied to clipboard
The connector doesn't provide atomic file creations if a thread is interrupted
We have encountered a scenario where the close method is not called on output streams of GCS connector, however the file itself is being flushed, so it becomes visible. The issue happens when the thread that does writes is interrupted. Apparently, the thread interruption triggers close method in the output stream. The expected behavior here is that the file will never be visible in case of any thread interruption. We expect this behavior to be fixed in GCS, so the atomic file creation is guaranteed by the output stream close. Please find the repro and stacktrace below.
Example code to reproduce the issue:
Path path = new Path("gs://...");
FileSystem fs = path.getFileSystem(conf);
OutputStream out = fs.create(path);
Thread.currentThread().interrupt();
try {
out.write("foo".getBytes());
out.close();
} catch(IOException e) {
Thread.interrupted();
e.printStackTrace();
Thread.sleep(1000); // Wait for the other thread to complete the request
System.out.println(fs.exists(path)); // will output `true`
System.out.println("content: " + IOUtils.toString(fs.open(path))); // The content is empty
}
Stacktrace of the issue:
at java.io.PipedOutputStream.close(PipedOutputStream.java:175)
at java.nio.channels.Channels$WritableByteChannelImpl.implCloseChannel(Channels.java:469)
at java.nio.channels.spi.AbstractInterruptibleChannel$1.interrupt(AbstractInterruptibleChannel.java:165)
- locked <0xc2f> (a java.lang.Object)
at java.nio.channels.spi.AbstractInterruptibleChannel.begin(AbstractInterruptibleChannel.java:173)
at java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:457)
- locked <0xc3b> (a java.lang.Object)
at com.google.cloud.hadoop.util.BaseAbstractGoogleAsyncWriteChannel.write(BaseAbstractGoogleAsyncWriteChannel.java:136)
- locked <0xc3c> (a com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$2)
at java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
at java.nio.channels.Channels.writeFully(Channels.java:101)
at java.nio.channels.Channels.access$000(Channels.java:61)
at java.nio.channels.Channels$1.write(Channels.java:174)
@medb Please check this out, thanks!
this is consistent with hadoop FS spec, which says when create(path) returns, exists(path) must hold
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#FSDataOutputStream_create.28Path.2C_....29
the fact that s3 doesn't do this is a divergence there.
for atomicity, write to a different location and rename the file. that is atomic in GCS (but not s3 though the s3a connector)