S3 multipart uploads to text streams are committed on exception
Problem description
If I smart_open an S3 file in text mode, write some data, and then raise an exception, I expect the multipart upload to be terminated, and no key to be created. That's what I see for binary-mode smart_open streams.
This is almost the same issue as https://github.com/RaRe-Technologies/smart_open/issues/684, but in the context of using text streams, instead of compression. It suggests there might be some broader architectural fix, although I'm not sure what it would be.
Steps/code to reproduce the problem
Given this code:
with smart_open.open("s3://some_bucket/path/to/key.txt", "w") as fh:
    fh.write("hello\n")
    raise AssertionError()
I expect key.txt not to be created. Instead, I see it created with the content "hello\n".
For the almost-identical binary version:
with smart_open.open("s3://some_bucket/path/to/key.txt", "wb") as fh:
    fh.write(b"hello\n")
    raise AssertionError()
it works as intended.
The proximate problem is that while s3.MultipartWriter implements cancellation on exception, when we open in text mode, we wrap the object in a TextIOWrapper.
TextIOWrapper inherits from IOBase, and IOBase.__exit__ unconditionally calls self.close(), which completes the write and creates the partial object.
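The behavior is easy to reproduce without S3 at all. The following toy sketch (FakeUploader is a hypothetical stand-in, not smart_open code) wraps a fake writer in TextIOWrapper and shows that IOBase.__exit__ closes it, and thereby "completes" the write, even when the block exits via an exception:

```python
import io

class FakeUploader(io.BytesIO):
    """Hypothetical stand-in for a multipart upload stream."""
    completed = False

    def close(self):
        if not self.closed:
            self.completed = True  # simulates completing the multipart upload
        super().close()

raw = FakeUploader()
try:
    with io.TextIOWrapper(raw) as fh:
        fh.write("hello\n")
        raise AssertionError()
except AssertionError:
    pass

print(raw.completed)  # → True: IOBase.__exit__ closed the stream despite the exception
```

Even though the with-block exits via AssertionError, IOBase.__exit__ still calls close(), which flushes the buffered text and "completes" the upload.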
Versions
Please provide the output of:
import platform, sys, smart_open
print(platform.platform())
print("Python", sys.version)
print("smart_open", smart_open.__version__)
Linux-5.4.209-116.367.amzn2.x86_64-x86_64-with-glibc2.35
Python 3.10.8 (main, Feb 21 2023, 22:10:24) [GCC 11.3.0]
smart_open 6.3.0
Checklist
Before you create the issue, please make sure you have:
- [x] Described the problem clearly
- [x] Provided a minimal reproducible example, including any required data
- [x] Provided the version numbers of the relevant software
This issue affects SinglepartWriter as well.
The problem boils down to how garbage collection of both SinglepartWriter and MultipartWriter is handled. Both writers inherit from io.BufferedIOBase, which calls self.close() during garbage collection, when the __del__() method is invoked.
If the writer were properly marked as closed on exception, then the GC-triggered __del__() would not invoke close() on the writer. Unfortunately, the current implementation of closed looks like this:
@property
def closed(self):
    return self._buf is None
Hence it will always return False in the case of an unhandled exception, so close() will still run. I suppose an easy way to fix it would be to set self._buf = None at the end of the terminate() method of both writers.
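A toy model of that fix (illustrative names, not the actual smart_open code): if terminate() sets self._buf = None, the closed property reports True afterwards, so a later GC-driven close() becomes a no-op and no upload is completed:

```python
import io

class ToyWriter(io.BufferedIOBase):
    """Toy model of the smart_open writer pattern (names are illustrative)."""
    def __init__(self):
        self._buf = io.BytesIO()
        self.uploaded = False

    @property
    def closed(self):
        return self._buf is None

    def terminate(self):
        # proposed fix: mark the writer closed so a later GC-driven
        # close() (via IOBase.__del__) becomes a no-op
        self._buf = None

    def close(self):
        if self.closed:
            return
        self.uploaded = True  # simulates completing the upload
        self._buf = None

w = ToyWriter()
w.terminate()      # exception path: abort the upload
w.close()          # what IOBase.__del__ would do at GC time
print(w.uploaded)  # → False: terminate() prevented the spurious upload
```

Without the self._buf = None line in terminate(), the final close() would see closed == False and "complete" the upload anyway.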
The proximate problem is that while s3.MultipartWriter implements cancellation on exception, when we open in text mode, we wrap the object in a TextIOWrapper. TextIOWrapper inherits from IOBase, and IOBase.__exit__ unconditionally calls self.close(), which completes the write and creates the partial object.
TextIOWrapper was patched in v7.0.0, see https://github.com/piskvorky/smart_open/commit/42682e7eab45b01938b8964a7acec5dcd461b31e (#783).
So using the with-statement will now also abort the upload on exception when in text mode. I think that fixes this issue.
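For illustration, here is one way to sketch that idea with stand-in classes (this is not the actual patched code; FakeWriter and its flags are hypothetical): the wrapper's __exit__ terminates the underlying writer on exception before closing, so nothing is completed:

```python
import io

class FakeWriter(io.BytesIO):
    """Hypothetical stand-in for an S3 multipart writer."""
    terminated = False
    completed = False

    def terminate(self):
        # abort the in-flight upload; a later close() must not complete it
        self.terminated = True

    def close(self):
        if not self.closed and not self.terminated:
            self.completed = True  # simulates completing the upload
        super().close()

class TerminatingTextIOWrapper(io.TextIOWrapper):
    """Sketch of the idea behind the v7.0.0 patch (not the actual code):
    on exception, terminate the underlying writer before closing."""
    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type is not None:
            self.buffer.terminate()
        self.close()

raw = FakeWriter()
try:
    with TerminatingTextIOWrapper(raw) as fh:
        fh.write("hello\n")
        raise AssertionError()
except AssertionError:
    pass

print(raw.terminated, raw.completed)  # → True False: aborted, no key created
```

The key design point is that __exit__ sees the exception type and can route to terminate() instead of letting IOBase.__exit__ call close() unconditionally.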
@ddelange regardless of the patches for TextIOWrapper implemented in v7.0.0, close() will still be called on both writers when they are garbage collected, resulting in unwanted writes to S3.
@donsokolone Can you think of a way we can avoid the unwanted writes?
@mpenkov After doing some more troubleshooting, I've created a separate issue, #819, as the problem I mentioned affects only SinglepartWriter at the moment. The fix is quite simple and I will try to PR it today.