smart_open icon indicating copy to clipboard operation
smart_open copied to clipboard

S3 multipart uploads to text streams are committed on exception

Open nelhage opened this issue 3 years ago • 5 comments

Problem description

If I smart_open an S3 file in text mode, write some data, and then raise an exception, I expect the multipart upload to be terminated, and no key to be created. That's what I see for binary-mode smart_open streams.

This is almost the same issue as https://github.com/RaRe-Technologies/smart_open/issues/684, but in the context of using text streams, instead of compression. It suggests there might be some broader architectural fix, although I'm not sure what it would be.

Steps/code to reproduce the problem

Given this code:

with smart_open.open("s3://some_bucket/path/to/key.txt", "w") as fh:
  fh.write("hello\n")
  raise AssertionError()

I expect key.txt not to be created. Instead, I see it created with the content "hello\n".

For the almost-identical binary version:

with smart_open.open("s3://some_bucket/path/to/key.txt", "wb") as fh:
  fh.write(b"hello\n")
  raise AssertionError()

it works as intended.

The proximate problem is that while s3.MultipartWriter implements cancellation on exception, when we open in text mode, we wrap the object in a TextIOWrapper.

TextIOWrapper inherits from IOBase, and IOBase.__exit__ unconditionally calls self.close(), which completes the write and creates the partial object.

Versions

Please provide the output of:

import platform, sys, smart_open
print(platform.platform())
print("Python", sys.version)
print("smart_open", smart_open.__version__)
Linux-5.4.209-116.367.amzn2.x86_64-x86_64-with-glibc2.35
Python 3.10.8 (main, Feb 21 2023, 22:10:24) [GCC 11.3.0]
smart_open 6.3.0

Checklist

Before you create the issue, please make sure you have:

  • [x] Described the problem clearly
  • [x] Provided a minimal reproducible example, including any required data
  • [x] Provided the version numbers of the relevant software

nelhage avatar Mar 08 '23 21:03 nelhage

This issue affects SinglePartWriter as well.

The problem boils down to how garbage collection of both SinglepartWriter and MultipartWriter is being done. Both writers inherit from io.BufferedIOBase which calls self.close() on garbage collection when __del__() descriptor is invoked.

If writer was properly marked as closed on exception then calling del() by GC would not invoke close on the writer. Unfortunately current implementation of closed look like this:

@property
def closed(self):
    return self._buf is None

Hence it will always return True in case of unhandled exception. I suppose an easy way to fix it would be to set self._buf = None at the end of terminate() method of both writers.

donsokolone avatar Apr 19 '24 17:04 donsokolone

The proximate problem is that while s3.MultipartWriter implements cancellation on exception, when we open in text mode, we wrap the object in a TextIOWrapper.

TextIOWrapper inherits from IOBase, and IOBase.__exit__ unconditionally calls self.close(), which completes the write and creates the partial object.

TextIoWrapper was patched in v7.0.0 ref https://github.com/piskvorky/smart_open/commit/42682e7eab45b01938b8964a7acec5dcd461b31e (#783)

So using the with-statement will now also abort upload on exception when in text mode.

I think that fixes this issue

ddelange avatar Apr 19 '24 18:04 ddelange

@ddelange regardless of patches for TextIOWrapper implemented in v7.0.0 close() will be called on both writers when garbage collected, resulting in unwanted writes to S3.

donsokolone avatar Apr 19 '24 19:04 donsokolone

@donsokolone Can you think of a way we can avoid the unwanted writes?

mpenkov avatar Apr 20 '24 03:04 mpenkov

@mpenkov After doing some more troubleshooting I've created separate issue #819 as the problem I've mentioned affects only SinglepartWriter at the moment. Fix for that is quite simple and I will try to PR it today.

donsokolone avatar Apr 20 '24 06:04 donsokolone