
S3 Multipart uploads committed on exception when using compression via context manager

Limess opened this issue 3 years ago

Problem description

  • We're trying to upload gzipped objects to S3 using multipart upload.
  • We use a context manager to open a gzipped output file in S3 and write to it. When a write throws an exception, we expect the multipart upload to be aborted.
  • Instead, the multipart upload is committed.

This happens because, when a context manager wraps a compressed file, the MultipartWriter's __exit__ method is never called: https://github.com/RaRe-Technologies/smart_open/blob/59d3a6079b523c030c78a3935622cf94405ce052/smart_open/s3.py#L938-L942
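
For reference, the linked __exit__ does roughly the following (a paraphrase of the linked lines, not a verbatim copy):

def __exit__(self, exc_type, exc_val, exc_tb):
    if exc_type is not None:
        self.terminate()  # abort the multipart upload
    else:
        self.close()      # complete the multipart upload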

We've seen two behaviours result from this issue:

  1. Empty or incomplete multipart uploads are written. This appears to be the expected consequence of __exit__ not being called.
  2. Corrupted gzipped files are written. We're less sure how this occurred.

Steps/code to reproduce the problem

import smart_open

# The original snippet used a project-local helper, s3.s3_uri(output_bucket, ...);
# a literal URI is substituted here so the snippet runs standalone.
with smart_open.open(
    "s3://output-bucket/test-file.gz",
    "wb",
) as fout:
    for data in ["valid-data", Exception("An exception")]:
        if isinstance(data, Exception):
            raise data
        fout.write((data + "\n").encode("utf-8"))
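
After the exception, the object should not exist in the bucket (the upload should have been aborted), yet it does. One way to verify, assuming boto3 is configured with credentials for the same bucket:

import boto3

# Succeeds only because the upload was committed despite the exception
# inside the `with` block above.
s3 = boto3.client("s3")
head = s3.head_object(Bucket="output-bucket", Key="test-file.gz")
print(head["ContentLength"])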

Versions

>>> import platform, sys, smart_open
>>> print(platform.platform())
Linux-5.10.60.1-microsoft-standard-WSL2-x86_64-with-debian-bullseye-sid
>>> print("Python", sys.version)
Python 3.7.11 (default, Jan 13 2022, 14:48:06)
[GCC 9.3.0]
>>> print("smart_open", smart_open.__version__)
smart_open 5.2.1

Limess (Jan 17 '22)

Thank you for reporting this. Are you interested in making a PR?

mpenkov (Jan 17 '22)

Not right now, sorry.

Limess (Jan 17 '22)

Any update? If not, I can try to fix it.

lociko (Jan 03 '23)

No updates on this. Yes, please go ahead.

mpenkov (Jan 03 '23)

I think the problem is here: https://github.com/RaRe-Technologies/smart_open/blob/18128f15e50c5ceda065a6d7d041eab6cb0933ad/smart_open/compression.py#L140

We should make the result of compression_wrapper a context manager too, so that it cleans up the underlying stream correctly.
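
Something along these lines, perhaps (a rough, untested sketch of the idea; the class and attribute names are illustrative, not smart_open's actual API):

class _CompressedWriter:
    def __init__(self, outer, inner):
        self._outer = outer  # e.g. the gzip.GzipFile
        self._inner = inner  # e.g. the s3.MultipartWriter

    def __enter__(self):
        return self._outer

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type is None:
            self._outer.close()  # flush the remaining compressed bytes
        # Delegate to the inner stream so it can terminate() the multipart
        # upload on error, or complete it on success.
        return self._inner.__exit__(exc_type, exc_val, exc_tb)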

Right?

mpenkov (Jan 03 '23)

I think that's not enough. We already have tweak_close:


def tweak_close(outer, inner):
    """Ensure that closing the `outer` stream closes the `inner` stream as well.

    Use this when your compression library's `close` method does not
    automatically close the underlying filestream.  See
    https://github.com/RaRe-Technologies/smart_open/issues/630 for an
    explanation why that is a problem for smart_open.
    """
    outer_close = outer.close

    def close_both(*args):
        nonlocal inner
        try:
            outer_close()
        finally:
            if inner:
                inner, fp = None, inner
                fp.close()

    outer.close = close_both

That closes the underlying streams, but in the case of the S3 MultipartWriter we want to terminate (abort) the upload if an error is raised, as in this piece of code: https://github.com/RaRe-Technologies/smart_open/blob/59d3a6079b523c030c78a3935622cf94405ce052/smart_open/s3.py#L938-L942

So as I see it, when an exception is thrown, tweak_close is called, but MultipartWriter just finishes the upload as if nothing had happened. I think that's the main problem: when an exception occurs, we should call MultipartWriter.terminate to abort the upload.

lociko (Jan 04 '23)

That sounds right. Is tweak_close able to determine when it is being called as part of exception handling?

If not, we may have to look at the stack.

mpenkov (Jan 04 '23)

@mpenkov I'm not sure how to determine whether it is being called as part of exception handling or not, so I decided to use sys.exc_info().
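
Roughly like this (a sketch of the approach, not the actual patch):

import sys

def tweak_close(outer, inner):
    outer_close = outer.close

    def close_both(*args):
        nonlocal inner
        try:
            outer_close()
        finally:
            if inner:
                inner, fp = None, inner
                # If we are unwinding an exception, abort the upload instead
                # of committing it; terminate() exists on writers like the
                # S3 MultipartWriter but not on ordinary file objects.
                if sys.exc_info()[0] is not None and hasattr(fp, "terminate"):
                    fp.terminate()
                else:
                    fp.close()

    outer.close = close_both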

lociko (Jan 04 '23)

A similar issue exists with Azure: https://github.com/RaRe-Technologies/smart_open/pull/783

ddelange (Oct 05 '23)

> @mpenkov I'm not sure how to determine whether it is being called as part of exception handling or not, so I decided to use sys.exc_info().

Using sys.exc_info() will cause erroneous behaviour: if you want to write a log file to S3 while handling some other upstream exception, you're inside an exception context, yet you still want the multipart upload of the log file to succeed.
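
A minimal demonstration of the failure mode (the log-writing scenario is hypothetical):

import sys

try:
    raise ValueError("upstream failure")
except ValueError:
    # We are handling an unrelated error, so sys.exc_info() is already
    # non-empty here, and the exc_info-based check would wrongly abort
    # the upload even though the write itself succeeds:
    print(sys.exc_info()[0])  # <class 'ValueError'>
    # with smart_open.open("s3://bucket/error.log", "wb") as fout:
    #     fout.write(b"error details")  # this upload should still succeed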

ddelange (Oct 05 '23)