smart_open
S3 Multipart uploads committed on exception when using compression via context manager
Problem description
- We're trying to upload gzipped objects to S3, using multipart upload
- We're using a context manager to open a gzipped output file in S3 and writing to it. When the write throws an exception, we expect the multipart upload to be aborted
- Instead, the multipart upload is committed
This happens because, when a context manager wraps a compressed file, the MultipartWriter __exit__ method is never called: https://github.com/RaRe-Technologies/smart_open/blob/59d3a6079b523c030c78a3935622cf94405ce052/smart_open/s3.py#L938-L942
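For reference, the linked __exit__ does roughly this (paraphrased, the actual source may differ slightly):

    class MultipartWriter:  # paraphrased from smart_open/s3.py, see the link above
        def __exit__(self, exc_type, exc_val, exc_tb):
            if exc_type is not None:
                self.terminate()  # abort the multipart upload on error
            else:
                self.close()      # otherwise complete (commit) the upload

Because smart_open.open returns the compression wrapper rather than the MultipartWriter itself, the with statement never reaches this method.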
We've seen two resulting behaviours from this issue:
- Empty or incomplete multipart uploads are written; this seems expected when __exit__ is not called correctly
- Corrupted gzipped files are written; we're less sure how this occurred
Steps/code to reproduce the problem
import smart_open

with smart_open.open(
    s3.s3_uri(output_bucket, "test-file.gz"),  # helper building "s3://{output_bucket}/test-file.gz"
    "wb",
) as fout:
    for data in ["valid-data", Exception("An exception")]:
        if isinstance(data, Exception):
            raise data
        fout.write((data + "\n").encode("utf-8"))
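After the exception, the truncated object still ends up in the bucket. One way to confirm (a sketch assuming boto3 and the same output_bucket as above):

    import boto3

    s3_client = boto3.client("s3")
    # the object exists even though the writer raised before finishing
    head = s3_client.head_object(Bucket=output_bucket, Key="test-file.gz")
    print(head["ContentLength"])  # an empty or partial gzip was committed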
Versions
>>> import platform, sys, smart_open
>>> print(platform.platform())
Linux-5.10.60.1-microsoft-standard-WSL2-x86_64-with-debian-bullseye-sid
>>> print("Python", sys.version)
Python 3.7.11 (default, Jan 13 2022, 14:48:06)
[GCC 9.3.0]
>>> print("smart_open", smart_open.__version__)
smart_open 5.2.1
Thank you for reporting this. Are you interested in making a PR?
Not right now, sorry.
Any update? If not, I can try to fix it.
No updates on this. Yes, please go ahead.
I think the problem is here: https://github.com/RaRe-Technologies/smart_open/blob/18128f15e50c5ceda065a6d7d041eab6cb0933ad/smart_open/compression.py#L140
We should be making the result of compression_wrapper a context manager too, so it cleans up the underlying stream correctly.
Right?
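Something along these lines, purely as a sketch of the idea (hypothetical names, not smart_open's actual code):

    class CompressedContextWrapper:
        """Hypothetical wrapper: make the stream returned by compression_wrapper
        a context manager that delegates __exit__ to the underlying writer, so
        the S3 MultipartWriter still gets a chance to abort on error."""

        def __init__(self, outer, inner):
            self.outer = outer  # e.g. the gzip.GzipFile wrapping the S3 stream
            self.inner = inner  # e.g. smart_open.s3.MultipartWriter

        def __getattr__(self, name):
            # delegate write(), flush(), etc. to the compressed stream
            return getattr(self.outer, name)

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_val, exc_tb):
            if exc_type is None:
                self.outer.close()  # flush the compressor and commit the upload
                return None
            # on error, skip the compressor and let the inner writer abort
            return self.inner.__exit__(exc_type, exc_val, exc_tb)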
I think that is not enough. We already have tweak_close:
def tweak_close(outer, inner):
    """Ensure that closing the `outer` stream closes the `inner` stream as well.

    Use this when your compression library's `close` method does not
    automatically close the underlying filestream. See
    https://github.com/RaRe-Technologies/smart_open/issues/630 for an
    explanation why that is a problem for smart_open.
    """
    outer_close = outer.close

    def close_both(*args):
        nonlocal inner
        try:
            outer_close()
        finally:
            if inner:
                inner, fp = None, inner
                fp.close()

    outer.close = close_both
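For context, the gzip handler in compression.py applies it roughly like this (paraphrased, not the exact source):

    import gzip

    def _handle_gzip(file_obj, mode):
        # wrap the raw (e.g. S3) stream in a gzip stream, and make closing
        # the gzip stream close the raw stream too
        result = gzip.GzipFile(fileobj=file_obj, mode=mode)
        tweak_close(result, file_obj)
        return result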
tweak_close closes the underlying streams, but in the case of the S3 MultipartWriter we want to terminate (abort) the upload if an error is raised, like in this piece of code: https://github.com/RaRe-Technologies/smart_open/blob/59d3a6079b523c030c78a3935622cf94405ce052/smart_open/s3.py#L938-L942
So I see that when an exception is thrown, tweak_close is called, but the MultipartWriter just finishes the upload as if nothing had happened. I think that is the main problem.
When an exception occurs, we should call MultipartWriter.terminate() to abort the upload.
That sounds right. Is tweak_close able to determine when it is being called as part of exception handling?
If not, we may have to look at the stack.
@mpenkov I'm not sure how to determine whether it is being called as part of exception handling or not, so I decided to use sys.exc_info().
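Concretely, the idea was something like this inside tweak_close (a sketch, not the final patch):

    import sys

    def tweak_close(outer, inner):
        outer_close = outer.close

        def close_both(*args):
            nonlocal inner
            try:
                outer_close()
            finally:
                if inner:
                    inner, fp = None, inner
                    # sketch: if an exception is currently propagating, abort
                    # the multipart upload instead of committing it
                    if sys.exc_info()[0] is not None and hasattr(fp, "terminate"):
                        fp.terminate()
                    else:
                        fp.close()

        outer.close = close_both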
There is a similar issue with Azure: https://github.com/RaRe-Technologies/smart_open/pull/783
Using sys.exc_info() will cause erroneous behaviour: if you want to write a log file to S3 during handling of some other upstream exception, you're inside an exception context (and you still want your multipart upload of the log file to succeed).
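For example (hypothetical names throughout), the upload below succeeds, yet it is closed while another exception is being handled, so an exc_info-based check would wrongly abort it:

    import smart_open

    try:
        risky_operation()  # hypothetical upstream step that fails
    except Exception:
        # we want to persist a log of the failure; this upload itself is fine
        with smart_open.open("s3://my-bucket/failure.log.gz", "wb") as fout:  # hypothetical bucket
            fout.write(b"risky_operation failed\n")
        # inside this handler, sys.exc_info() still reports the upstream error,
        # so a sys.exc_info()-based close would wrongly terminate this upload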