botocore icon indicating copy to clipboard operation
botocore copied to clipboard

Upload cancellation causes long retry loop

Open andy-slac opened this issue 1 year ago • 4 comments

Describe the bug

I'm not exactly sure if this is botocore or s3transfer issue, more probably some inconsistency between the two. What I observe is that when there is an asynchronous exception (Control-C or similar "random" exception that we use to terminate the application) during S3 upload, the upload does not stop immediately, but instead it hangs for longer than a minute. This happens with multithreaded upload, single-threaded upload does not show the same issue.

Regression Issue

  • [ ] Select this option if this issue appears to be a regression.

Expected Behavior

I would expect that Control-C or other random exception should stop upload almost immediately.

Current Behavior

Running with debug logging enabled I think I see what happens:

  • When exception is raised, S3Transfer context manager __exit__ method is called which calls TransferManager.__exit__()
  • Latter calls TransferManager._shutdown with exception type FatalError or CancelledError depending on the type of original exception.
  • That in turn calls transfer coordinator managers' cancel() method passing the exception type to them.
  • Later when client tries to read something from input file, the exception in transfer coordinator is checked and if not Null then that exception is raised.
  • I think the intention is that this exception is supposed to cancel upload by that thread, instead it looks like the retry handler intercepts that exception and tries to restart the upload in each thread. This causes the same issue - exception is re-raised and intercepted again. It gives up after multiple iterations and that takes time

Here is an example of traceback dumped by boto from one of the threads when it goes into a retry loop:

2024-09-30 19:21:01,572 ThreadPoolExecutor-0_7 DEBUG botocore.httpsession (<...>/lib/python3.11/site-packages/botocore/httpsession.py:508) :: Exception received when sending urllib3 HTTP request
Traceback (most recent call last):
  File "<...>/lib/python3.11/site-packages/botocore/httpsession.py", line 464, in send
    urllib_response = conn.urlopen(
                      ^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
                       ^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 96, in request
    rval = super().request(method, url, body, headers, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/urllib3/connection.py", line 244, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "<...>/lib/python3.11/http/client.py", line 1303, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "<...>/lib/python3.11/http/client.py", line 1349, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "<...>/lib/python3.11/http/client.py", line 1298, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 130, in _send_output
    self._handle_expect_response(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 176, in _handle_expect_response
    self._send_message_body(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 209, in _send_message_body
    self.send(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 223, in send
    return super().send(str)
           ^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/http/client.py", line 1009, in send
    datablock = data.read(self.blocksize)
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/s3transfer/utils.py", line 520, in read
    data = self._fileobj.read(amount_to_read)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/s3transfer/upload.py", line 90, in read
    raise self._transfer_coordinator.exception
  File "<...>/lib/python3.11/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
                       ^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 96, in request
    rval = super().request(method, url, body, headers, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/urllib3/connection.py", line 244, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "<...>/lib/python3.11/http/client.py", line 1303, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "<...>/lib/python3.11/http/client.py", line 1349, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "<...>/lib/python3.11/http/client.py", line 1298, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 130, in _send_output
    self._handle_expect_response(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 176, in _handle_expect_response
    self._send_message_body(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 209, in _send_message_body
    self.send(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 223, in send
    return super().send(str)
           ^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/http/client.py", line 1009, in send
    datablock = data.read(self.blocksize)
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/s3transfer/utils.py", line 520, in read
    data = self._fileobj.read(amount_to_read)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/s3transfer/upload.py", line 90, in read
    raise self._transfer_coordinator.exception
  File "<...>/lib/python3.11/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
                       ^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 96, in request
    rval = super().request(method, url, body, headers, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/urllib3/connection.py", line 244, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "<...>/lib/python3.11/http/client.py", line 1303, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "<...>/lib/python3.11/http/client.py", line 1349, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "<...>/lib/python3.11/http/client.py", line 1298, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 130, in _send_output
    self._handle_expect_response(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 176, in _handle_expect_response
    self._send_message_body(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 209, in _send_message_body
    self.send(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 223, in send
    return super().send(str)
           ^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/http/client.py", line 1009, in send
    datablock = data.read(self.blocksize)
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/s3transfer/utils.py", line 520, in read
    data = self._fileobj.read(amount_to_read)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/s3transfer/upload.py", line 90, in read
    raise self._transfer_coordinator.exception
s3transfer.exceptions.FatalError: Received signal SIGINT.

Reproduction Steps

Here is the code that I used to debug the issue, the code is actually a trivial call to file_upload method. The trick is to trigger upload cancellation by hitting Ctrl-C at a correct moment. In our setup I used ~100MB file which gave me a reasonable time window during the upload. If you are lucky that after hitting Ctrl-C you should see enormous amount of tracebacks printed like the one I included above.

#!/usr/bin/env python

import argparse
import logging
import signal
import traceback

import botocore
import boto3

# Configure logging to use DEBUG, this shows tracebacks from execptions
# handled by boto internally.
fmt = "%(asctime)s %(threadName)s %(levelname)s %(name)s (%(pathname)s:%(lineno)d) :: %(message)s"
logging.basicConfig(level=logging.DEBUG, format=fmt)

_log = logging.getLogger()

# In our case we try to gracefully terminate transfers when a signal is sent
# by the scheduler which executes all processing jobs. This is done by raising
# special exception which inherits BaseException so that it's not swalowed by
# regular `except Exception`, i.e. it behaves more like KeyboardInterrupt.
class GracefulShutdownInterrupt(BaseException):
    pass


def _signal_handler(signum, stack_frame):
    signame = signal.Signals(signum).name
    _log.info(f"************* Signal {signame} detected, cleaning up and shutting down. ************\n")
    traceback.print_stack(stack_frame)
    raise GracefulShutdownInterrupt(f"Received signal {signame}.")


def main():

    parser = argparse.ArgumentParser()
    parser.add_argument("file", help="File to upload to S3.")
    parser.add_argument("bucket", help="Name S3 bucket.")
    parser.add_argument("path", help="Name of the object in the bucket.")
    parser.add_argument("-p", "--profile", default=None, help="AWS profile name.")
    parser.add_argument("--endpoint-url", default=None, help="Endpoint URL.")
    parser.add_argument("--access-key-id", default=None, help="Access key ID.")
    parser.add_argument("--secret-access-key", default=None, help="Secret access key.")
    args = parser.parse_args()

    session = boto3.Session(profile_name=args.profile)

    # This configuration is what we use by default in our setup.
    config = botocore.config.Config(read_timeout=180, retries={"mode": "adaptive", "max_attempts": 10})
    client = session.client(
        "s3",
        endpoint_url=args.endpoint_url,
        aws_access_key_id=args.access_key_id,
        aws_secret_access_key=args.secret_access_key,
        config=config,
    )

    # Installing handler is optional, if KeyboardInterrupt is raised instead of
    # GracefulShutdownInterrupt then the exception raised by coordinators will
    # be different (CancelledError vs. FatalError).
    signal.signal(signal.SIGINT, _signal_handler)
    signal.signal(signal.SIGTERM, _signal_handler)
    client.upload_file(args.file, args.bucket, args.path)


if __name__ == "__main__":
    main()

Possible Solution

Maybe retry handler should understand that certain kinds of exceptions should result in immediate failure?

Additional Information/Context

Package versions:

boto3                         1.34.131
botocore                      1.34.131
s3transfer                    0.10.2
urllib3                       1.26.19

SDK version used

1.34.131

Environment details (OS name and version, etc.)

Red Hat Enterprise Linux release 8.6 (Ootpa), kernel 4.18.0-372.32.1.el8_6.x86_64

andy-slac avatar Oct 01 '24 02:10 andy-slac

I can attach full logs from actual execution of that script with and without signal handler, each about 10MB in size.

andy-slac avatar Oct 01 '24 03:10 andy-slac

Thanks for reaching out. Can you share the full logs (with any sensitive info redacted) by adding boto3.set_stream_logger('') to your script? When you say "The trick is to trigger upload cancellation by hitting Ctrl-C at a correct moment", can you elaborate on that a bit more? Is this issue only reproducible when cancelling the upload at a certain point in the process?

I'll link the Boto3 docs on retries and timeout for Config for reference, although I see you already have those applied in your snippet. I'm wondering if specifying a lower timeout and max_attempts would reduce the delay you're seeing.

tim-finnigan avatar Oct 02 '24 19:10 tim-finnigan

Attaching the log with debug output: log-exc5-handler.log.gz

Reducing the delay is not my goal, the retries are configured reasonably in our case and there is no reason to change those. What I want is for transfer to stop instantly without any delay after the signal is received and exception is raised, there is no reason to retry in that case.

Hitting Ctrl-C at a correct moment means that moment should happen when upload has started but before it has finished, i.e. when the threads are actively pushing the data. That window may be short depending on the size of the file. I used ~100MB file in my case which gave me relatively good chance.

andy-slac avatar Oct 03 '24 17:10 andy-slac

Thanks for following up. It looks like https://github.com/boto/s3transfer/issues/249 may be related. We can plan to bring this issue up with the team for further review.

tim-finnigan avatar Oct 07 '24 21:10 tim-finnigan

Hello and thanks for brining this up. I was able to reproduce the issue and have seen that it retries after hitting Ctrl+C. I will present this to the team and see if FatalError is considered to be retriable or not. As of now, the reason why it retries is because looking at the httpsession.py, it falls to the Exception HTTPClientError and HTTPClientError is considered as a "Transient errors/exceptions" - which means that it is re-tried.

adev-code avatar Aug 22 '25 23:08 adev-code

PR and merged: https://github.com/boto/botocore/pull/3552 . Please update to the next version of Botocore. Thank you.

adev-code avatar Sep 09 '25 22:09 adev-code

This issue is now closed. Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one.

github-actions[bot] avatar Sep 09 '25 22:09 github-actions[bot]