Using botocore_session parameter of s3.iter_bucket fails
Problem description
I am attempting to list all files in an S3 bucket. As per the documentation, I am providing smart_open.s3.iter_bucket with a pre-made botocore_session. I have also tried turning concurrency off, but that only produced a different exception.
Expected result:
Works like it does when not providing the botocore_session parameter; the keys and contents of the files on S3 are iterated.
Actual result:
With smart_open.concurrency._MULTIPROCESSING = True
I receive the exception:
AttributeError: Can't pickle local object 'lazy_call.<locals>._handler'
With smart_open.concurrency._MULTIPROCESSING = False
I receive the exception:
RuntimeError: Cannot inject class attribute "upload_file", attribute already exists in class dict.
From my initial investigation it seems like the creation of many new boto3 sessions / resources from the same botocore session in smart_open.s3._download_key is the issue here, which might be a limitation/issue with the boto3 library.
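To make the suspected mechanism concrete, here is a minimal stdlib sketch (the `Fake*` classes and handler are simplified stand-ins, not boto3's actual code): each boto3 session registers event handlers on the shared botocore event emitter, and the handler that injects `upload_file` into the client class refuses to overwrite an existing attribute. Two boto3 sessions on one botocore session means the handler fires twice on the same class dict:

```python
class EventEmitter:
    def __init__(self):
        self.handlers = []

    def register(self, handler):
        self.handlers.append(handler)

    def emit(self, class_attributes):
        for handler in self.handlers:
            handler(class_attributes)


def inject_s3_transfer_methods(class_attributes):
    # Mirrors the guard in boto3.utils.inject_attribute: refuse to overwrite.
    if 'upload_file' in class_attributes:
        raise RuntimeError(
            'Cannot inject class attribute "upload_file", '
            'attribute already exists in class dict.')
    class_attributes['upload_file'] = lambda self, *args: None


class FakeBotocoreSession:
    def __init__(self):
        self.emitter = EventEmitter()

    def create_client(self):
        # botocore builds the client class via an event; every registered
        # handler runs against the same attribute dict.
        attrs = {}
        self.emitter.emit(attrs)
        return type('S3', (), attrs)()


class FakeBoto3Session:
    def __init__(self, botocore_session):
        # Each boto3 session registers its handlers on the *shared* emitter.
        botocore_session.emitter.register(inject_s3_transfer_methods)
        self._botocore = botocore_session

    def client(self):
        return self._botocore.create_client()


bc = FakeBotocoreSession()
first = FakeBoto3Session(bc)   # registers the injection handler once
print(hasattr(first.client(), 'upload_file'))  # True: one handler, one injection

second = FakeBoto3Session(bc)  # registers the same handler a second time
try:
    first.client()             # the handler now fires twice on one class dict
except RuntimeError as err:
    print(err)
```

This reproduces the same RuntimeError message without touching AWS at all, which matches the "many boto3 sessions from one botocore session" theory.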
Steps/code to reproduce the problem
```python
import botocore.session
from smart_open import s3, concurrency

bucket_name = '[A_REAL_BUCKET]'
concurrency._MULTIPROCESSING = False

my_session = botocore.session.Session()
bucket_contents = list(s3.iter_bucket(
    bucket_name=bucket_name,
    botocore_session=my_session,
    workers=1,
))
```
Running the above results in the below traceback:
```
/home/tyto/.cache/pypoetry/virtualenvs/s3-collector-combiner-UhjNKLw0-py3.9/bin/python /home/tyto/.config/JetBrains/PyCharm2021.2/scratches/scratch.py
11:52:28.649 INFO:smart_open.concurrency: creating concurrent futures pool with 1 workers
11:52:28.662 INFO:botocore.credentials: Found credentials in shared credentials file: ~/.aws/credentials
Traceback (most recent call last):
  File "/home/tyto/.config/JetBrains/PyCharm2021.2/scratches/scratch.py", line 18, in <module>
    bucket_contents = list(s3.iter_bucket(bucket_name=bucket_name,
  File "/home/tyto/.cache/pypoetry/virtualenvs/s3-collector-combiner-UhjNKLw0-py3.9/lib/python3.9/site-packages/smart_open/s3.py", line 1192, in iter_bucket
    for key_no, (key, content) in enumerate(result_iterator):
  File "/home/tyto/.cache/pypoetry/virtualenvs/s3-collector-combiner-UhjNKLw0-py3.9/lib/python3.9/site-packages/smart_open/concurrency.py", line 58, in imap_unordered
    yield future.result()
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 438, in result
    return self.__get_result()
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
  File "/usr/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/tyto/.cache/pypoetry/virtualenvs/s3-collector-combiner-UhjNKLw0-py3.9/lib/python3.9/site-packages/smart_open/s3.py", line 1246, in _download_key
    s3 = session.resource('s3')
  File "/home/tyto/.cache/pypoetry/virtualenvs/s3-collector-combiner-UhjNKLw0-py3.9/lib/python3.9/site-packages/boto3/session.py", line 396, in resource
    client = self.client(
  File "/home/tyto/.cache/pypoetry/virtualenvs/s3-collector-combiner-UhjNKLw0-py3.9/lib/python3.9/site-packages/boto3/session.py", line 270, in client
    return self._session.create_client(
  File "/home/tyto/.cache/pypoetry/virtualenvs/s3-collector-combiner-UhjNKLw0-py3.9/lib/python3.9/site-packages/botocore/session.py", line 841, in create_client
    client = client_creator.create_client(
  File "/home/tyto/.cache/pypoetry/virtualenvs/s3-collector-combiner-UhjNKLw0-py3.9/lib/python3.9/site-packages/botocore/client.py", line 84, in create_client
    cls = self._create_client_class(service_name, service_model)
  File "/home/tyto/.cache/pypoetry/virtualenvs/s3-collector-combiner-UhjNKLw0-py3.9/lib/python3.9/site-packages/botocore/client.py", line 115, in _create_client_class
    self._event_emitter.emit(
  File "/home/tyto/.cache/pypoetry/virtualenvs/s3-collector-combiner-UhjNKLw0-py3.9/lib/python3.9/site-packages/botocore/hooks.py", line 357, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/home/tyto/.cache/pypoetry/virtualenvs/s3-collector-combiner-UhjNKLw0-py3.9/lib/python3.9/site-packages/botocore/hooks.py", line 228, in emit
    return self._emit(event_name, kwargs)
  File "/home/tyto/.cache/pypoetry/virtualenvs/s3-collector-combiner-UhjNKLw0-py3.9/lib/python3.9/site-packages/botocore/hooks.py", line 211, in _emit
    response = handler(**kwargs)
  File "/home/tyto/.cache/pypoetry/virtualenvs/s3-collector-combiner-UhjNKLw0-py3.9/lib/python3.9/site-packages/boto3/utils.py", line 63, in _handler
    return getattr(module, function_name)(**kwargs)
  File "/home/tyto/.cache/pypoetry/virtualenvs/s3-collector-combiner-UhjNKLw0-py3.9/lib/python3.9/site-packages/boto3/s3/inject.py", line 22, in inject_s3_transfer_methods
    utils.inject_attribute(class_attributes, 'upload_file', upload_file)
  File "/home/tyto/.cache/pypoetry/virtualenvs/s3-collector-combiner-UhjNKLw0-py3.9/lib/python3.9/site-packages/boto3/utils.py", line 70, in inject_attribute
    raise RuntimeError(
RuntimeError: Cannot inject class attribute "upload_file", attribute already exists in class dict.

Process finished with exit code 1
```
Note that with _MULTIPROCESSING = True a different traceback occurs. For now I'm putting that down to multiprocessing surfacing exceptions differently, but I'm happy to provide the output if required.
Versions
```
Linux-5.13.0-7620-generic-x86_64-with-glibc2.33
Python 3.9.5 (default, May 11 2021, 08:20:37)
[GCC 10.3.0]
smart_open 5.2.1
```
Checklist
Before you create the issue, please make sure you have:
- [x] Described the problem clearly
- [x] Provided a minimal reproducible example, including any required data
- [x] Provided the version numbers of the relevant software
I may also be experiencing this issue. I'm on the latest version too, but boto3 throws this when attempting to access a bucket attribute on an S3 client.
Are either of you able to dig a little deeper and diagnose this issue further?
Yes, actually I did. In our test infrastructure we're re-using an already established session. I haven't dug too deeply into things, but perhaps it has to do with a boto3 init function and when that gets executed. I've since removed the code, which did something like this...
```python
import boto3.session
import botocore.session

USE_SHARED_BOTOCORE_SESSION = object()


def __init__(hub):
    # Create a single session for everything else to be run from
    hub.SESSION = botocore.session.Session()


def get(
    hub, botocore_session: botocore.session.Session = USE_SHARED_BOTOCORE_SESSION
) -> boto3.session.Session:
    if botocore_session is USE_SHARED_BOTOCORE_SESSION:
        return boto3.session.Session(botocore_session=hub.SESSION)
    else:
        return boto3.session.Session(botocore_session=botocore_session)
```
Taking a look, I believe I understand what's happening here. In iter_bucket, first we call _list_bucket and then we have a partial call to _download_key, both times passing in the **session_kwargs: https://github.com/RaRe-Technologies/smart_open/blob/59d3a6079b523c030c78a3935622cf94405ce052/smart_open/s3.py#L1179-L1183
https://github.com/RaRe-Technologies/smart_open/blob/59d3a6079b523c030c78a3935622cf94405ce052/smart_open/s3.py#L1184-L1188
My traceback (above) shows that the exception stemmed from a call to _download_key, specifically on line 1246 where the S3 resource is created: https://github.com/RaRe-Technologies/smart_open/blob/59d3a6079b523c030c78a3935622cf94405ce052/smart_open/s3.py#L1245-L1246
The previous call to _list_bucket has a similar start: https://github.com/RaRe-Technologies/smart_open/blob/59d3a6079b523c030c78a3935622cf94405ce052/smart_open/s3.py#L1212-L1213
So, I did a little further digging and I believe I've narrowed down the cause of the exception:
Creating two separate Boto3 Session objects from one botocore session
```python
import boto3
import botocore.session
from smart_open import s3, concurrency

bucket_name = '[BUCKET_NAME]'
concurrency._MULTIPROCESSING = False

# create the botocore session object that we'll pass around
bc_session = botocore.session.Session()

# first, _list_bucket creates a boto3 session from the botocore session,
# by passing in botocore_session=bc_session
b3_session = boto3.session.Session(botocore_session=bc_session)

# then, _list_bucket goes on to create an s3 client from the boto3 session
b3_session.client("s3")

# Up to this point, everything is fine.

# Next comes the call to _download_key, which first creates a boto3 session
# from the botocore session, again
b3_session2 = boto3.session.Session(botocore_session=bc_session)
assert b3_session2 is not b3_session, "These sessions are not the same object."

# then, _download_key creates an S3 resource from the boto3 session
b3_session2.resource("s3")  # This causes the same exception!
```
After that initial confirmation, I removed lines of code and settled on this demo:
```python
import boto3
import botocore.session
from smart_open import s3, concurrency

bucket_name = '[BUCKET_NAME]'
concurrency._MULTIPROCESSING = False

# create the botocore session object that we'll pass around
bc_session = botocore.session.Session()

# create a boto3 session from the above
b3_session = boto3.session.Session(botocore_session=bc_session)

# create another boto3 session from the same botocore session
_ = boto3.session.Session(botocore_session=bc_session)

b3_session.resource("s3")  # This causes the exception!
b3_session.client("s3")    # This *also* causes the exception if you comment out the line above!
```
Conclusion
The issue is that two boto3 sessions are being created from a single botocore session. I think the issue could be sidestepped by passing a boto3 session object around as a parameter instead, as this would avoid creating multiple boto3 sessions from one botocore session.
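The proposed sidestep can be sketched with a toy model (the `Fake*` classes are hypothetical stand-ins, not real boto3): handler registration on the shared botocore session happens once per boto3 session constructed, so building the boto3 session once and passing *that object* around keeps the registration count at one.

```python
class FakeBotocoreSession:
    def __init__(self):
        self.registered_handlers = []


class FakeBoto3Session:
    def __init__(self, botocore_session):
        # Stand-in for boto3 registering inject_s3_transfer_methods et al.
        botocore_session.registered_handlers.append('inject_s3_transfer_methods')


# Current behaviour: the listing helper and the download helper each build
# their own boto3 session from session_kwargs -> the handler registers twice.
bc = FakeBotocoreSession()
FakeBoto3Session(bc)  # built while listing the bucket
FakeBoto3Session(bc)  # built again while downloading a key
print(len(bc.registered_handlers))  # 2 -> double injection, RuntimeError in real boto3

# Proposed sidestep: build the boto3 session once and pass the object itself.
bc2 = FakeBotocoreSession()
shared_b3 = FakeBoto3Session(bc2)   # created once by the caller
list_bucket_session = shared_b3     # the listing helper receives the same object
download_key_session = shared_b3    # the download helper receives the same object
print(len(bc2.registered_handlers))  # 1 -> injection happens exactly once
```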
I think reusing the session makes sense. Are you able to make a PR?
@SootyOwl What is the use case? Why are you attempting to pass a botocore session? Are you sure that object is safe to pass across a subprocess boundary?
Under the hood, smart_open spins up multiple subprocesses, and passes your session to them. If the session is not designed for this, then things will break.
As discussed in #749, botocore sessions are not picklable, so it is impossible to pass them to a child process, which is what iter_bucket does (it starts a bunch of subprocesses to do work in parallel, and passes session_kwargs to them).
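A minimal stdlib illustration of that failure mode: botocore sessions hold event handlers defined inside functions, and pickle serializes functions by reference, so a local object like `lazy_call.<locals>._handler` cannot cross a process boundary (the function names here just echo the traceback, they are not botocore's real code).

```python
import pickle


def lazy_call():
    # A closure defined inside a function, like boto3's lazy event handlers.
    def _handler():
        pass
    return _handler


handler = lazy_call()
try:
    pickle.dumps(handler)  # pickling by reference cannot find a local object
except (pickle.PicklingError, AttributeError) as err:
    print(err)  # e.g. Can't pickle local object 'lazy_call.<locals>._handler'
```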
You can work around this behavior by disabling multiprocessing:
```python
smart_open.concurrency._MULTIPROCESSING = False
```