smart_open
Reading a 0 B file from s3 raises KeyError
Problem description
This problem is somewhat similar to the one described in https://github.com/RaRe-Technologies/smart_open/issues/548, though the error is not the same.
Be sure your description clearly answers the following questions:
- What are you trying to achieve? I'm trying to read the file that might be empty in S3.
- What is the expected result? The file is read without exceptions.
- What are you seeing instead? KeyError exception is thrown
Steps/code to reproduce the problem
- Have an empty file in S3
- Run the following code
from smart_open import open
with open('S3_uri', 'rb') as file:
    file.read()
Traceback:
Traceback (most recent call last):
File "/home/user/.local/lib/python3.8/site-packages/smart_open/s3.py", line 330, in _get
return client.get_object(Bucket=bucket, Key=key, Range=range_string)
File "/home/user/.local/lib/python3.8/site-packages/botocore/client.py", line 386, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/home/user/.local/lib/python3.8/site-packages/botocore/client.py", line 705, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidRange) when calling the GetObject operation: The requested range is not satisfiable
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/user/.local/lib/python3.8/site-packages/smart_open/s3.py", line 438, in _open_body
response = _get(
File "/home/user/.local/lib/python3.8/site-packages/smart_open/s3.py", line 338, in _get
raise wrapped_error from error
OSError: unable to access bucket: 'mybucket' key: 'existing_file' version: None error: An error occurred (InvalidRange) when calling the GetObject operation: The requested range is not satisfiable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user/.local/lib/python3.8/site-packages/smart_open/smart_open_lib.py", line 235, in open
binary = _open_binary_stream(uri, binary_mode, transport_params)
File "/home/user/.local/lib/python3.8/site-packages/smart_open/smart_open_lib.py", line 398, in _open_binary_stream
fobj = submodule.open_uri(uri, mode, transport_params)
File "/home/user/.local/lib/python3.8/site-packages/smart_open/s3.py", line 224, in open_uri
return open(parsed_uri['bucket_id'], parsed_uri['key_id'], mode, **kwargs)
File "/home/user/.local/lib/python3.8/site-packages/smart_open/s3.py", line 291, in open
fileobj = Reader(
File "/home/user/.local/lib/python3.8/site-packages/smart_open/s3.py", line 574, in __init__
self.seek(0)
File "/home/user/.local/lib/python3.8/site-packages/smart_open/s3.py", line 666, in seek
self._current_pos = self._raw_reader.seek(offset, whence)
File "/home/user/.local/lib/python3.8/site-packages/smart_open/s3.py", line 417, in seek
self._open_body(start, stop)
File "/home/user/.local/lib/python3.8/site-packages/smart_open/s3.py", line 450, in _open_body
self._position = self._content_length = int(error_response['ActualObjectSize'])
KeyError: 'ActualObjectSize'
Versions
Please provide the output of:
smart_open 5.2.1
Checklist
Before you create the issue, please make sure you have:
- [X] Described the problem clearly
- [X] Provided a minimal reproducible example, including any required data
- [X] Provided the version numbers of the relevant software
I am seeing the same issue when attempting to open a 0 B file.
This PR was supposedly merged to fix this issue, but it actually introduces the KeyError reported above.
The expected key 'ActualObjectSize' is not present in the error response of the botocore.exceptions.ClientError, which is the wrapped error raised by the boto3.client.get_object call.
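To make the failure mode concrete, here is a minimal sketch that needs no AWS access. The two dictionaries are hypothetical stand-ins for the error payload that smart_open indexes in `_open_body`, and `size_from_error` is an illustrative helper, not part of smart_open. It assumes some InvalidRange responses carry `ActualObjectSize` and others, as in the traceback above, do not.

```python
# Hypothetical error payloads, shaped like the dict that _open_body indexes.
# These values are made up for illustration; they are not captured from S3.
payload_with_size = {
    "Code": "InvalidRange",
    "Message": "The requested range is not satisfiable",
    "ActualObjectSize": "0",
}
payload_without_size = {
    "Code": "InvalidRange",
    "Message": "The requested range is not satisfiable",
}


def size_from_error(error_response):
    """Return the size reported in the payload, or None when the key is absent.

    Illustrative helper (not smart_open API): using .get() avoids the KeyError
    that a plain error_response['ActualObjectSize'] lookup raises.
    """
    value = error_response.get("ActualObjectSize")
    return None if value is None else int(value)


print(size_from_error(payload_with_size))
print(size_from_error(payload_without_size))
```

A direct `payload_without_size['ActualObjectSize']` lookup raises the same KeyError shown in the traceback, so any fix has to decide what to do when the helper returns None.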
I propose that instead of trying to get 'ActualObjectSize' from the wrapped error object, we instead get the content length by making a get_object call without the range_string if there is an InvalidRange error:
self._position = self._content_length = self._client.get_object(Bucket=self._bucket, Key=self._key)["ContentLength"]
Do we need to make an additional call? If yes, then I'd rather avoid doing so unless it's absolutely necessary.
Are you interested in making a PR?
I am still facing the "ClientError: An error occurred (416) when calling the GetObject operation: Requested Range Not Satisfiable" error with the latest version, 6.2.0, for files with 0 bytes, even though it is supposed to be fixed in #548.
I created a PR calling get_object only when we get a KeyError when accessing ActualObjectSize. This way it should limit unnecessary HTTP calls.
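The fallback described here could look roughly like the sketch below. It is not the code from the PR: `resolve_content_length` and `FakeClient` are hypothetical names introduced for illustration, and the fake client only mimics the `get_object(Bucket=..., Key=...)["ContentLength"]` shape used in the proposal above so the example runs without AWS credentials.

```python
class FakeClient:
    """Hypothetical stand-in for a boto3 S3 client that counts get_object calls."""

    def __init__(self, content_length):
        self.content_length = content_length
        self.calls = 0

    def get_object(self, Bucket, Key):
        # A real client would perform an HTTP request here.
        self.calls += 1
        return {"ContentLength": self.content_length}


def resolve_content_length(error_response, client, bucket, key):
    """Prefer the size from the error payload; fall back to one extra request.

    Illustrative helper (not smart_open API). The extra get_object call is
    made only when 'ActualObjectSize' is missing from the payload.
    """
    try:
        return int(error_response["ActualObjectSize"])
    except KeyError:
        return client.get_object(Bucket=bucket, Key=key)["ContentLength"]


client = FakeClient(content_length=0)

# Payload carries the size: no extra request is made.
size_a = resolve_content_length(
    {"Code": "InvalidRange", "ActualObjectSize": "5"}, client, "mybucket", "existing_file"
)

# Payload lacks the size: exactly one fallback request is made.
size_b = resolve_content_length(
    {"Code": "InvalidRange"}, client, "mybucket", "existing_file"
)
```

With this ordering the common path stays free of additional requests, which addresses the concern about unnecessary calls raised earlier in the thread.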