internetarchive
internetarchive copied to clipboard
Fixes memoryview error when running `Item.upload` with `StringIO` input and `verbose=True`
The Bug
When running, for example:
import io
import internetarchive as ia
ia.upload("ia_bug_test", {"test.txt": StringIO("test")}, access_key="", secret_key="", verbose=True)
instead of the expected message:
uploading test.txt: 100%|███████████████████████████████████████████████████| 1/1 [00:00<00:00, 10.30MiB/s]
the Python output is a long error ending in:
TypeError: memoryview: a bytes-like object is required, not 'str'
(the entire error is pasted at the end of this message). This only occurs when the input is a StringIO
object and verbose=True
.
This error is not reproducible in the test suite because it depends on an interaction between internetarchive
and the internals of http
. http
checks to see if body input is a TextIOBase
object. If it is a TextIOBase
, http
encodes the string output of the TextIOBase
into bytes. However, only when verbose=True
, internetarchive
wraps the TextIOBase
in a tqdm
iterable and then IterableToFileAdapter
, which hides the information that, internally, the data from this object is from a TextIOBase
. This causes http
's check to return False
, so http
does not encode the strings into bytes, which eventually causes this memoryview error.
The Fix
The proposed fix to this issue makes a small modification to IterableToFileAdapter
which automatically encode
s data as it is read out. The pre_encode
flag is set to enable this feature when we observe a TextIOBase
, but it could be expanded to other situations where we have an input that http
won't recognize and encode for us.
Appendix
uploading test.txt: 0%| | 0/1 [00:00<?, ?MiB/s]Traceback (most recent call last):
File "/home/varun/harvard-lil/internetarchive/internetarchive/session.py", line 540, in send
reraise_modify(e, e.request.url, prepend=False) # type: ignore
AttributeError: 'TypeError' object has no attribute 'request'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/varun/harvard-lil/internetarchive/../perma/ia_test.py", line 7, in <module>
internetarchive.upload("perma_cc_mvarun_bulktest", {"test.txt": StringIO("test")}, access_key=ACCESS, secret_key=SECRET, verbose=True)
File "/home/varun/harvard-lil/internetarchive/internetarchive/api.py", line 277, in upload
return item.upload(
File "/home/varun/harvard-lil/internetarchive/internetarchive/item.py", line 1190, in upload
resp = self.upload_file(body,
File "/home/varun/harvard-lil/internetarchive/internetarchive/item.py", line 1006, in upload_file
response = self.session.send(prepared_request,
File "/home/varun/harvard-lil/internetarchive/internetarchive/session.py", line 543, in send
raise e
File "/home/varun/harvard-lil/internetarchive/internetarchive/session.py", line 537, in send
r = super().send(request, **kwargs)
File "/home/varun/harvard-lil/internetarchive/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
File "/home/varun/harvard-lil/internetarchive/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
resp = conn.urlopen(
File "/home/varun/harvard-lil/internetarchive/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/home/varun/harvard-lil/internetarchive/lib/python3.10/site-packages/urllib3/connectionpool.py", line 398, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/home/varun/harvard-lil/internetarchive/lib/python3.10/site-packages/urllib3/connection.py", line 239, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/usr/lib/python3.10/http/client.py", line 1282, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.10/http/client.py", line 1328, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.10/http/client.py", line 1277, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.10/http/client.py", line 1076, in _send_output
self.send(chunk)
File "/usr/lib/python3.10/http/client.py", line 1002, in send
self.sock.sendall(d)
File "/usr/lib/python3.10/ssl.py", line 1234, in sendall
with memoryview(data) as view, view.cast("B") as byte_view:
TypeError: memoryview: a bytes-like object is required, not 'str'
Thank you @varun-iyer! This looks good to me besides the failed checks below.
@cclauss Any feedback besides fixing the failed checks?
For bytes
, I would normally use io.BytesIO()
or memoryview()
instead of io.StringIO()
.
@cclauss, I'm not sure I fully understand your comment --- but the basic problem here is that we have a StringIO
input that, at some point, needs to be converted into bytes since http
won't do that for us automatically because of the IterableToFileAdapter
wrapper.
The current PR does this conversion as bytes are read out of the adapter. However, it could possibly be done earlier in the process by converting the StringIO
into a BytesIO
object before it goes into the IterableToFileAdapter
. Is this what you are suggesting?
I didn't implement it this way initially because it would require bytes to be read out of the IO object twice over, once during the conversion and once out of the converted object. This might be worse for performance, but the effect is likely small, so I could switch it over if that makes more sense.
my_string.encode() --> bytes my_bytes.decode() --> str
@cclauss sorry --- I’m still not understanding what change you’re proposing to the pull request!
To be clear, I am not trying to upload raw bytes to IA, and StringIO input is a documented feature that works as expected when verbose=False
.
@jjjake Please approve the GitHub Actions so the tests are run.
I'm currently able to reproduce the two mypy
test failures on jjjake/internetarchive:master
--- maybe some upstream change?
It looks like an explicit type specification fixed the issue, but I don't have much experience with mypy. In any case, tests are passing now!
@varun-iyer Looks like we're having issues with the CaseInsensitiveDict
import. It does look like we can import it from requests
though:
>>> from requests.structures import CaseInsensitiveDict
Besides that, looking good! Thanks again.
Fixed! thanks for the suggestion @jjjake!
Thanks @varun-iyer! I'll merge after the tests finish.
Shoot, looks like it still didn't like that. Perhaps ignoring it, as you suggested, is the way to go. I'm still familiarizing myself with mypy as well, so I'm curious if @cclauss has any feedback here. Thanks again, and my apologies for the issues!
Arg, I wasn't running into this with pre-commit yesterday. I tried switching to typing.cast
instead, that seems to do the trick locally.
@jjjake --- it looks like this is a problem with the change in typing syntax with Python 3.9/3.10 --- I'm thinking of giving up on whack-a-mole and just having mypy ignore this line, if that works for you. The problem anyway is that the variable is typed to Dict[Union[str, bytes]]
and mypy
is losing it when trying to assign a Dict[str]
which is always a subset of Dict[Union[str, bytes]]
...
@varun-iyer Yes, having mypy ignore this line works for me. Thanks a lot for the contribution and research into the problem, I appreciate it!
@jjjake happy to help! Changed the line to # type: ignore
I ignored that line also in #542
@jjjake @JustAnotherArchivist addressed the style issues, left the encoding alone for now. Let me know what your thoughts are.
Update: did some digging, found that there are good reasons for encoding (described above). In lieu of modification, included comments explaining choice of encoding.
Thanks again @varun-iyer!