internetarchive icon indicating copy to clipboard operation
internetarchive copied to clipboard

Fixes memoryview error when running `Item.upload` with `StringIO` input and `verbose=True`

Open varun-magesh opened this issue 2 years ago • 17 comments

The Bug

When running, for example:

import io
import internetarchive as ia

ia.upload("ia_bug_test", {"test.txt": StringIO("test")}, access_key="", secret_key="", verbose=True)

instead of the expected message:

 uploading test.txt: 100%|███████████████████████████████████████████████████| 1/1 [00:00<00:00, 10.30MiB/s]

the Python output is a long error ending in:

TypeError: memoryview: a bytes-like object is required, not 'str'

(the entire error is pasted at the end of this message). This only occurs when the input is a StringIO object and verbose=True.

This error is not reproducible in the test suite because it depends on an interaction between internetarchive and the internals of http. http checks to see if body input is a TextIOBase object. If it is a TextIOBase, http encodes the string output of the TextIOBase into bytes. However, only when verbose=True, internetarchive wraps the TextIOBase in a tqdm iterable and then IterableToFileAdapter, which hides the information that, internally, the data from this object is from a TextIOBase. This causes http's check to return False, so http does not encode the strings into bytes, which eventually causes this memoryview error.

The Fix

The proposed fix to this issue makes a small modification to IterableToFileAdapter which automatically encodes data as it is read out. The pre_encode flag is set to enable this feature when we observe a TextIOBase, but it could be expanded to other situations where we have an input that http won't recognize and encode for us.

Appendix

 uploading test.txt:   0%|                                                           | 0/1 [00:00<?, ?MiB/s]Traceback (most recent call last):
  File "/home/varun/harvard-lil/internetarchive/internetarchive/session.py", line 540, in send
    reraise_modify(e, e.request.url, prepend=False)  # type: ignore
AttributeError: 'TypeError' object has no attribute 'request'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/varun/harvard-lil/internetarchive/../perma/ia_test.py", line 7, in <module>
    internetarchive.upload("perma_cc_mvarun_bulktest", {"test.txt": StringIO("test")}, access_key=ACCESS, secret_key=SECRET, verbose=True)
  File "/home/varun/harvard-lil/internetarchive/internetarchive/api.py", line 277, in upload
    return item.upload(
  File "/home/varun/harvard-lil/internetarchive/internetarchive/item.py", line 1190, in upload
    resp = self.upload_file(body,
  File "/home/varun/harvard-lil/internetarchive/internetarchive/item.py", line 1006, in upload_file
    response = self.session.send(prepared_request,
  File "/home/varun/harvard-lil/internetarchive/internetarchive/session.py", line 543, in send
    raise e
  File "/home/varun/harvard-lil/internetarchive/internetarchive/session.py", line 537, in send
    r = super().send(request, **kwargs)
  File "/home/varun/harvard-lil/internetarchive/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/varun/harvard-lil/internetarchive/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/home/varun/harvard-lil/internetarchive/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/varun/harvard-lil/internetarchive/lib/python3.10/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/home/varun/harvard-lil/internetarchive/lib/python3.10/site-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/lib/python3.10/http/client.py", line 1282, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1328, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1277, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1076, in _send_output
    self.send(chunk)
  File "/usr/lib/python3.10/http/client.py", line 1002, in send
    self.sock.sendall(d)
  File "/usr/lib/python3.10/ssl.py", line 1234, in sendall
    with memoryview(data) as view, view.cast("B") as byte_view:
TypeError: memoryview: a bytes-like object is required, not 'str'

varun-magesh avatar Jul 28 '22 19:07 varun-magesh

Thank you @varun-iyer! This looks good to me besides the failed checks below.

@cclauss Any feedback besides fixing the failed checks?

jjjake avatar Jul 29 '22 16:07 jjjake

For bytes, I would normally use io.BytesIO() or memoryview() instead of io.StringIO().

cclauss avatar Jul 30 '22 00:07 cclauss

@cclauss, I'm not sure I fully understand your comment --- but the basic problem here is that we have a StringIO input that, at some point, needs to be converted into bytes since http won't do that for us automatically because of the IterableToFileAdapter wrapper.

The current PR does this conversion as bytes are read out of the adapter. However, it could possibly be done earlier in the process by converting the StringIO into a BytesIO object before it goes into the IterableToFileAdapter. Is this what you are suggesting?

I didn't implement it this way initially because it would require bytes to be read out of the IO object twice over, once during the conversion and once out of the converted object. This might be worse for performance, but the effect is likely small, so I could switch it over if that makes more sense.

varun-magesh avatar Aug 01 '22 19:08 varun-magesh

my_string.encode() --> bytes my_bytes.decode() --> str

cclauss avatar Aug 01 '22 20:08 cclauss

@cclauss sorry --- I’m still not understanding what change you’re proposing to the pull request! To be clear, I am not trying to upload raw bytes to IA, and StringIO input is a documented feature that works as expected when verbose=False.

varun-magesh avatar Aug 01 '22 21:08 varun-magesh

@jjjake Please approve the GitHub Actions so the tests are run.

cclauss avatar Aug 01 '22 21:08 cclauss

I'm currently able to reproduce the two mypy test failures on jjjake/internetarchive:master --- maybe some upstream change?

It looks like an explicit type specification fixed the issue, but I don't have much experience with mypy. In any case, tests are passing now!

varun-magesh avatar Aug 01 '22 22:08 varun-magesh

@varun-iyer Looks like we're having issues with the CaseInsensitiveDict import. It does look like we can import it from requests though:

>>> from requests.structures import CaseInsensitiveDict

Besides that, looking good! Thanks again.

jjjake avatar Aug 02 '22 19:08 jjjake

Fixed! thanks for the suggestion @jjjake!

varun-magesh avatar Aug 02 '22 19:08 varun-magesh

Thanks @varun-iyer! I'll merge after the tests finish.

jjjake avatar Aug 02 '22 19:08 jjjake

Shoot, looks like it still didn't like that. Perhaps ignoring it, as you suggested, is the way to go. I'm still familiarizing myself with mypy as well, so I'm curious if @cclauss has any feedback here. Thanks again, and my apologies for the issues!

jjjake avatar Aug 02 '22 19:08 jjjake

Arg, I wasn't running into this with pre-commit yesterday. I tried switching to typing.cast instead, that seems to do the trick locally.

varun-magesh avatar Aug 02 '22 19:08 varun-magesh

@jjjake --- it looks like this is a problem with the change in typing syntax with Python 3.9/3.10 --- I'm thinking of giving up on whack-a-mole and just having mypy ignore this line, if that works for you. The problem anyway is that the variable is typed to Dict[Union[str, bytes]] and mypy is losing it when trying to assign a Dict[str] which is always a subset of Dict[Union[str, bytes]]...

varun-magesh avatar Aug 04 '22 20:08 varun-magesh

@varun-iyer Yes, having mypy ignore this line works for me. Thanks a lot for the contribution and research into the problem, I appreciate it!

jjjake avatar Aug 04 '22 20:08 jjjake

@jjjake happy to help! Changed the line to # type: ignore

varun-magesh avatar Aug 04 '22 20:08 varun-magesh

I ignored that line also in #542

cclauss avatar Aug 04 '22 20:08 cclauss

@jjjake @JustAnotherArchivist addressed the style issues, left the encoding alone for now. Let me know what your thoughts are.

Update: did some digging, found that there are good reasons for encoding (described above). In lieu of modification, included comments explaining choice of encoding.

varun-magesh avatar Aug 14 '22 06:08 varun-magesh

Thanks again @varun-iyer!

jjjake avatar Aug 15 '22 17:08 jjjake