Illegal seek when uploading a data stream
Hi all,
I'm trying to use requests-toolbelt to stream a file that is generated on the fly. This use case does not currently seem to work because, among other things, requests-toolbelt requires the ability to determine the length of the input. The error I get is:
Traceback (most recent call last):
File "bin/upload_file.py", line 175, in <module>
main()
File "bin/upload_file.py", line 168, in main
print m.to_string()
File "/Users/dhalperi/Envs/myria-python/lib/python2.7/site-packages/requests_toolbelt/multipart.py", line 112, in to_string
return self.read()
File "/Users/dhalperi/Envs/myria-python/lib/python2.7/site-packages/requests_toolbelt/multipart.py", line 128, in read
self._load_bytes(size)
File "/Users/dhalperi/Envs/myria-python/lib/python2.7/site-packages/requests_toolbelt/multipart.py", line 154, in _load_bytes
written += self._consume_current_data(size)
File "/Users/dhalperi/Envs/myria-python/lib/python2.7/site-packages/requests_toolbelt/multipart.py", line 173, in _consume_current_data
super_len(self._current_data) > 0):
File "/Users/dhalperi/Envs/myria-python/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/utils.py", line 50, in super_len
return len(o)
File "/Users/dhalperi/Envs/myria-python/lib/python2.7/site-packages/requests_toolbelt/multipart.py", line 262, in __len__
return super_len(self.fd) - self.fd.tell()
IOError: [Errno 29] Illegal seek
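For reference, the tell() call in the last frame appears to be the proximate failure: a pipe has no file position, so tell() raises ESPIPE. A minimal reproduction of just that call:

import os

r, w = os.pipe()
reader = os.fdopen(r, 'r')
# Pipes are not seekable, so this raises
# IOError: [Errno 29] Illegal seek
reader.tell()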
Here is a simple test program I wrote to demonstrate the bug:
import csv
import os

from requests_toolbelt import MultipartEncoder
import requests

simple_csv = """1,2
3,4
5,6"""


def test1():
    m = MultipartEncoder(fields={
        'file1': ('file', simple_csv, 'text/plain'),
    })
    print m.to_string()


def test2():
    m = MultipartEncoder(fields={
        'file2': ('file', open('test.csv', 'r'), 'text/plain'),
    })
    print m.to_string()


def test3():
    # Construct a list of lists; internal lists are rows of the CSV file.
    vals = [[int(s) for s in line.strip().split(',')]
            for line in simple_csv.split('\n')]
    # Build reader and writer objects using os.pipe so we can stream writing
    # and reading.
    r, w = os.pipe()
    reader = os.fdopen(r, 'r')
    writer = os.fdopen(w, 'w')
    # Do all the writing. In a real app, this would be done in parallel
    # with the reading.
    csv_writer = csv.writer(writer)
    for row in vals:
        csv_writer.writerow(row)
    writer.close()
    # Read the entire pipe out as a string, to make sure the pipe works.
    back_to_string = reader.read()
    m = MultipartEncoder(fields={
        'file3': ('file', back_to_string, 'text/plain'),
    })
    print m.to_string()


def test4():
    # Construct a list of lists; internal lists are rows of the CSV file.
    vals = [[int(s) for s in line.strip().split(',')]
            for line in simple_csv.split('\n')]
    # Build reader and writer objects using os.pipe so we can stream writing
    # and reading.
    r, w = os.pipe()
    reader = os.fdopen(r, 'r')
    writer = os.fdopen(w, 'w')
    # Do all the writing. In a real app, this would be done in parallel
    # with the reading.
    csv_writer = csv.writer(writer)
    for row in vals:
        csv_writer.writerow(row)
    writer.close()
    # Build the multipart request to read in a streaming fashion from the
    # reader.
    m = MultipartEncoder(fields={
        'file4': ('file', reader, 'text/plain'),
    })
    print m.to_string()


test1()
test2()
test3()
test4()
Note that variants 1, 2, and 3 work; only variant 4 is broken.
Is there any hope of making this work with requests-toolbelt, or is the ability to determine the file length critical to how this package works?
Thanks! Dan
Output:
--55ff117fd70547cabb236396b33fbcb2
Content-Disposition: form-data; name="file1"; filename="file"
Content-Type: text/plain
1,2
3,4
5,6
--55ff117fd70547cabb236396b33fbcb2--
--12163767722346eeacfa122420a38855
Content-Disposition: form-data; name="file2"; filename="file"
Content-Type: text/plain
1,2
3,4
5,6
--12163767722346eeacfa122420a38855--
--68525a11bc354befaf86f52f428e2c4d
Content-Disposition: form-data; name="file3"; filename="file"
Content-Type: text/plain
1,2
3,4
5,6
--68525a11bc354befaf86f52f428e2c4d--
Traceback (most recent call last):
File "test.py", line 70, in <module>
test4()
File "test.py", line 65, in test4
print m.to_string()
File "/Users/dhalperi/Envs/myria-python/lib/python2.7/site-packages/requests_toolbelt/multipart.py", line 112, in to_string
return self.read()
File "/Users/dhalperi/Envs/myria-python/lib/python2.7/site-packages/requests_toolbelt/multipart.py", line 128, in read
self._load_bytes(size)
File "/Users/dhalperi/Envs/myria-python/lib/python2.7/site-packages/requests_toolbelt/multipart.py", line 154, in _load_bytes
written += self._consume_current_data(size)
File "/Users/dhalperi/Envs/myria-python/lib/python2.7/site-packages/requests_toolbelt/multipart.py", line 173, in _consume_current_data
super_len(self._current_data) > 0):
File "/Users/dhalperi/Envs/myria-python/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/utils.py", line 50, in super_len
return len(o)
File "/Users/dhalperi/Envs/myria-python/lib/python2.7/site-packages/requests_toolbelt/multipart.py", line 262, in __len__
return super_len(self.fd) - self.fd.tell()
IOError: [Errno 29] Illegal seek
Hi, thanks for raising this!
The short answer is that, as it's currently written, we need the full length of the file. We could in principle remove that requirement, but then we'd need to send the body without a Content-Length. That requires a chunked upload, and server support for chunked uploads is really quite poor. Do you know whether your server supports them, particularly when combined with multipart upload?
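For reference, requests itself will already send a chunked body if you hand it a bare iterator with no length; a minimal sketch (the URL is just a placeholder):

import requests

def rows():
    # No __len__ here, so requests cannot set a Content-Length and
    # falls back to Transfer-Encoding: chunked.
    yield b'1,2\n'
    yield b'3,4\n'

requests.post('http://example.com/upload', data=rows())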
@Lukasa thanks for the quick response.
I am using Jersey, and it looks like it does support chunked encoding in multipart upload, as long as the user does the right trickery to enable it on the client: http://davidbuccola.blogspot.com/2009/09/configure-jersey-chunked-encoding-when.html
and/or the server: https://github.com/aruld/jersey2-multipart-sample
But I'm by no means an expert here, so I'm not certain :).
Hey that's interesting! I'm impressed by Jersey.
@sigmavirus24, how much work do we think it would be to let the streaming encoder do chunked?
Another fun tidbit. On the Java server side, we get the data to stream by declaring the parameter as an InputStream (instead of, e.g., a pre-defined class/struct). It looks like only one such parameter is allowed (which I think makes sense when you think about it), but this means that from Python it also has to be the last entry in the fields dictionary. Concretely:
This did not work:
m = MultipartEncoder(fields={
    'relationKey': ('relationKey', json.dumps(relation_key), 'application/json'),
    'schema': ('schema', json.dumps(schema), 'application/json'),
    'data': ('data', body, 'text/csv'),
})
But this did:
fields = OrderedDict()
fields['relationKey'] = ('relationKey', json.dumps(relation_key), 'application/json')
fields['schema'] = ('schema', json.dumps(schema), 'application/json')
fields['data'] = ('data', body, 'text/csv')
m = MultipartEncoder(fields=fields)
because the order of the keys in the default dictionary was wrong.
You can pass a list of tuples, e.g.,
m = MultipartEncoder(fields=[
    ('relationKey', ('relationKey', json.dumps(relation_key), 'application/json')),
    ('schema', ('schema', json.dumps(schema), 'application/json')),
    ('data', ('data', body, 'text/csv')),
])
We turn your dictionary into a list anyway, so we will happily preserve that order.
Allowing for either might be a bit difficult. Here's the thing: we'd have to provide two different interfaces to the data.
As it is now, requests determines how the data is sent. If the object has a length (i.e., __len__ is defined) and a read method, it will stream the data as the encoder currently does. For requests to chunk the data, we would have to exclude the length and provide a way to iterate over the object.
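To illustrate the difference, here's a hypothetical wrapper (not part of the toolbelt; the name is made up) that hides the length and exposes only iteration, which is the shape requests needs before it will send a chunked body:

class IterableBody(object):
    # Expose only __iter__: with no __len__ or read(), requests cannot
    # compute a Content-Length and falls back to chunked transfer encoding.
    def __init__(self, readable, chunk_size=8192):
        self.readable = readable
        self.chunk_size = chunk_size

    def __iter__(self):
        chunk = self.readable.read(self.chunk_size)
        while chunk:
            yield chunk
            chunk = self.readable.read(self.chunk_size)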
I have a pattern in mind, but I'm not sure how well it would work. I'll need to think about this. Until then, as long as you always put the data field last, you can pass a list.
@sigmavirus24 I saw your mail but perhaps did not process it until now.
1. Your tip about passing a list worked out perfectly! I simplified the code a bunch by dropping the OrderedDict.
2. Are you saying that I might be able to avoid the seek error as long as the data is last? I was under the impression this issue was going to crop up regardless. (For now, we're pulling the entire dataset into memory and passing the string to requests-toolbelt.)
Re (2): I worked up a version of my code that does this, and I still get the same seek error.
Your tip about passing a list worked out perfectly!
I'm glad!
Are you saying that I might be able to avoid the seek error as long as the data is last?
I thought you had said that it was not happening if that was the case, but I misunderstood you. I have an idea for this fix though :)
I wish I knew what the idea I had was. At the moment, the best I can think of is to have a ChunkedMultipartEncoder. It means that we'd have to change some of the logic of how we read from file-like objects. We'd have to ensure that read sizes are always > 0, and for any positive size, if we ever get an empty read we stop using that file. That's the only drawback I can see: if someone uses a file-like object with a non-blocking read, there won't be any way for the rest of the data to get through. Does that sound like a reasonable trade-off?
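Roughly, the per-part read rule would look like this (a sketch of the proposal, not actual toolbelt code):

def iter_parts(parts, chunk_size=8192):
    # Always request a positive number of bytes from each part; under the
    # blocking-read assumption, an empty read means that part is exhausted
    # and we move on to the next one.
    for fd in parts:
        while True:
            data = fd.read(chunk_size)
            if not data:
                break
            yield data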
File objects with non-blocking reads are generally a terrible idea. The assumption is usually that .read() with non-zero length will return data if the 'file' hasn't reached EOF, or will return nothing if it has. I added the .stream() method to urllib3 to get around the fact that it basically violates that assumption.
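For illustration, this is roughly how it's used (the URL is a placeholder):

import urllib3

http = urllib3.PoolManager()
r = http.request('GET', 'http://example.com/data', preload_content=False)
# stream() yields chunks as they arrive and stops at EOF, instead of
# assuming a single read() call will return everything.
for chunk in r.stream(1024):
    print len(chunk)
r.release_conn()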
File objects with non-blocking reads are generally a terrible idea.
Agreed.
I added the .stream() method to urllib3 to get around the fact that it basically violates that assumption.
I'm not sure I understand the significance entirely.
One other thing I forgot to mention about how we'd need to implement the chunked uploads: there would be no need for a read method. The object would have to implement __iter__, since requests expects a chunked body to be iterable.
So the majority of the logic would be the same; only the interface would differ. That means it boils down to one class that handles the logic and two subclasses that implement the specific interfaces.
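Something like this shape, purely as a sketch (the split and the ChunkedMultipartEncoder name are hypothetical, not something the toolbelt ships today):

class _BaseEncoder(object):
    # All the shared multipart logic (boundaries, headers, part bodies)
    # would live here.
    def _next_chunk(self, size):
        raise NotImplementedError


class MultipartEncoder(_BaseEncoder):
    # Length-aware interface: requests sees __len__ and read(), so it
    # streams the body with a Content-Length header.
    def __len__(self):
        return self._total_length  # computed up front from the parts

    def read(self, size=-1):
        return self._next_chunk(size)


class ChunkedMultipartEncoder(_BaseEncoder):
    # Length-free interface: only __iter__, so requests sends the body
    # with Transfer-Encoding: chunked.
    def __iter__(self):
        chunk = self._next_chunk(8192)
        while chunk:
            yield chunk
            chunk = self._next_chunk(8192)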