urllib3 icon indicating copy to clipboard operation
urllib3 copied to clipboard

Twice memory usage in `encode_multipart_formdata`

Open Axik opened this issue 10 years ago • 25 comments

Basically, if you wants to upload file like 1GB it will cost you 2GB RAM.

Filename: /Users/kyrylo/.virtualenvs/drsdk/lib/python2.7/site-packages/requests/packages/urllib3/filepost.py

Line #    Mem usage    Increment   Line Contents
================================================
    59    981.3 MiB      0.0 MiB   @profile
    60                             def encode_multipart_formdata(fields, boundary=None):
    61                                 """
    62                                 Encode a dictionary of ``fields`` using the multipart/form-data MIME format.
    63                             
    64                                 :param fields:
    65                                     Dictionary of fields or list of (key, :class:`~urllib3.fields.RequestField`).
    66                             
    67                                 :param boundary:
    68                                     If not specified, then a random boundary will be generated using
    69                                     :func:`mimetools.choose_boundary`.
    70                                 """
    71    981.3 MiB      0.0 MiB       body = BytesIO()
    72    981.3 MiB      0.0 MiB       if boundary is None:
    73    981.3 MiB      0.0 MiB           boundary = choose_boundary()
    74                             
    75   1920.7 MiB    939.4 MiB       for field in iter_field_objects(fields):
    76    981.3 MiB   -939.4 MiB           body.write(b('--%s\r\n' % (boundary)))    77                             
    78    981.3 MiB      0.0 MiB           writer(body).write(field.render_headers())
    79    981.3 MiB      0.0 MiB           data = field.data
    80                             
    81    981.3 MiB      0.0 MiB           if isinstance(data, int):
    82                                         data = str(data)  # Backwards compatibility
    83                             
    84    981.3 MiB      0.0 MiB           if isinstance(data, six.text_type):
    85                                         writer(body).write(data)
    86                                     else:
    87   1920.7 MiB    939.4 MiB               body.write(data)
    88                             
    89   1920.7 MiB      0.0 MiB           body.write(b'\r\n')
    90                             
    91   1920.7 MiB      0.0 MiB       body.write(b('--%s--\r\n' % (boundary)))
    92                             
    93   1920.7 MiB      0.0 MiB       content_type = str('multipart/form-data; boundary=%s' % boundary)
    94                             
    95   2000.1 MiB     79.4 MiB       return body.getvalue(), content_type

P.S. using this profiler https://pypi.python.org/pypi/memory_profiler

Requirements from previous attempt

  • Add urllib3.multipart with MultipartEncoder and MultipartDecoder
  • Encoder supports a complete rewind with .seek(0, 0) which attempts to rewind all body parts.
  • Add Part class for representing a body part
  • Part should match BufferedReader with peek() and read() for chunks
  • encode_multipart_formdata should use the API internally.
  • Verify memory usage with memray

Axik avatar May 15 '15 00:05 Axik

Right, 1x to read the data and 1x to store our encoded copy of it.

We had some discussions about converting this into a generator[0] which would help bring it down to 1x unless you can provide the size upfront. I think there's a prototype or two somewhere, but it was never finished.

Is this something you'd be interested in working on?

[0] A very old filepost-stream branch with some in-progress work. I thought there was a PR somewhere too, but I can't find it.

shazow avatar May 15 '15 01:05 shazow

Filename: /Users/kyrylo/.virtualenvs/drsdk/lib/python2.7/site-packages/requests/packages/urllib3/filepost.py

Line #    Mem usage    Increment   Line Contents
================================================
    59    284.1 MiB      0.0 MiB   @profile
    60                             def encode_multipart_formdata(fields, boundary=None):
    61                                 """
    62                                 Encode a dictionary of ``fields`` using the multipart/form-data MIME format.
    63                             
    64                                 :param fields:
    65                                     Dictionary of fields or list of (key, :class:`~urllib3.fields.RequestField`).
    66                             
    67                                 :param boundary:
    68                                     If not specified, then a random boundary will be generated using
    69                                     :func:`mimetools.choose_boundary`.
    70                                 """
    71    284.1 MiB      0.0 MiB       body = BytesIO()
    72    284.1 MiB      0.0 MiB       if boundary is None:
    73    284.2 MiB      0.1 MiB           boundary = choose_boundary()
    74                             
    75   1751.8 MiB   1467.6 MiB       for field in iter_field_objects(fields):
    76    284.2 MiB  -1467.6 MiB           body.write(b('--%s\r\n' % (boundary)))
    77                             
    78    284.2 MiB      0.0 MiB           writer(body).write(field.render_headers())
    79    284.2 MiB      0.0 MiB           data = field.data
    80                             
    81    284.2 MiB      0.0 MiB           if isinstance(data, int):
    82                                         data = str(data)  # Backwards compatibility
    83                             
    84    284.2 MiB      0.0 MiB           if isinstance(data, six.text_type):
    85                                         writer(body).write(data)
    86                                     else:
    87   1751.8 MiB   1467.6 MiB               body.write(data)
    88                             
    89   1751.8 MiB      0.0 MiB           body.write(b'\r\n')
    90                             
    91   1751.8 MiB      0.0 MiB       body.write(b('--%s--\r\n' % (boundary)))
    92                             
    93   1751.8 MiB      0.0 MiB       content_type = str('multipart/form-data; boundary=%s' % boundary)
    94                             
    95   2402.7 MiB    650.9 MiB       return body.getvalue(), content_type

In "cold" start state, it even triple it.

Also, what you think about not storing encoded copy, but making changes in-place, using mutable container.

Axik avatar May 15 '15 09:05 Axik

note that the requests toolbelt has a Streaming Multipart Data Encoder that may be interesting here.

Lukasa avatar May 15 '15 10:05 Lukasa

thanks will try it just now

Axik avatar May 15 '15 10:05 Axik

I really should make that requests agnostic and land it in urllib3. I think even urllib3 users would find it beneficial and it already heavily leverages some of urllib3's internals

sigmavirus24 avatar May 15 '15 13:05 sigmavirus24

@Lukasa it works. Thank you.

@sigmavirus24 , @shazow Should I make PR to change behaviour to using this MultiPart Encoder if fileobject was provided?

Axik avatar May 15 '15 14:05 Axik

@Axik that would take a great deal more work than you might realize. Also, the current default behaviour suits most users perfectly well so I don't think it's necessary

sigmavirus24 avatar May 15 '15 14:05 sigmavirus24

+100 to bringing the streaming encoder into urllib3.

shazow avatar May 15 '15 18:05 shazow

(Related to #51, still would love to have this.)

shazow avatar Jun 24 '16 14:06 shazow

FYI I'm working on this right now. Will submit a PR once I get a bigger chunk existing of tests passing.

sethmlarson avatar Nov 06 '16 05:11 sethmlarson

@SethMichaelLarson what exactly are you doing?

sigmavirus24 avatar Nov 06 '16 12:11 sigmavirus24

@sigmavirus24 Taking the old branch that was mentioned in #51, bringing it up to date with the module, adding Python 2 and 3 compatibility, fix all tests, add new tests. Currently have all but 4 tests currently passing on the branch.

sethmlarson avatar Nov 06 '16 13:11 sethmlarson

@SethMichaelLarson the resolution was to bring the code from requests-toolbelt into urllib3. Please re-read the issue carefully.

sigmavirus24 avatar Nov 06 '16 14:11 sigmavirus24

@sigmavirus24 I thought I read the issue closely but I will re-read. The branch defines a MultipartEncoderGenerator object that is used in the same way that the tool-belt code would be used, I'll submit a PR today to show the work being done if you would like?

sethmlarson avatar Nov 06 '16 14:11 sethmlarson

@sethmlarson Can you please clarify why this is a breaking change? https://github.com/urllib3/urllib3/pull/1030 mentions that it should be deferred to v2, but I don't understand why.

pquentin avatar Nov 24 '20 05:11 pquentin

@pquentin I guess depending on how we accomplish this one it doesn't necessarily have to be a breaking change. We can deprecate encode_multipart_formdata in favor of a function that produces a BytesIO compatible body that can be .tell() for content-length and .seek() for rewinding. Then we can start using this function ourselves.

I think the original thought was to keep the same function and change the return value hence the breaking change but we don't have to do that?

  • Take encoder from toolbelt, Ian will relicense this code to MIT
  • Rewindability needs to be investigated more, maybe constraining the input would help?
    • If object supports seek() and tell() those APIs must work, then can use auto-detection

sethmlarson avatar Dec 16 '20 13:12 sethmlarson

I'd like to work on this.

I'll start from defining the scope and coming up with a strategy today.

Looks like there were multiple attempts to address this. @sethmlarson, could you give some insights on why #1030 was closed?

Bobronium avatar Jun 27 '22 11:06 Bobronium

It was closed because we wanted to save it for the v2 branch that has died five years ago now, not because of anything specific in #1030 itself.

pquentin avatar Jun 27 '22 12:06 pquentin

However the preferred path now is getting https://github.com/urllib3/urllib3/pull/2331 merged. It's very close but not quite there yet. I think @sigmavirus24 can't accept the bounty, so we might be okay to award a part of it to whoever gets that PR merged.

pquentin avatar Jun 27 '22 12:06 pquentin

However the preferred path now is getting https://github.com/urllib3/urllib3/pull/2331 merged.

Oh, I've totally missed that.

so we might be okay to award a part of it to whoever gets that PR merged.

I'm up for this. What can I do to get it merged (considering I don't have push permissions on sigmavirus24:add-multipart-formdata-encoder)?

Bobronium avatar Jun 27 '22 12:06 Bobronium

Please open a new pull request with "Closes #2331, Closes #624" in the description.

I'm not 100% sure about the bounty, so if that's an issue you may prefer to wait for @sigmavirus24 and @sethmlarson to confirm.

pquentin avatar Jun 27 '22 12:06 pquentin

Getting bounty is certainly desirable :)

Bobronium avatar Jun 27 '22 15:06 Bobronium