urllib3
urllib3 copied to clipboard
Twice memory usage in `encode_multipart_formdata`
Basically, if you wants to upload file like 1GB it will cost you 2GB RAM.
Filename: /Users/kyrylo/.virtualenvs/drsdk/lib/python2.7/site-packages/requests/packages/urllib3/filepost.py
Line # Mem usage Increment Line Contents
================================================
59 981.3 MiB 0.0 MiB @profile
60 def encode_multipart_formdata(fields, boundary=None):
61 """
62 Encode a dictionary of ``fields`` using the multipart/form-data MIME format.
63
64 :param fields:
65 Dictionary of fields or list of (key, :class:`~urllib3.fields.RequestField`).
66
67 :param boundary:
68 If not specified, then a random boundary will be generated using
69 :func:`mimetools.choose_boundary`.
70 """
71 981.3 MiB 0.0 MiB body = BytesIO()
72 981.3 MiB 0.0 MiB if boundary is None:
73 981.3 MiB 0.0 MiB boundary = choose_boundary()
74
75 1920.7 MiB 939.4 MiB for field in iter_field_objects(fields):
76 981.3 MiB -939.4 MiB body.write(b('--%s\r\n' % (boundary))) 77
78 981.3 MiB 0.0 MiB writer(body).write(field.render_headers())
79 981.3 MiB 0.0 MiB data = field.data
80
81 981.3 MiB 0.0 MiB if isinstance(data, int):
82 data = str(data) # Backwards compatibility
83
84 981.3 MiB 0.0 MiB if isinstance(data, six.text_type):
85 writer(body).write(data)
86 else:
87 1920.7 MiB 939.4 MiB body.write(data)
88
89 1920.7 MiB 0.0 MiB body.write(b'\r\n')
90
91 1920.7 MiB 0.0 MiB body.write(b('--%s--\r\n' % (boundary)))
92
93 1920.7 MiB 0.0 MiB content_type = str('multipart/form-data; boundary=%s' % boundary)
94
95 2000.1 MiB 79.4 MiB return body.getvalue(), content_type
P.S. using this profiler https://pypi.python.org/pypi/memory_profiler
Requirements from previous attempt
- Add
urllib3.multipartwithMultipartEncoderandMultipartDecoder - Encoder supports a complete rewind with
.seek(0, 0)which attempts to rewind all body parts. - Add
Partclass for representing a body part Partshould match BufferedReader withpeek()andread()for chunksencode_multipart_formdatashould use the API internally.- Verify memory usage with memray
Right, 1x to read the data and 1x to store our encoded copy of it.
We had some discussions about converting this into a generator[0] which would help bring it down to 1x unless you can provide the size upfront. I think there's a prototype or two somewhere, but it was never finished.
Is this something you'd be interested in working on?
[0] A very old filepost-stream branch with some in-progress work. I thought there was a PR somewhere too, but I can't find it.
Filename: /Users/kyrylo/.virtualenvs/drsdk/lib/python2.7/site-packages/requests/packages/urllib3/filepost.py
Line # Mem usage Increment Line Contents
================================================
59 284.1 MiB 0.0 MiB @profile
60 def encode_multipart_formdata(fields, boundary=None):
61 """
62 Encode a dictionary of ``fields`` using the multipart/form-data MIME format.
63
64 :param fields:
65 Dictionary of fields or list of (key, :class:`~urllib3.fields.RequestField`).
66
67 :param boundary:
68 If not specified, then a random boundary will be generated using
69 :func:`mimetools.choose_boundary`.
70 """
71 284.1 MiB 0.0 MiB body = BytesIO()
72 284.1 MiB 0.0 MiB if boundary is None:
73 284.2 MiB 0.1 MiB boundary = choose_boundary()
74
75 1751.8 MiB 1467.6 MiB for field in iter_field_objects(fields):
76 284.2 MiB -1467.6 MiB body.write(b('--%s\r\n' % (boundary)))
77
78 284.2 MiB 0.0 MiB writer(body).write(field.render_headers())
79 284.2 MiB 0.0 MiB data = field.data
80
81 284.2 MiB 0.0 MiB if isinstance(data, int):
82 data = str(data) # Backwards compatibility
83
84 284.2 MiB 0.0 MiB if isinstance(data, six.text_type):
85 writer(body).write(data)
86 else:
87 1751.8 MiB 1467.6 MiB body.write(data)
88
89 1751.8 MiB 0.0 MiB body.write(b'\r\n')
90
91 1751.8 MiB 0.0 MiB body.write(b('--%s--\r\n' % (boundary)))
92
93 1751.8 MiB 0.0 MiB content_type = str('multipart/form-data; boundary=%s' % boundary)
94
95 2402.7 MiB 650.9 MiB return body.getvalue(), content_type
In "cold" start state, it even triple it.
Also, what you think about not storing encoded copy, but making changes in-place, using mutable container.
note that the requests toolbelt has a Streaming Multipart Data Encoder that may be interesting here.
thanks will try it just now
I really should make that requests agnostic and land it in urllib3. I think even urllib3 users would find it beneficial and it already heavily leverages some of urllib3's internals
@Lukasa it works. Thank you.
@sigmavirus24 , @shazow Should I make PR to change behaviour to using this MultiPart Encoder if fileobject was provided?
@Axik that would take a great deal more work than you might realize. Also, the current default behaviour suits most users perfectly well so I don't think it's necessary
+100 to bringing the streaming encoder into urllib3.
(Related to #51, still would love to have this.)
FYI I'm working on this right now. Will submit a PR once I get a bigger chunk existing of tests passing.
@SethMichaelLarson what exactly are you doing?
@sigmavirus24 Taking the old branch that was mentioned in #51, bringing it up to date with the module, adding Python 2 and 3 compatibility, fix all tests, add new tests. Currently have all but 4 tests currently passing on the branch.
@SethMichaelLarson the resolution was to bring the code from requests-toolbelt into urllib3. Please re-read the issue carefully.
@sigmavirus24 I thought I read the issue closely but I will re-read. The branch defines a MultipartEncoderGenerator object that is used in the same way that the tool-belt code would be used, I'll submit a PR today to show the work being done if you would like?
@sethmlarson Can you please clarify why this is a breaking change? https://github.com/urllib3/urllib3/pull/1030 mentions that it should be deferred to v2, but I don't understand why.
@pquentin I guess depending on how we accomplish this one it doesn't necessarily have to be a breaking change. We can deprecate encode_multipart_formdata in favor of a function that produces a BytesIO compatible body that can be .tell() for content-length and .seek() for rewinding. Then we can start using this function ourselves.
I think the original thought was to keep the same function and change the return value hence the breaking change but we don't have to do that?
- Take encoder from toolbelt, Ian will relicense this code to MIT
- Rewindability needs to be investigated more, maybe constraining the input would help?
- If object supports seek() and tell() those APIs must work, then can use auto-detection
I'd like to work on this.
I'll start from defining the scope and coming up with a strategy today.
Looks like there were multiple attempts to address this. @sethmlarson, could you give some insights on why #1030 was closed?
It was closed because we wanted to save it for the v2 branch that has died five years ago now, not because of anything specific in #1030 itself.
However the preferred path now is getting https://github.com/urllib3/urllib3/pull/2331 merged. It's very close but not quite there yet. I think @sigmavirus24 can't accept the bounty, so we might be okay to award a part of it to whoever gets that PR merged.
However the preferred path now is getting https://github.com/urllib3/urllib3/pull/2331 merged.
Oh, I've totally missed that.
so we might be okay to award a part of it to whoever gets that PR merged.
I'm up for this. What can I do to get it merged (considering I don't have push permissions on sigmavirus24:add-multipart-formdata-encoder)?
Please open a new pull request with "Closes #2331, Closes #624" in the description.
I'm not 100% sure about the bounty, so if that's an issue you may prefer to wait for @sigmavirus24 and @sethmlarson to confirm.
Getting bounty is certainly desirable :)