When stream=True, iter_content(chunk_size=None) reads the input as a single big chunk
According to the documentation, when stream=True, iter_content(chunk_size=None) "will read data as it arrives in whatever size the chunks are received", but it actually collects the entire input into a single big bytes object, consuming large amounts of memory and entirely defeating the purpose of iter_content().
Expected Result
iter_content(chunk_size=None) yields "data as it arrives in whatever size the chunks are received".
Actual Result
A single big chunk
Reproduction Steps
from requests import get
URL = 'https://dl.fedoraproject.org/pub/alt/iot/32/IoT/x86_64/images/Fedora-IoT-32-20200603.0.x86_64.raw.xz'
r = get(URL, stream=True)
for b in r.iter_content(chunk_size=None):
    print(len(b))
prints:
533830860
System Information
$ python -m requests.help
{
  "chardet": {
    "version": "3.0.4"
  },
  "cryptography": {
    "version": "2.9.2"
  },
  "idna": {
    "version": "2.9"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.7.6"
  },
  "platform": {
    "release": "4.19.104+",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "1010107f",
    "version": "19.1.0"
  },
  "requests": {
    "version": "2.23.0"
  },
  "system_ssl": {
    "version": "1010107f"
  },
  "urllib3": {
    "version": "1.25.9"
  },
  "using_pyopenssl": true
}
chunk_size=None, as you've quoted, relies on the size of the data as sent by the server. If the server is sending everything all at once and it's all on the socket, what do you expect the library to do differently?
@sigmavirus24 I don't think the server sends the file all at once. The example above produces no output for ~30 seconds and then prints 533830860. This starts printing right away:
from requests import get
URL = 'https://dl.fedoraproject.org/pub/alt/iot/32/IoT/x86_64/images/Fedora-IoT-32-20200603.0.x86_64.raw.xz'
r = get(URL, stream=True)
for b in r.iter_content(chunk_size=2**23):
    print(len(b))
I have the same issue with 2.24.0
When I use a chunk_size of 1, I get the expected output but with a huge overhead.
Can confirm the same is occurring.
Works with chunk_size=1, hangs with None.
Can try to put together a reproducible example if that's helpful?
As promised, here's a reproducible example against httpbin.org:
import requests
chunk_size = None
URL = 'https://httpbin.org/drip?duration=2'
r = requests.get(URL, stream=True)
for x in r.iter_content(chunk_size=chunk_size):
    print(f'response: {x}')
Run this and you'll see that iter_content waits until the request is fully complete to return anything.
Change the chunk_size to 1 and everything works nicely (albeit with high overhead).
If somebody can point me in the right direction, I'm happy to investigate this and do what is required to fix it.
Any resolution to this? I am also still seeing this on v2.25.1
Hi @stephen-goveia, this is a behavior in urllib3 as noted in urllib3/urllib3#2123. We aren't able to change it in Requests, so the outcome will be determined by whether this makes it into the urllib3 v2 release.
thanks @nateprewitt!
Hi. I don't understand why this issue is still open. Here is a link to the official documentation.
chunk_size must be of type int or None. A value of None will function differently depending on the value of stream. stream=True will read data as it arrives in whatever size the chunks are received. If stream=False, data is returned as a single chunk.
Even after setting stream=True this is still an issue:
import requests
import time
chunk_size = None
URL = 'https://httpbin.org/drip?duration=20&numbytes=4'
r = requests.get(URL, stream=True)
t = time.monotonic()
for x in r.iter_content(chunk_size=chunk_size):
    t2 = time.monotonic()
    print(f'{t2 - t}')
    t = time.monotonic()
prints:
15.593049310147762
Please keep in mind that I'm making this comment as a user, not as a contributor.
You're right, it is... but please read the documentation. All I'm saying is that the documentation is clear enough (or at least it is today):
When stream=True is set on the request, this avoids reading the content at once into memory for large responses
What should the module do when you ask not to download everything at once but to download "Nothing"? Should it throw an error? Should it not download anything at all?
Just check the content-length header and set a suitable chunk size when dealing with large files.
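For what it's worth, a minimal sketch of that approach (it reuses the Fedora URL from the original report; the "roughly 100 chunks" heuristic and the 64 KiB / 8 MiB bounds are arbitrary choices for illustration, not anything the library prescribes):

import requests

URL = 'https://dl.fedoraproject.org/pub/alt/iot/32/IoT/x86_64/images/Fedora-IoT-32-20200603.0.x86_64.raw.xz'

r = requests.get(URL, stream=True)
# Content-Length gives the total body size in bytes; it is absent for chunked responses.
total = int(r.headers.get('content-length', 0))
# Aim for roughly 100 chunks, clamped between 64 KiB and 8 MiB (arbitrary bounds).
chunk_size = min(max(total // 100, 64 * 1024), 8 * 1024 * 1024) if total else 64 * 1024
for chunk in r.iter_content(chunk_size=chunk_size):
    print(len(chunk))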
It is not only about large files, it is also about SSE (server-sent events). They are streamed, and clients expect them to arrive directly after the server sends them.
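For illustration, a minimal sketch of an SSE consumer built on requests (the endpoint URL is hypothetical, and chunk_size=1 is only there as the overhead-heavy workaround discussed above, since the default buffering holds events back until the response ends):

import requests

# Hypothetical SSE endpoint; real servers emit lines like "data: ..." separated by blank lines.
URL = 'https://example.com/events'

r = requests.get(URL, stream=True, headers={'Accept': 'text/event-stream'})
# chunk_size=1 forces a yield per byte, so events are delivered promptly at the cost of overhead.
for line in r.iter_lines(chunk_size=1):
    if line.startswith(b'data:'):
        print(line[len(b'data:'):].strip().decode('utf-8'))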
No movement on this in ~8 months... Any update?
Possible workaround using Response.raw.stream(); it seems to work on my end:
resp = requests.get("something", stream=True)
for chunk in resp.raw.stream():
print(f"chunk size: {len(chunk)}")
@mbhynes Not sure what you were doing to have that "work", but it certainly doesn't do what I'd expect...
import requests
url = "https://httpbin.org/drip?duration=2&numbytes=8"
resp = requests.get(url, stream=True)
for chunk in resp.raw.stream():
print(f"chunk size: {len(chunk)}")
just gives me a single 8-byte chunk back after 2 seconds, rather than 8 single-byte chunks every few hundred milliseconds.
I'd assume your endpoint happens to be returning the data via "chunked transfer encoding", which has been able to handle streaming data in chunks for a long time already; you could check by doing:
print(resp.headers.get("transfer-encoding"))
That said, I've created a pull request against urllib3 (https://github.com/urllib3/urllib3/pull/3186) that can be built on to enable streaming in cases like this, and I'd hope it would allow the normal iter_content method to yield data in appropriately sized chunks.