Handle chunked responses when downloading resources
Describe the bug
stanza.download() fails to download resources from a host that sends a chunked response.
In [1]: import stanza
In [2]: stanza.download('en')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-2-c2f724e525cb> in <module>
----> 1 stanza.download('en')
/usr/local/lib/python3.8/dist-packages/stanza/resources/common.py in download(lang, model_dir, package, processors, logging_level, verbose, resources_url, resources_branch, resources_version, model_url, proxies, download_json)
577 if not download_json:
578 logger.warning("Asked to skip downloading resources.json, but the file does not exist. Downloading anyway")
--> 579 download_resources_json(model_dir, resources_url, resources_branch, resources_version, resources_filepath=None, proxies=proxies)
580
581 resources = load_resources_json(model_dir)
/usr/local/lib/python3.8/dist-packages/stanza/resources/common.py in download_resources_json(model_dir, resources_url, resources_branch, resources_version, resources_filepath, proxies)
457 resources_filepath = os.path.join(model_dir, 'resources.json')
458 # make request
--> 459 request_file(
460 resources_url,
461 resources_filepath,
/usr/local/lib/python3.8/dist-packages/stanza/resources/common.py in request_file(url, path, proxies, md5, raise_for_status, log_info, alternate_md5)
155 with tempfile.TemporaryDirectory(dir=basedir) as temp:
156 temppath = os.path.join(temp, os.path.split(path)[-1])
--> 157 download_file(url, temppath, proxies, raise_for_status)
158 os.replace(temppath, path)
159 assert_file_exists(path, md5, alternate_md5)
/usr/local/lib/python3.8/dist-packages/stanza/resources/common.py in download_file(url, path, proxies, raise_for_status)
121 r.raise_for_status()
122 with open(path, 'wb') as f:
--> 123 file_size = int(r.headers.get('content-length'))
124 default_chunk_size = 131072
125 desc = 'Downloading ' + url
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
In [3]:
The problem can occur when downloading a resource via stanza.download() from a custom host (set via STANZA_RESOURCES_URL)
In [3]: import os
In [4]: not os.environ['STANZA_RESOURCES_URL'].startswith('https://raw.githubusercontent.com/')
Out[4]: True
In [5]:
download_file() unconditionally parses the HTTP Content-Length header into an integer, which is only used to optionally display a progress bar. However, a server that sends a chunked response cannot include a Content-Length header, so r.headers.get('content-length') returns None. Passing None into int() raises the TypeError above.
- Since requests is an HTTP/1.1-only client, chunked responses cannot be avoided by making HTTP/1.0 requests.
- All HTTP/1.1 compliant clients are required to handle chunked responses. They cannot be disabled.
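A minimal sketch of how a missing header could be tolerated. This is not stanza's actual code; parse_content_length is a hypothetical helper name, and the headers are modelled as a plain mapping for illustration:

```python
from typing import Mapping, Optional


def parse_content_length(headers: Mapping[str, str]) -> Optional[int]:
    """Return the declared body size in bytes, or None if absent or malformed.

    Hypothetical helper: a chunked response carries no Content-Length,
    so the caller has to cope with an unknown size.
    """
    value = headers.get("content-length")
    if value is None:
        # Chunked transfer: the total size is not known up front.
        return None
    try:
        size = int(value)
    except ValueError:
        # Malformed header: treat it like an unknown size.
        return None
    return size if size >= 0 else None
```

With requests, r.headers is case-insensitive, so the lowercase lookup also matches Content-Length; a plain dict is not. The result could be passed as the total of a tqdm progress bar, which accepts None and then shows a running byte count instead of a percentage.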
To Reproduce
Steps to reproduce the behavior:
- Define a server via STANZA_RESOURCES_URL that sends the resources_1.x.y.json in a chunked response
- python3 -c 'import stanza; stanza.download("en")'
- See stack trace
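For the first step, a local stand-in for such a server could look roughly like the sketch below. It uses only the Python standard library. ChunkedHandler, start_server and the placeholder payload are made up for illustration and are not a real resources file, so it would only exercise the header handling of the first request:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Placeholder body (assumption): not a valid stanza resources file.
PAYLOAD = json.dumps({"placeholder": True}).encode("utf-8")


class ChunkedHandler(BaseHTTPRequestHandler):
    # Chunked transfer encoding only exists in HTTP/1.1.
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        # Answer every path with the same body, without Content-Length.
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Transfer-Encoding", "chunked")
        self.end_headers()
        for start in range(0, len(PAYLOAD), 8):
            chunk = PAYLOAD[start:start + 8]
            # Each chunk: hex length, CRLF, data, CRLF.
            self.wfile.write(b"%X\r\n%s\r\n" % (len(chunk), chunk))
        # A zero-length chunk terminates the body.
        self.wfile.write(b"0\r\n\r\n")

    def log_message(self, format, *args):
        pass  # keep the console quiet


def start_server(port=0):
    """Serve on 127.0.0.1 in a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), ChunkedHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

After server = start_server(), the chosen port is available as server.server_address[1], and STANZA_RESOURCES_URL could be pointed at http://127.0.0.1:<port> for a local check.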
Expected behavior
Downloads of resources should work from HTTP/1.1 compliant servers.
Environment (please complete the following information):
- OS: Ubuntu 20.04, MacOS 15.2
- Python version: Python 3.8.10, Python 3.13.1
- Stanza version: 1.4.0, 1.10.1
Additional context
We are using stanza in an enterprise setting and can only download resources from a centralized caching server.
Thank you for reporting. So basically the constant is unnecessary aside from display? Should be an easy fix.
Do you have an example server I could try to verify the fix?
Have pushed a fix to the dev branch.
How urgent is this fix / can you use the dev branch in your environment? Or only released versions possible? It's actually not super stressful to make a new release as long as the models haven't changed, and nothing's changed in the last week or so
I'm looking into adding a test server framework which will let us unit test the downloads to catch things like this, possibly https://github.com/csernazs/pytest-httpserver
Thanks for the super quick response!
> Do you have an example server I could try to verify the fix?
Unfortunately, the server I'm using is only reachable from within the company's network. However, I should be able to validate the fix in a non-productive environment next week.
> How urgent is this fix
I have implemented a hacky workaround (by preloading all resources via cURL and disabling all downloads through stanza), so we currently aren't blocked and are happy to wait for the proper release.
Thanks again for the great response! dtk
I can indeed confirm that the fix works for us:
(stanza-dev) stanza@f66684eeb8db:/tmp$ python3
Python 3.8.10 (default, Nov 7 2024, 13:10:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import stanza
>>> stanza.download('en')
Downloading https://<REMOTE>/stanfordnlp/stanza-resources/main/resources_1.4.0.json: 154kB [00:00, 25.6MB/s]
2025-01-06 11:34:11 INFO: Downloaded file to /home/stanza/stanza_resources/resources.json
2025-01-06 11:34:11 INFO: Downloading default packages for language: en (English) ...
Downloading https://<REMOTE>/stanfordnlp/stanza-en/resolve/v1.4.0/models/default.zip: 100%|███████████████████████████████████████████████████████████████████| 479M/479M [00:47<00:00, 10.2MB/s]
2025-01-06 11:35:00 INFO: Downloaded file to /home/stanza/stanza_resources/en/default.zip
2025-01-06 11:35:03 INFO: Finished downloading models and saved to /home/stanza/stanza_resources
This holds for both chunked and non-chunked transfers (note the progress bar with a percentage for the model download above, where Content-Length is present):
>>> resources_response = requests.get('https://<REMOTE>/stanfordnlp/stanza-resources/main/resources_1.4.0.json')
>>> resources_response.headers.get('transfer-encoding')
'chunked'
>>> 'content-length' in resources_response.headers
False
>>> models_response = requests.get('https://<REMOTE>/stanfordnlp/stanza-en/resolve/v1.4.0/models/default.zip')
>>> models_response.headers.get('content-length')
'479293702'
>>>
Thank you!
Excellent, glad to hear it
This fix is now in 1.11.0