stanza Handle chunked responses when downloading resources

Describe the bug stanza.download() fails to download resources from a host that sends a chunked response.

In [1]: import stanza

In [2]: stanza.download('en')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-c2f724e525cb> in <module>
----> 1 stanza.download('en')

/usr/local/lib/python3.8/dist-packages/stanza/resources/common.py in download(lang, model_dir, package, processors, logging_level, verbose, resources_url, resources_branch, resources_version, model_url, proxies, download_json)
    577         if not download_json:
    578             logger.warning("Asked to skip downloading resources.json, but the file does not exist.  Downloading anyway")
--> 579         download_resources_json(model_dir, resources_url, resources_branch, resources_version, resources_filepath=None, proxies=proxies)
    580
    581     resources = load_resources_json(model_dir)

/usr/local/lib/python3.8/dist-packages/stanza/resources/common.py in download_resources_json(model_dir, resources_url, resources_branch, resources_version, resources_filepath, proxies)
    457         resources_filepath = os.path.join(model_dir, 'resources.json')
    458     # make request
--> 459     request_file(
    460         resources_url,
    461         resources_filepath,

/usr/local/lib/python3.8/dist-packages/stanza/resources/common.py in request_file(url, path, proxies, md5, raise_for_status, log_info, alternate_md5)
    155     with tempfile.TemporaryDirectory(dir=basedir) as temp:
    156         temppath = os.path.join(temp, os.path.split(path)[-1])
--> 157         download_file(url, temppath, proxies, raise_for_status)
    158         os.replace(temppath, path)
    159     assert_file_exists(path, md5, alternate_md5)

/usr/local/lib/python3.8/dist-packages/stanza/resources/common.py in download_file(url, path, proxies, raise_for_status)
    121         r.raise_for_status()
    122     with open(path, 'wb') as f:
--> 123         file_size = int(r.headers.get('content-length'))
    124         default_chunk_size = 131072
    125         desc = 'Downloading ' + url

TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

In [3]:

The problem can occur when downloading a resource via stanza.download() from a custom host (set via STANZA_RESOURCES_URL):

In [3]: import os

In [4]: not os.environ['STANZA_RESOURCES_URL'].startswith('https://raw.githubusercontent.com/')
Out[4]: True

In [5]:

download_file() unconditionally parses the HTTP Content-Length header into an integer, although the value is only used to optionally display a progress bar. However, if the server chooses to send a chunked response, the response cannot, and therefore does not, contain a Content-Length header, so r.headers.get('content-length') returns None. Passing None into int() raises the TypeError shown above.
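
A minimal sketch of how the header could be treated as optional, assuming the size is only needed for display. This is not the fix that was pushed to stanza; the helper names content_length_or_none and save_chunks are hypothetical, and a plain dict with lowercase keys stands in for the case-insensitive headers object that requests provides.

```python
import io
from typing import BinaryIO, Iterable, Mapping, Optional


def content_length_or_none(headers: Mapping[str, str]) -> Optional[int]:
    """Return Content-Length as an int, or None when the header is absent.

    A chunked response carries no Content-Length, so None is a normal
    outcome here rather than an error. (Hypothetical helper.)
    """
    value = headers.get("content-length")
    if value is None:
        return None
    return int(value)


def save_chunks(chunks: Iterable[bytes], out: BinaryIO, total: Optional[int]) -> int:
    """Write all chunks to `out`; `total` only affects the progress label."""
    written = 0
    for chunk in chunks:
        out.write(chunk)
        written += len(chunk)
        if total is not None:
            label = f"{written}/{total} bytes"
        else:
            label = f"{written} bytes"  # size unknown: show a running count
        print(f"\rDownloading: {label}", end="")
    print()
    return written
```

With this shape, the download itself never depends on the header; only the progress display changes when the total is unknown.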

Since requests is an HTTP/1.1-only client, chunked responses cannot be avoided by making HTTP/1.0 requests.
All HTTP/1.1-compliant clients are required to handle chunked responses, and a client cannot ask the server to disable them.

To Reproduce Steps to reproduce the behavior:

1. Define a server via STANZA_RESOURCES_URL that sends the resources_1.x.y.json in a chunked response
2. python3 -c 'import stanza; stanza.download("en")'
3. See stack trace
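
For step 1, a local stand-in server can be sketched with the Python standard library alone. This is an illustration under assumptions, not the reporter's setup: ChunkedHandler, fetch_from_chunked_server and the PAYLOAD body are hypothetical, and the payload is a placeholder rather than a real resources.json. The sketch starts a server on a free loopback port, requests a file from it, and returns the response headers and body so the missing Content-Length can be observed.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAYLOAD = b'{"placeholder": "not a real resources.json"}'


class ChunkedHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # chunked transfer coding needs HTTP/1.1

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Transfer-Encoding", "chunked")
        self.send_header("Connection", "close")  # one request per connection
        self.end_headers()
        # Each chunk is: size in hex, CRLF, data, CRLF.
        for i in range(0, len(PAYLOAD), 8):
            chunk = PAYLOAD[i:i + 8]
            self.wfile.write(b"%x\r\n%s\r\n" % (len(chunk), chunk))
        self.wfile.write(b"0\r\n\r\n")  # zero-length chunk ends the body

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_from_chunked_server():
    """Serve PAYLOAD chunked on a free local port and fetch it once."""
    server = HTTPServer(("127.0.0.1", 0), ChunkedHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/resources.json")
        resp = conn.getresponse()
        headers = {k.lower(): v for k, v in resp.getheaders()}
        body = resp.read()  # http.client decodes the chunks
        return headers, body
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
```

Calling fetch_from_chunked_server() returns headers that include transfer-encoding but no content-length. Pointing STANZA_RESOURCES_URL at a long-running variant of such a server would presumably exercise the same code path as the report, although that has not been verified here.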

Expected behavior Downloads of resources should work from HTTP/1.1 compliant servers.

Environment (please complete the following information):

OS: Ubuntu 20.04, macOS 15.2
Python version: Python 3.8.10, Python 3.13.1
Stanza version: 1.4.0, 1.10.1

Additional context We are using stanza in an enterprise setting and can only download resources from a centralized caching server.

Jan 03 '25 14:01 0x64746b

Thank you for reporting. So basically the constant is unnecessary aside from display? Should be an easy fix.

Do you have an example server I could try to verify the fix?


Jan 03 '25 19:01 AngledLuffa

Have pushed a fix to the dev branch.

How urgent is this fix, and can you use the dev branch in your environment, or are only released versions possible? It's actually not super stressful to make a new release as long as the models haven't changed, and nothing has changed in the last week or so.

I'm looking into adding a test server framework which will let us unit test the downloads to catch things like this, possibly https://github.com/csernazs/pytest-httpserver

Jan 04 '25 06:01 AngledLuffa

Thanks for the super quick response!

> Do you have an example server I could try to verify the fix?

Unfortunately, the server I'm using is only reachable from within the company's network. However, I should be able to validate the fix in a non-productive environment next week.

> How urgent is this fix

I have implemented a hacky workaround (by preloading all resources via cURL and disabling all downloads through stanza), so we currently aren't blocked and are happy to wait for the proper release.

Thanks again for the great response! dtk

Jan 04 '25 10:01 0x64746b

I can indeed confirm that the fix works for us:

(stanza-dev) stanza@f66684eeb8db:/tmp$ python3
Python 3.8.10 (default, Nov  7 2024, 13:10:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import stanza
>>> stanza.download('en')
Downloading https://<REMOTE>/stanfordnlp/stanza-resources/main/resources_1.4.0.json: 154kB [00:00, 25.6MB/s]
2025-01-06 11:34:11 INFO: Downloaded file to /home/stanza/stanza_resources/resources.json
2025-01-06 11:34:11 INFO: Downloading default packages for language: en (English) ...
Downloading https://<REMOTE>/stanfordnlp/stanza-en/resolve/v1.4.0/models/default.zip: 100%|███████████████████████████████████████████████████████████████████| 479M/479M [00:47<00:00, 10.2MB/s]
2025-01-06 11:35:00 INFO: Downloaded file to /home/stanza/stanza_resources/en/default.zip
2025-01-06 11:35:03 INFO: Finished downloading models and saved to /home/stanza/stanza_resources

That is for both chunked and un-chunked transfers (note the progress bar for the model download above):

>>> import requests
>>> resources_response = requests.get('https://<REMOTE>/stanfordnlp/stanza-resources/main/resources_1.4.0.json')
>>> resources_response.headers.get('transfer-encoding')
'chunked'
>>> 'content-length' in resources_response.headers
False
>>> models_response = requests.get('https://<REMOTE>/stanfordnlp/stanza-en/resolve/v1.4.0/models/default.zip')
>>> models_response.headers.get('content-length')
'479293702'
>>>

Thank you!

Jan 06 '25 12:01 0x64746b

Excellent, glad to hear it

Jan 06 '25 18:01 AngledLuffa

This fix is now in 1.11.0

Oct 05 '25 07:10 AngledLuffa