Content-Range header for multiple part request
I'm developing a library to download individual files from tar archives stored in Hugging Face repositories:
- docs and source: https://deepghs.github.io/hfutils/main/api_doc/index/fetch.html#hf-tar-file-download

It relies on the `Range` header of the HTTP request: downloading a tar archive with `Range: bytes=xxx-yyy` fetches only the specific file instead of the full archive.

In some cases we need to download many files from different tar archives, and many of them come from the same archive. So I'm considering using `Range: bytes=xxx-yyy,zzz-ttt` to download all of them with a single HTTP request. This could greatly improve batch-download performance and also reduce the load on the Hugging Face CDN.
But in my tests, when using multipart ranges, the `Content-Range` header seems to be missing from the response.
```python
from pprint import pprint

import requests

resp = requests.get(
    'https://huggingface.co/datasets/deepghs/yande_full/resolve/main/images/0008.tar',
    headers={
        'Range': 'bytes=0-99,1200-1369',
    },
)
print(resp)
pprint(dict(resp.headers))
print(len(resp.content))
```
The output looks like this; no `Content-Range` is found. The total content length seems okay, but I don't know the byte range of each part.
```
<Response [206]>
{'Accept-Ranges': 'bytes',
 'Connection': 'keep-alive',
 'Content-Disposition': "attachment; filename*=UTF-8''0008.tar; "
                        'filename="0008.tar";',
 'Content-Length': '570',
 'Content-Type': 'multipart/byteranges; '
                 'boundary=CloudFront:8C171B3C6DAD1DF1040C2DA33E27D04D',
 'Date': 'Wed, 24 Apr 2024 14:57:57 GMT',
 'ETag': '"820b63e3250678f8217c157c8b557712-135"',
 'Last-Modified': 'Sat, 20 Apr 2024 15:06:38 GMT',
 'Server': 'AmazonS3',
 'Vary': 'Origin',
 'Via': '1.1 db3cc869e0dda88ce4fa37dee230e06e.cloudfront.net (CloudFront)',
 'X-Amz-Cf-Id': 'VToeCDfStyG6NtjMCRVWdUqbHvojrQN8a29nE-tgh0zbMNF_80DMEg==',
 'X-Amz-Cf-Pop': 'TXL50-P6',
 'X-Cache': 'RefreshHit from cloudfront',
 'x-amz-server-side-encryption': 'AES256',
 'x-amz-storage-class': 'INTELLIGENT_TIERING'}
570
```
This header information is really important. Can it be added? Or is there an alternative way to download multiple parts in one request and save each part to a different file?
Cool idea to implement a lazy tar parser on top of HF Hub!! What's the context/goals there?
Re. support for multiple ranges in a single Range request, I think I remember @Kakulukian took a look at this at some point (was this you @Kakulukian?)
@julien-c
In essence, the idea (detailed in the `hfutils.index` module) is to create an index for each tar file, recording the offsets, sizes, and hashes of all files within it. This makes it possible to download a specific file with `Range: bytes=xxx-yyy` and verify its integrity on retrieval, as in the sketch below.
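A minimal sketch of that access pattern (the function name and index fields here are illustrative placeholders, not the real hfutils API):

```python
import hashlib

import requests


def fetch_indexed_file(tar_url: str, offset: int, size: int, sha256: str) -> bytes:
    """Fetch one member of a remote tar with a single Range request and
    verify it against the hash recorded in the index.

    The (offset, size, sha256) triple is assumed to come from a prebuilt
    index; the exact schema is illustrative, not hfutils' real layout.
    """
    resp = requests.get(
        tar_url,
        headers={'Range': f'bytes={offset}-{offset + size - 1}'},
    )
    resp.raise_for_status()
    assert resp.status_code == 206, 'server ignored the Range header'
    data = resp.content
    if hashlib.sha256(data).hexdigest() != sha256:
        raise ValueError('downloaded bytes do not match the indexed hash')
    return data
```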
Our requirement is to quickly retrieve a set of specific files from datasets on Hugging Face. These datasets typically comprise many (e.g., 1k) tar archives, each containing many image files; the archive in which an image resides is determined by the image's id modulo 1000. One such dataset is nyanko7/danbooru2023, which contains roughly 8 million images spread across 2k+ archive files.

In practice, we often begin by querying images based on metadata such as tags, obtaining a list of required image ids (often over 1k, sometimes exceeding 100k), and then fetch all images by id to build a dataset. For this purpose we're developing a library called cheesechaser. Though still a work in progress, it already supports the aforementioned danbooru2023 dataset. In our current tests, downloading 10k specified images (with consecutive ids spread across 1000 archive files), totaling approximately 18 GB, took about 17 minutes using 12 threads and roughly 10k download requests. This performance is satisfactory: it is significantly faster than downloading and decompressing the complete tar archives (approximately 9 TB), with minimal local disk usage.

However, we've identified room for improvement. Primarily, because of the large number of download requests and the relatively small file sizes, most of the time is spent establishing connections rather than downloading. Additionally, as the number of downloaded files grows, the volume of requests strains Hugging Face's CDN. Supporting multi-part range requests could therefore significantly boost performance and reduce pressure on the CDN by fetching multiple files from the same archive in a single request.
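(As a side note on the connection overhead: a pooled `requests.Session` reuses the TCP/TLS connection across requests to the same host, which removes part of the per-request handshake cost. A minimal sketch, independent of multipart support:)

```python
import requests

# One pooled session per worker thread: the TCP/TLS handshake is paid once
# and then reused across many small Range requests to the same CDN host.
session = requests.Session()


def fetch_range(url: str, start: int, end: int) -> bytes:
    resp = session.get(url, headers={'Range': f'bytes={start}-{end}'})
    assert resp.status_code == 206
    return resp.content
```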
Furthermore, after raising this issue and attempting to use multi-part ranges, we ran into more problems:
- Response times are far worse than expected. When requesting multiple ranges, especially 3-4 widely spaced ones, responses take excessively long, even when the total content is only a few hundred bytes.
- The response format is unclear, which makes stable decoding difficult:
  - Response headers lack a consistent `Content-Range`.
  - When I read the response body, it has some internal structure, but that structure varies across runtime environments for the same request; sometimes the entire archive file is returned.
When you request multiple ranges, the response uses the `multipart/byteranges` content type, which includes a boundary. Each range corresponds to a block separated by this boundary, with its own `Content-Range` header (https://www.rfc-editor.org/rfc/rfc7233#page-21). A parsing sketch follows the example below.
For example, for your request:

```
GET https://huggingface.co/datasets/deepghs/yande_full/resolve/main/images/0008.tar HTTP/1.1
Range: bytes=0-99,1200-1369
```

Response:

```
HTTP/1.1 206 Partial Content
Content-Length: 570
Content-Type: multipart/byteranges; boundary=CloudFront:725CE26A0B74DDB74002A7B61F84A558

--CloudFront:725CE26A0B74DDB74002A7B61F84A558
Content-Type: application/x-tar
Content-Range: bytes 0-99/2146662400

././@PaxHeader
--CloudFront:725CE26A0B74DDB74002A7B61F84A558
Content-Type: application/x-tar
Content-Range: bytes 1200-1369/2146662400

ustar00runnerdocker00000000000000
--CloudFront:725CE26A0B74DDB74002A7B61F84A558--
```
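On the client side, the part boundaries and per-part `Content-Range` headers can be recovered from the body. A minimal parsing sketch (the split-based approach is my own illustration; it assumes an unquoted boundary parameter, as CloudFront sends):

```python
import re

import requests

URL = 'https://huggingface.co/datasets/deepghs/yande_full/resolve/main/images/0008.tar'

resp = requests.get(URL, headers={'Range': 'bytes=0-99,1200-1369'})
assert resp.status_code == 206

# The boundary lives in the top-level Content-Type header; per RFC 7233 the
# per-part Content-Range headers live inside the body, one block per part.
boundary = re.search(r'boundary=(.+)$', resp.headers['Content-Type']).group(1)
delimiter = b'--' + boundary.encode()

for raw_part in resp.content.split(delimiter)[1:]:
    if raw_part.startswith(b'--'):  # closing delimiter: '--boundary--'
        break
    head, _, body = raw_part.lstrip(b'\r\n').partition(b'\r\n\r\n')
    headers = dict(
        (k.strip(), v.strip())
        for k, v in (line.split(':', 1) for line in head.decode().splitlines() if ':' in line)
    )
    # Slice the body to the exact length given by Content-Range; this drops
    # the trailing CRLF that belongs to the next delimiter.
    first, last = map(int, headers['Content-Range'].split()[1].split('/')[0].split('-'))
    print(headers['Content-Range'], len(body[:last - first + 1]))
```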
> While I attempted to read the response body, it appears to have a certain format internally. However, the format varies across different runtime environments for the same request, sometimes returning the entire archive file.

I just reproduced this.
Reproduction code:
```python
import time
from pprint import pprint

import requests

# ranges to get
ranges = [
    (0, 99),
    (1200, 1369),
    (2000, 2209),
    (2146660100, 2146660200),
]

# get ranges with standalone requests
datas = []
for i, (x, y) in enumerate(ranges):
    start_time = time.time()
    resp = requests.get(
        'https://huggingface.co/datasets/deepghs/yande_full/resolve/main/images/0008.tar',
        headers={
            'Range': f'bytes={x}-{y}'
        },
    )
    print(f'Range {i}, response: {resp!r}, length: {len(resp.content)}, time cost: {time.time() - start_time:.3f}s')
    datas.append(bytes(resp.content))
    assert resp.status_code == 206, f'Should be 206, but {resp.status_code} found!'

# get all the data with one request
start_time = time.time()
resp = requests.get(
    'https://huggingface.co/datasets/deepghs/yande_full/resolve/main/images/0008.tar',
    headers={
        'Range': f'bytes={",".join(map(lambda ix: f"{ix[0]}-{ix[1]}", ranges))}'
    },
)
print(f'Multipart response: {resp!r}')
print(f'Time cost: {time.time() - start_time:.3f}s')
print('Headers:')
pprint(dict(resp.headers))
print(f'Content length: {len(resp.content)}')
assert resp.status_code == 206, f'Should be 206, but {resp.status_code} found!'

# walk the multipart body: locate each header block, read its Content-Range,
# then slice out exactly that many bytes as the part body
full_bytes = resp.content
start_pos = 0
current_i = 0
while True:
    try:
        next_sep = full_bytes.index(b'\r\n\r\n', start_pos)
    except ValueError:
        break
    lines = list(filter(bool, full_bytes[start_pos:next_sep].decode().splitlines(keepends=False)))
    pairs = [line.split(':', maxsplit=1) for line in lines]
    headers = {
        key.strip(): value.strip()
        for key, value in pairs
    }
    start_bytes, end_bytes = headers['Content-Range'].split(' ')[-1].split('/')[0].split('-', maxsplit=1)
    start_bytes, end_bytes = int(start_bytes), int(end_bytes)
    length = end_bytes - start_bytes + 1
    current_data = full_bytes[next_sep + 4:next_sep + 4 + length]
    start_pos = next_sep + 4 + length
    print(f'Multipart, range {current_i}, headers: {headers!r}, byte-ranges: {(start_bytes, end_bytes)}')
    assert current_data == datas[current_i], f'Range {current_i} not match!'
    print(f'Range {current_i} matched!')
    current_i += 1

if current_i < len(datas):
    print(f'Range {list(range(current_i, len(datas)))} not matched!')
else:
    print('Match success!')
```
On my local machine

When I run this in my local environment, the result is as follows (the multipart request is really slow, but the result is correct and the status code is 206 as expected):
```
Range 0, response: <Response [206]>, length: 100, time cost: 2.709s
Range 1, response: <Response [206]>, length: 170, time cost: 2.133s
Range 2, response: <Response [206]>, length: 210, time cost: 2.101s
Range 3, response: <Response [206]>, length: 101, time cost: 2.365s
Multipart response: <Response [206]>
Time cost: 23.916s
Headers:
{'Accept-Ranges': 'bytes',
 'Connection': 'keep-alive',
 'Content-Disposition': "attachment; filename*=UTF-8''0008.tar; "
                        'filename="0008.tar";',
 'Content-Length': '1147',
 'Content-Type': 'multipart/byteranges; '
                 'boundary=CloudFront:E5D729C94A500F62E0C8D8AF02F938EF',
 'Date': 'Thu, 25 Apr 2024 13:33:23 GMT',
 'ETag': '"820b63e3250678f8217c157c8b557712-135"',
 'Last-Modified': 'Sat, 20 Apr 2024 15:06:38 GMT',
 'Server': 'AmazonS3',
 'Vary': 'Origin',
 'Via': '1.1 c1ff362c1118e059b545627964cd2e64.cloudfront.net (CloudFront)',
 'X-Amz-Cf-Id': 'I3Zj3t7Yn0ndSDNb7q9F3-_2700VGin-UGIZK-Ik9dkZmfkY5Um8Jw==',
 'X-Amz-Cf-Pop': 'SFO53-P1',
 'X-Cache': 'Miss from cloudfront',
 'x-amz-server-side-encryption': 'AES256',
 'x-amz-storage-class': 'INTELLIGENT_TIERING'}
Content length: 1147
Multipart, range 0, headers: {'--CloudFront': 'E5D729C94A500F62E0C8D8AF02F938EF', 'Content-Type': 'application/x-tar', 'Content-Range': 'bytes 0-99/2146662400'}, byte-ranges: (0, 99)
Range 0 matched!
Multipart, range 1, headers: {'--CloudFront': 'E5D729C94A500F62E0C8D8AF02F938EF', 'Content-Type': 'application/x-tar', 'Content-Range': 'bytes 1200-1369/2146662400'}, byte-ranges: (1200, 1369)
Range 1 matched!
Multipart, range 2, headers: {'--CloudFront': 'E5D729C94A500F62E0C8D8AF02F938EF', 'Content-Type': 'application/x-tar', 'Content-Range': 'bytes 2000-2209/2146662400'}, byte-ranges: (2000, 2209)
Range 2 matched!
Multipart, range 3, headers: {'--CloudFront': 'E5D729C94A500F62E0C8D8AF02F938EF', 'Content-Type': 'application/x-tar', 'Content-Range': 'bytes 2146660100-2146660200/2146662400'}, byte-ranges: (2146660100, 2146660200)
Range 3 matched!
Match success!
```
My local environment:
```
- huggingface_hub version: 0.22.2
- Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /data/.hf/token
- Has saved token ?: True
- Who am I ?: narugo
- Configured git credential helpers:
- FastAI: N/A
- Tensorflow: N/A
- Torch: N/A
- Jinja2: 3.1.3
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: N/A
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 1.24.4
- pydantic: N/A
- aiohttp: N/A
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /data/.hf/hub
- HF_ASSETS_CACHE: /data/.hf/assets
- HF_TOKEN_PATH: /data/.hf/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
```
On a Hugging Face Space

When I run this code on a Hugging Face Space (a JupyterLab I deployed there), it fails: the entire file is returned.
```
Range 0, response: <Response [206]>, length: 100, time cost: 0.291s
Range 1, response: <Response [206]>, length: 170, time cost: 0.190s
Range 2, response: <Response [206]>, length: 210, time cost: 0.119s
Range 3, response: <Response [206]>, length: 101, time cost: 0.281s
Multipart response: <Response [200]>
Time cost: 23.513s
Headers:
{'Accept-Ranges': 'bytes',
 'Content-Disposition': "attachment; filename*=UTF-8''0008.tar; "
                        'filename="0008.tar";',
 'Content-Length': '2146662400',
 'Content-Type': 'application/x-tar',
 'Date': 'Thu, 25 Apr 2024 13:33:19 GMT',
 'ETag': '"820b63e3250678f8217c157c8b557712-135"',
 'Last-Modified': 'Sat, 20 Apr 2024 15:06:38 GMT',
 'Server': 'AmazonS3',
 'x-amz-id-2': 'yGuW1BP+wVzZ6c6FgVvrvuBw2vkHDuqskpgpGHFW2t5y9sDFGRNGMi/29Ywf1t3t06aL3ma6MME=',
 'x-amz-request-id': 'HBC2WBTYWNDSXDP7',
 'x-amz-server-side-encryption': 'AES256',
 'x-amz-storage-class': 'INTELLIGENT_TIERING'}
Content length: 2146662400
Traceback (most recent call last):
  File "test_main.py", line 41, in <module>
    assert resp.status_code == 206, f'Should be 206, but {resp.status_code} found!'
AssertionError: Should be 206, but 200 found!
```
The environment:
```
- huggingface_hub version: 0.22.2
- Platform: Linux-5.10.205-195.807.amzn2.x86_64-x86_64-with-glibc2.2.5
- Python version: 3.8.1
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /home/user/.cache/huggingface/token
- Has saved token ?: True
- Who am I ?: narugo
- Configured git credential helpers:
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.0.1
- Jinja2: 3.1.3
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: 10.3.0
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 1.24.4
- pydantic: N/A
- aiohttp: N/A
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /home/user/.cache/huggingface/hub
- HF_ASSETS_CACHE: /home/user/.cache/huggingface/assets
- HF_TOKEN_PATH: /home/user/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
```
So, two problems:
- Requesting multipart byteranges is far too slow.
- In some cases the entire file is returned instead of the requested byte ranges, and I have no idea what triggers this.
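For now, the workaround I can think of (my own sketch, not an existing hfutils or huggingface_hub API) is to coalesce nearby ranges into a single span per request and slice the pieces out locally, trading some over-fetch for fewer requests:

```python
from typing import Dict, List, Tuple

import requests


def coalesce(ranges: List[Tuple[int, int]], max_gap: int = 1 << 20) -> List[Tuple[int, int]]:
    """Merge byte ranges separated by at most max_gap bytes into single spans."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def fetch_many(url: str, ranges: List[Tuple[int, int]]) -> Dict[Tuple[int, int], bytes]:
    """Issue one single-range request per coalesced span, then slice out
    each originally requested range locally."""
    out = {}
    for span_start, span_end in coalesce(ranges):
        resp = requests.get(url, headers={'Range': f'bytes={span_start}-{span_end}'})
        assert resp.status_code == 206
        data = resp.content
        for start, end in ranges:
            if span_start <= start and end <= span_end:
                out[(start, end)] = data[start - span_start:end - span_start + 1]
    return out
```

With the gap threshold tuned to the archive layout, this keeps single-range semantics (so `Content-Range` stays reliable everywhere) at the cost of downloading the bytes between nearby files.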