Download speed of `Repository`
Performance analysis
Following up on the speed comments in #689, I made a more systematic test of the download speed using `Repository` and `git clone`. I also added `load_dataset` for completeness, although it is not entirely fair since it also loads the data after the download.
Download
The following script downloads the `lvwerra/abc` dataset (~2GB) with the three approaches and measures the time:
```python
from huggingface_hub import Repository
from datasets import load_dataset
from collections import defaultdict
from time import time
import shutil
import subprocess
import pandas as pd

times = defaultdict(list)
repo_name = "lvwerra/abc"
n = 4
size_mb = 2.11 * 1024
folder = './tmp'

shutil.rmtree(folder, ignore_errors=True)
command = ['git', 'clone', f'https://huggingface.co/datasets/{repo_name}', folder]

for _ in range(n):
    # Datasets
    t_start = time()
    _ = load_dataset(repo_name, cache_dir=folder)
    times["`load_dataset`"].append(time() - t_start)
    shutil.rmtree(folder)

    # Repository
    t_start = time()
    repo = Repository(local_dir=folder, clone_from=repo_name, repo_type='dataset')
    times["`Repository`"].append(time() - t_start)
    shutil.rmtree(folder)

    # git-lfs
    t_start = time()
    subprocess.run(command)
    times["`git lfs`"].append(time() - t_start)
    shutil.rmtree(folder)

df_time = pd.DataFrame.from_records(times)
df_speed = size_mb / df_time
print("Time:\n" + df_time.to_markdown())
print("Speed:\n" + df_speed.to_markdown())
```
Time [s]:

| | `Repository` | `git lfs` | `load_dataset` |
|---|---|---|---|
| 0 | 128.685 | 11.772 | 88.7151 |
| 1 | 127.743 | 10.6528 | 91.0475 |
| 2 | 132.341 | 9.3654 | 92.3481 |
| 3 | 130.91 | 10.6284 | 92.0453 |
Speed [MB/s]:

| | `Repository` | `git lfs` | `load_dataset` |
|---|---|---|---|
| 0 | 16.7902 | 183.541 | 24.3548 |
| 1 | 16.9139 | 202.823 | 23.7309 |
| 2 | 16.3263 | 230.704 | 23.3967 |
| 3 | 16.5048 | 203.289 | 23.4736 |
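To summarize the spread across runs, the four speed measurements from the table above can be averaged; a quick pandas sketch (same numbers as in the table, nothing new measured):

```python
import pandas as pd

# Speeds in MB/s from the four download runs above.
speeds = pd.DataFrame({
    "Repository": [16.7902, 16.9139, 16.3263, 16.5048],
    "git lfs": [183.541, 202.823, 230.704, 203.289],
    "load_dataset": [24.3548, 23.7309, 23.3967, 23.4736],
})

# Mean and standard deviation per approach.
summary = speeds.agg(["mean", "std"]).round(2)
print(summary)
```

This puts `git lfs` at roughly 205 MB/s on average versus ~17 MB/s for `Repository`.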
Clearly, `git lfs` alone is the fastest approach. Note that `load_dataset` also gets ~130MB/s download speed during the download itself, but loading the dataset afterwards adds a significant amount of time.
Upload
I wrote a similar script for the upload, but for some reason the upload progress bars from `lfs_log_progress` did not appear, and the results between `Repository.git_push()` and `git push` were much more comparable:
Time [s]:

| | `Repository.git_push()` | `git push` | `Dataset.push_to_hub` |
|---|---|---|---|
| 0 | 3.03916 | 2.78973 | 287.404 |
| 1 | 3.0398 | 2.43085 | 286.857 |
| 2 | 4.0363 | 3.34182 | 290.62 |
| 3 | 4.03646 | 2.55472 | 286.578 |
Speed [MB/s]:

| | `Repository.git_push()` | `git push` | `Dataset.push_to_hub` |
|---|---|---|---|
| 0 | 710.934 | 774.498 | 7.51778 |
| 1 | 710.784 | 888.841 | 7.53211 |
| 2 | 535.302 | 646.547 | 7.4346 |
| 3 | 535.281 | 845.745 | 7.53946 |
Here the comparison to `Dataset.push_to_hub` is also not quite fair, as `push_to_hub` also converts the `Dataset` to Parquet format before actually pushing. But I don't think that explains the full 100x difference.
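One way to check how much of the `push_to_hub` time the conversion accounts for would be to time the stages separately. A minimal sketch of that pattern, with stand-in stage functions (the function names and workloads here are hypothetical placeholders, not the real `datasets` internals):

```python
from time import perf_counter

def timed(label, fn, *args):
    """Run fn, print its wall-clock time, and return its result."""
    t0 = perf_counter()
    result = fn(*args)
    print(f"{label}: {perf_counter() - t0:.3f}s")
    return result

# Hypothetical stand-ins for the two stages of a push_to_hub call.
def convert_to_parquet(rows):
    return [tuple(r) for r in rows]  # placeholder for the real conversion

def upload(shards):
    return len(shards)  # placeholder for the actual network push

rows = [[i, i * 2] for i in range(100_000)]
shards = timed("convert", convert_to_parquet, rows)
n_uploaded = timed("upload", upload, shards)
```

Timing the real stages this way would show whether the conversion or the upload dominates the ~290s.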
Download without `lfs_log_progress`
In #689, `Repository.git_push` was much slower and showed the progress bars. So, to eliminate the only difference, I removed `lfs_log_progress` from here:
https://github.com/huggingface/huggingface_hub/blob/a2d2c540d89823cefa47a09396cd59a1ffc76b27/src/huggingface_hub/repository.py#L575-L589
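For context, `lfs_log_progress` works by pointing the `GIT_LFS_PROGRESS` environment variable at a temporary file and tailing that file from the Python side to render progress bars, which is the suspected overhead. A rough sketch of parsing one line of such a progress file (the line format below is my understanding of git-lfs's progress output, not code taken from huggingface_hub):

```python
def parse_lfs_progress(line):
    """Parse one git-lfs progress line of the assumed form
    '<direction> <file_idx>/<file_count> <bytes_done>/<bytes_total> <path>'."""
    direction, file_part, byte_part, path = line.strip().split(" ", 3)
    file_idx, file_count = (int(x) for x in file_part.split("/"))
    bytes_done, bytes_total = (int(x) for x in byte_part.split("/"))
    return {
        "direction": direction,
        "file_idx": file_idx,
        "file_count": file_count,
        "bytes_done": bytes_done,
        "bytes_total": bytes_total,
        "path": path,
    }

# Illustrative line: 3rd of 12 files, ~50MB of ~2.1GB transferred.
info = parse_lfs_progress("download 3/12 52428800/2264924160 data/train.jsonl")
print(info["direction"], info["bytes_done"] / info["bytes_total"])
```

Every transferred chunk produces such a line, so a slow reader on the Python side could plausibly back-pressure the transfer.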
Time [s]:

| | `Repository` | `git lfs` | `load_dataset` |
|---|---|---|---|
| 0 | 19.5531 | 10.7085 | 95.5541 |
| 1 | 18.8682 | 11.0337 | 93.9598 |
| 2 | 56.2453 | 15.4939 | 93.6108 |
| 3 | 19.0904 | 16.0699 | 90.6005 |
Speed [MB/s]:

| | `Repository` | `git lfs` | `load_dataset` |
|---|---|---|---|
| 0 | 110.501 | 201.768 | 22.6117 |
| 1 | 114.512 | 195.821 | 22.9954 |
| 2 | 38.4146 | 139.451 | 23.0811 |
| 3 | 113.18 | 134.452 | 23.848 |
Bottom line
You can see that the difference decreased 5x just by removing the context manager! I'm not sure, however, where the remaining 2x slowdown comes from.
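One way to probe the remaining gap would be to time a chatty subprocess with its output discarded versus consumed line by line from the parent, since per-line reads in Python can throttle a fast child process. A self-contained sketch (the child here just prints lines; it is a stand-in for git-lfs, not the real workload):

```python
import subprocess
import sys
from time import perf_counter

# A child process that prints many short lines, standing in for
# the verbose output of a git-lfs transfer.
child = [sys.executable, "-c", "print('x\\n' * 20000, end='')"]

def run_discarding():
    """Let the child write straight to /dev/null."""
    t0 = perf_counter()
    subprocess.run(child, stdout=subprocess.DEVNULL)
    return perf_counter() - t0

def run_streaming():
    """Consume the child's output line by line in the parent."""
    t0 = perf_counter()
    with subprocess.Popen(child, stdout=subprocess.PIPE) as proc:
        for _ in proc.stdout:
            pass
    return perf_counter() - t0

print(f"discarded: {run_discarding():.3f}s, streamed: {run_streaming():.3f}s")
```

If streaming is measurably slower here, the same effect could account for part of the residual `Repository` overhead.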
cc @LysandreJik @julien-c @lhoestq
Note that file downloads are cached through a CDN (https://aws.amazon.com/cloudfront/), and as such your experimental setup can be order-dependent. In your first test, for instance, you should look for a response header called `x-cache: Miss from cloudfront` or `x-cache: Hit from cloudfront`. You should only compare comparable things, as cache misses from CloudFront are always going to be significantly slower than cache hits.
I agree that it is not super rigorous, but since we run the commands sequentially in the for-loop, shouldn't that prevent systematic issues? It also wouldn't explain why the times get substantially better without `lfs_log_progress`, right? Or am I missing something?
~~No, because your first call is going to warm the cache and then the subsequent ones will get the cache hits.~~
Oh ok, I think I see what you mean now – you're running the series of commands n times. Yes, then you're right, the for-loop iterations with `_ > 0` should be comparable.
(closing as "wontfix" as Repository usage is deprecated anyway)