
Download speed of `Repository`


Performance analysis

Following up on the speed comments in #689, I ran a more systematic test of download speed using `Repository` and `git clone`. I also added `load_dataset` for completeness, although the comparison is not entirely fair since it also loads the data after downloading it.

Download

The following script downloads the `lvwerra/abc` dataset (~2GB) with each of the three approaches and measures the time:

```python
from huggingface_hub import Repository
from datasets import load_dataset
from collections import defaultdict
from time import time
import shutil
import subprocess
import pandas as pd


times = defaultdict(list)
repo_name = "lvwerra/abc"
n = 4  # number of repetitions
size_mb = 2.11*1024  # dataset size in MB (~2.11GB)
folder = './tmp'
shutil.rmtree(folder, ignore_errors=True)
command = ['git', 'clone', f'https://huggingface.co/datasets/{repo_name}', folder]

for _ in range(n):
    # Datasets
    t_start = time()
    _ = load_dataset(repo_name, cache_dir=folder)
    times["`load_dataset`"].append(time()-t_start)
    shutil.rmtree(folder)

    # Repository
    t_start = time()
    repo = Repository(local_dir=folder, clone_from=repo_name, repo_type='dataset')
    times["`Repository`"].append(time()-t_start)
    shutil.rmtree(folder)

    # git-lfs
    t_start = time()
    subprocess.run(command)
    times["`git lfs`"].append(time()-t_start)
    shutil.rmtree(folder)

df_time = pd.DataFrame.from_records(times)
df_speed = size_mb/df_time
print("Time:\n"+df_time.to_markdown())
print("Speed:\n"+df_speed.to_markdown())
```

Time [s]:

|   | `Repository` | `git lfs` | `load_dataset` |
|--:|-------------:|----------:|---------------:|
| 0 | 128.685 | 11.772 | 88.7151 |
| 1 | 127.743 | 10.6528 | 91.0475 |
| 2 | 132.341 | 9.3654 | 92.3481 |
| 3 | 130.91 | 10.6284 | 92.0453 |

Speed [MB/s]:

|   | `Repository` | `git lfs` | `load_dataset` |
|--:|-------------:|----------:|---------------:|
| 0 | 16.7902 | 183.541 | 24.3548 |
| 1 | 16.9139 | 202.823 | 23.7309 |
| 2 | 16.3263 | 230.704 | 23.3967 |
| 3 | 16.5048 | 203.289 | 23.4736 |

Clearly, `git lfs` alone is the fastest approach. Note that `load_dataset` also reaches ~130MB/s raw download speed, but loading the dataset afterwards adds a significant amount of time.

Upload

I wrote a similar script for the upload, but for some reason the upload progress bars from `lfs_log_progress` did not appear, and the results for `Repository.git_push()` and `git push` were much more comparable.
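For reference, here is a minimal sketch of the upload benchmark (the scratch repo name, payload size, and commit handling are illustrative, not the exact script I ran):

```python
# Sketch of the upload benchmark; the scratch repo "lvwerra/upload-test" is
# hypothetical and assumed to already exist on the Hub.
from huggingface_hub import Repository
from datasets import Dataset
from collections import defaultdict
from time import time
import os
import subprocess

times = defaultdict(list)
repo_name = "lvwerra/upload-test"  # hypothetical scratch repo
folder = "./tmp"

repo = Repository(local_dir=folder, clone_from=repo_name, repo_type="dataset")

def write_payload(path, mb=100):
    # write fresh random bytes so every push actually transfers data
    with open(path, "wb") as f:
        f.write(os.urandom(mb * 1024 * 1024))

for _ in range(4):
    # Repository.git_push(): stage and commit first, then time only the push
    write_payload(os.path.join(folder, "data.bin"))
    repo.git_add(auto_lfs_track=True)
    repo.git_commit("benchmark commit")
    t_start = time()
    repo.git_push()
    times["`Repository.git_push()`"].append(time() - t_start)

    # plain git push on the same clone
    write_payload(os.path.join(folder, "data.bin"))
    subprocess.run(["git", "add", "data.bin"], cwd=folder)
    subprocess.run(["git", "commit", "-m", "benchmark commit"], cwd=folder)
    t_start = time()
    subprocess.run(["git", "push"], cwd=folder)
    times["`git push`"].append(time() - t_start)

    # Dataset.push_to_hub: converts the Dataset to parquet before uploading
    ds = Dataset.from_dict({"data": list(range(100_000))})  # placeholder data
    t_start = time()
    ds.push_to_hub(repo_name)
    times["`Dataset.push_to_hub`"].append(time() - t_start)
```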

Time [s]:

|   | `Repository.git_push()` | `git push` | `Dataset.push_to_hub` |
|--:|------------------------:|-----------:|----------------------:|
| 0 | 3.03916 | 2.78973 | 287.404 |
| 1 | 3.0398 | 2.43085 | 286.857 |
| 2 | 4.0363 | 3.34182 | 290.62 |
| 3 | 4.03646 | 2.55472 | 286.578 |

Speed [MB/s]:

|   | `Repository.git_push()` | `git push` | `Dataset.push_to_hub` |
|--:|------------------------:|-----------:|----------------------:|
| 0 | 710.934 | 774.498 | 7.51778 |
| 1 | 710.784 | 888.841 | 7.53211 |
| 2 | 535.302 | 646.547 | 7.4346 |
| 3 | 535.281 | 845.745 | 7.53946 |

Here the comparison to `Dataset.push_to_hub` is also not quite fair, as `push_to_hub` also converts the `Dataset` to parquet format before actually pushing. But I don't think that explains the full ~100x difference.
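One way to sanity-check that would be to time the parquet conversion on its own (sketch; assumes the dataset has a `train` split):

```python
# Sketch: measure only the parquet conversion, without any network transfer.
from datasets import load_dataset
from time import time

ds = load_dataset("lvwerra/abc", split="train")  # assumes a "train" split
t_start = time()
ds.to_parquet("./abc.parquet")  # Dataset.to_parquet writes a local parquet file
print(f"parquet conversion: {time() - t_start:.1f}s")
```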

Download without `lfs_log_progress`

In #689, `Repository.git_push` was much slower and did show the progress bars. So, to eliminate the only difference, I removed `lfs_log_progress` from here: https://github.com/huggingface/huggingface_hub/blob/a2d2c540d89823cefa47a09396cd59a1ffc76b27/src/huggingface_hub/repository.py#L575-L589
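For context, `lfs_log_progress` works by pointing `git-lfs` at a progress file via the `GIT_LFS_PROGRESS` environment variable and tailing that file from a background thread. A simplified sketch of the idea (not the library's exact code):

```python
# Simplified sketch of the lfs_log_progress mechanism (not the exact library code).
# git-lfs writes progress lines to the file named in GIT_LFS_PROGRESS; a
# background thread tails that file to drive the progress bars.
import os
import subprocess
import tempfile
import threading
import time
from contextlib import contextmanager

@contextmanager
def lfs_progress_sketch():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    stop = threading.Event()

    def tail():
        with open(path) as fh:
            while not stop.is_set():
                line = fh.readline()
                if line:
                    print(line.rstrip())  # the real implementation renders tqdm bars
                else:
                    time.sleep(0.1)

    thread = threading.Thread(target=tail, daemon=True)
    thread.start()
    env = dict(os.environ, GIT_LFS_PROGRESS=path)
    try:
        yield env  # run git/git-lfs commands with this environment
    finally:
        stop.set()
        thread.join(timeout=1)
        os.remove(path)

# usage:
# with lfs_progress_sketch() as env:
#     subprocess.run(["git", "clone", "https://huggingface.co/datasets/lvwerra/abc", "./tmp"], env=env)
```

If the overhead sits in this tailing/environment plumbing rather than in the transfer itself, that would be consistent with the numbers below.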

Time [s]:

|   | `Repository` | `git lfs` | `load_dataset` |
|--:|-------------:|----------:|---------------:|
| 0 | 19.5531 | 10.7085 | 95.5541 |
| 1 | 18.8682 | 11.0337 | 93.9598 |
| 2 | 56.2453 | 15.4939 | 93.6108 |
| 3 | 19.0904 | 16.0699 | 90.6005 |

Speed [MB/s]:

|   | `Repository` | `git lfs` | `load_dataset` |
|--:|-------------:|----------:|---------------:|
| 0 | 110.501 | 201.768 | 22.6117 |
| 1 | 114.512 | 195.821 | 22.9954 |
| 2 | 38.4146 | 139.451 | 23.0811 |
| 3 | 113.18 | 134.452 | 23.848 |

Bottom line

You can see that the gap shrank ~5x just by removing the context manager! I'm not sure, however, where the remaining ~2x slowdown comes from.

cc @LysandreJik @julien-c @lhoestq

lvwerra avatar Feb 22 '22 10:02 lvwerra

Note that file downloads are cached through a CDN (https://aws.amazon.com/cloudfront/), and as such your experimental setup can be order-dependent.

In your first test, for instance, you should look for a header called `x-cache: Miss from cloudfront` or `x-cache: Hit from cloudfront`. You should only compare comparable things, as cache misses from CloudFront are always going to be significantly slower than cache hits.
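For example, something like this (the file name here is hypothetical):

```python
# Check whether a Hub file URL is served from the CDN cache.
import requests

url = "https://huggingface.co/datasets/lvwerra/abc/resolve/main/data.bin"  # hypothetical file
r = requests.head(url, allow_redirects=True)
print(r.headers.get("x-cache"))  # "Miss from cloudfront" vs "Hit from cloudfront"
```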

julien-c avatar Feb 22 '22 11:02 julien-c

I agree that it is not super rigorous, but since we run the commands sequentially in the for-loop, shouldn't that prevent systematic issues? Also, that wouldn't explain why the times get substantially better without `lfs_log_progress`, right? Or am I missing something?

lvwerra avatar Feb 22 '22 13:02 lvwerra

~~no because your first call is going to warm the cache and then the subsequent ones will get the cache hits~~

Oh ok, I think I see what you mean now: you're running the series of commands n times. Yes, you're right then, the for-loop iterations with `_ > 0` should be comparable.

julien-c avatar Feb 22 '22 13:02 julien-c

(closing as "wontfix" as `Repository` usage is deprecated anyway)

Wauplin avatar Sep 29 '23 15:09 Wauplin