
Download speed of `Repository`


Performance analysis

Following up on the speed comments in #689, I ran a more systematic test of download speed using `Repository` and `git clone`. I also added `load_dataset` for completeness, although the comparison is not entirely fair since it also loads the data after downloading it.

Download

The following script downloads the `lvwerra/abc` dataset (~2GB) with each of the three approaches and measures the time:

```python
from huggingface_hub import Repository
from datasets import load_dataset
from collections import defaultdict
from time import time
import shutil
import subprocess
import pandas as pd


times = defaultdict(list)
repo_name = "lvwerra/abc"
n = 4  # number of repetitions
size_mb = 2.11*1024  # dataset size in MB (~2.11GB)
folder = './tmp'
shutil.rmtree(folder, ignore_errors=True)
command = ['git', 'clone', f'https://huggingface.co/datasets/{repo_name}', folder]

for _ in range(n):
    # Datasets
    t_start = time()
    _ = load_dataset(repo_name, cache_dir=folder)
    times["`load_dataset`"].append(time()-t_start)
    shutil.rmtree(folder)

    # Repository
    t_start = time()
    repo = Repository(local_dir=folder, clone_from=repo_name, repo_type='dataset')
    times["`Repository`"].append(time()-t_start)
    shutil.rmtree(folder)

    # git-lfs
    t_start = time()
    subprocess.run(command)
    times["`git lfs`"].append(time()-t_start)
    shutil.rmtree(folder)

df_time = pd.DataFrame.from_records(times)
df_speed = size_mb/df_time
print("Time:\n"+df_time.to_markdown())
print("Speed:\n"+df_speed.to_markdown())
```

Time [s]:

|   | `Repository` | `git lfs` | `load_dataset` |
|--:|-------------:|----------:|---------------:|
| 0 | 128.685 | 11.772 | 88.7151 |
| 1 | 127.743 | 10.6528 | 91.0475 |
| 2 | 132.341 | 9.3654 | 92.3481 |
| 3 | 130.91 | 10.6284 | 92.0453 |

Speed [MB/s]:

|   | `Repository` | `git lfs` | `load_dataset` |
|--:|-------------:|----------:|---------------:|
| 0 | 16.7902 | 183.541 | 24.3548 |
| 1 | 16.9139 | 202.823 | 23.7309 |
| 2 | 16.3263 | 230.704 | 23.3967 |
| 3 | 16.5048 | 203.289 | 23.4736 |

Clearly, `git lfs` alone is the fastest approach. Note that `load_dataset` also reaches ~130MB/s raw download speed, but loading the dataset afterwards adds a significant amount of time.

Upload

I wrote a similar script for the upload, but for some reason the upload progress bars from `lfs_log_progress` did not appear, and the results for `Repository.git_push()` and `git push` were much more comparable.
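For reference, here is a minimal sketch of the upload benchmark (the scratch repo name, payload size, and commit handling are illustrative, not the exact script I ran):

```python
# Sketch of the upload benchmark; the scratch repo "lvwerra/upload-test" is
# hypothetical and assumed to already exist on the Hub.
from huggingface_hub import Repository
from datasets import Dataset
from collections import defaultdict
from time import time
import os
import subprocess

times = defaultdict(list)
repo_name = "lvwerra/upload-test"  # hypothetical scratch repo
folder = "./tmp"

repo = Repository(local_dir=folder, clone_from=repo_name, repo_type="dataset")

def write_payload(path, mb=100):
    # write fresh random bytes so every push actually transfers data
    with open(path, "wb") as f:
        f.write(os.urandom(mb * 1024 * 1024))

for _ in range(4):
    # Repository.git_push(): stage and commit first, then time only the push
    write_payload(os.path.join(folder, "data.bin"))
    repo.git_add(auto_lfs_track=True)
    repo.git_commit("benchmark commit")
    t_start = time()
    repo.git_push()
    times["`Repository.git_push()`"].append(time() - t_start)

    # plain git push on the same clone
    write_payload(os.path.join(folder, "data.bin"))
    subprocess.run(["git", "add", "data.bin"], cwd=folder)
    subprocess.run(["git", "commit", "-m", "benchmark commit"], cwd=folder)
    t_start = time()
    subprocess.run(["git", "push"], cwd=folder)
    times["`git push`"].append(time() - t_start)

    # Dataset.push_to_hub: converts the Dataset to parquet before uploading
    ds = Dataset.from_dict({"data": list(range(100_000))})  # placeholder data
    t_start = time()
    ds.push_to_hub(repo_name)
    times["`Dataset.push_to_hub`"].append(time() - t_start)
```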

Time [s]:

|   | `Repository.git_push()` | `git push` | `Dataset.push_to_hub` |
|--:|------------------------:|-----------:|----------------------:|
| 0 | 3.03916 | 2.78973 | 287.404 |
| 1 | 3.0398 | 2.43085 | 286.857 |
| 2 | 4.0363 | 3.34182 | 290.62 |
| 3 | 4.03646 | 2.55472 | 286.578 |

Speed [MB/s]:

|   | `Repository.git_push()` | `git push` | `Dataset.push_to_hub` |
|--:|------------------------:|-----------:|----------------------:|
| 0 | 710.934 | 774.498 | 7.51778 |
| 1 | 710.784 | 888.841 | 7.53211 |
| 2 | 535.302 | 646.547 | 7.4346 |
| 3 | 535.281 | 845.745 | 7.53946 |

Here the comparison to `Dataset.push_to_hub` is also not quite fair, as `push_to_hub` also converts the `Dataset` to parquet format before actually pushing. But I don't think that explains the full ~100x difference.
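One way to sanity-check that would be to time the parquet conversion on its own (sketch; assumes the dataset has a `train` split):

```python
# Sketch: measure only the parquet conversion, without any network transfer.
from datasets import load_dataset
from time import time

ds = load_dataset("lvwerra/abc", split="train")  # assumes a "train" split
t_start = time()
ds.to_parquet("./abc.parquet")  # Dataset.to_parquet writes a local parquet file
print(f"parquet conversion: {time() - t_start:.1f}s")
```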

Download without `lfs_log_progress`

In #689, `Repository.git_push` was much slower and did show the progress bars. So, to eliminate the only difference, I removed `lfs_log_progress` from here: https://github.com/huggingface/huggingface_hub/blob/a2d2c540d89823cefa47a09396cd59a1ffc76b27/src/huggingface_hub/repository.py#L575-L589
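For context, `lfs_log_progress` works by pointing `git-lfs` at a progress file via the `GIT_LFS_PROGRESS` environment variable and tailing that file from a background thread. A simplified sketch of the idea (not the library's exact code):

```python
# Simplified sketch of the lfs_log_progress mechanism (not the exact library code).
# git-lfs writes progress lines to the file named in GIT_LFS_PROGRESS; a
# background thread tails that file to drive the progress bars.
import os
import subprocess
import tempfile
import threading
import time
from contextlib import contextmanager

@contextmanager
def lfs_progress_sketch():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    stop = threading.Event()

    def tail():
        with open(path) as fh:
            while not stop.is_set():
                line = fh.readline()
                if line:
                    print(line.rstrip())  # the real implementation renders tqdm bars
                else:
                    time.sleep(0.1)

    thread = threading.Thread(target=tail, daemon=True)
    thread.start()
    env = dict(os.environ, GIT_LFS_PROGRESS=path)
    try:
        yield env  # run git/git-lfs commands with this environment
    finally:
        stop.set()
        thread.join(timeout=1)
        os.remove(path)

# usage:
# with lfs_progress_sketch() as env:
#     subprocess.run(["git", "clone", "https://huggingface.co/datasets/lvwerra/abc", "./tmp"], env=env)
```

If the overhead sits in this tailing/environment plumbing rather than in the transfer itself, that would be consistent with the numbers below.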

Time [s]:

|   | `Repository` | `git lfs` | `load_dataset` |
|--:|-------------:|----------:|---------------:|
| 0 | 19.5531 | 10.7085 | 95.5541 |
| 1 | 18.8682 | 11.0337 | 93.9598 |
| 2 | 56.2453 | 15.4939 | 93.6108 |
| 3 | 19.0904 | 16.0699 | 90.6005 |

Speed [MB/s]:

|   | `Repository` | `git lfs` | `load_dataset` |
|--:|-------------:|----------:|---------------:|
| 0 | 110.501 | 201.768 | 22.6117 |
| 1 | 114.512 | 195.821 | 22.9954 |
| 2 | 38.4146 | 139.451 | 23.0811 |
| 3 | 113.18 | 134.452 | 23.848 |

Bottom line

You can see that the gap shrank ~5x just by removing the context manager! I'm not sure, however, where the remaining ~2x slowdown comes from.

cc @LysandreJik @julien-c @lhoestq

lvwerra avatar Feb 22 '22 10:02 lvwerra

Note that file downloads are cached through a CDN (https://aws.amazon.com/cloudfront/), and as such your experimental setup can be order-dependent.

In your first test, for instance, you should look for a header called `x-cache: Miss from cloudfront` or `x-cache: Hit from cloudfront`. You should only compare comparable things, as cache misses from CloudFront are always going to be significantly slower than cache hits.
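For example, something like this (the file name here is hypothetical):

```python
# Check whether a Hub file URL is served from the CDN cache.
import requests

url = "https://huggingface.co/datasets/lvwerra/abc/resolve/main/data.bin"  # hypothetical file
r = requests.head(url, allow_redirects=True)
print(r.headers.get("x-cache"))  # "Miss from cloudfront" vs "Hit from cloudfront"
```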

julien-c avatar Feb 22 '22 11:02 julien-c

I agree that it is not super rigorous, but since we run the commands sequentially in the for-loop, shouldn't that prevent systematic issues? Also, that wouldn't explain why the times get substantially better without `lfs_log_progress`, right? Or am I missing something?

lvwerra avatar Feb 22 '22 13:02 lvwerra

~~no because your first call is going to warm the cache and then the subsequent ones will get the cache hits~~

Oh ok, I think I see what you mean now: you're running the series of commands n times. Yes, you're right then, the for-loop iterations with `_ > 0` should be comparable.

julien-c avatar Feb 22 '22 13:02 julien-c

(closing as "wontfix" as `Repository` usage is deprecated anyway)

Wauplin avatar Sep 29 '23 15:09 Wauplin