Performance of `datasets` at 1TB scale
What is this?
While processing a large dataset, I monitored the performance of the `datasets` library to see if there are any bottlenecks. The insights from this analysis could guide decision-making on how to improve the performance of the library.
Dataset
The dataset is a 1.1TB extract from GitHub with 120M code files, stored as 5000 `.json.gz` files. The goal of the preprocessing is to remove duplicates and filter files based on their stats. While calculating the hashes for deduplication and the stats for filtering can be parallelized, the filtering itself is run with a single process. After processing, the files are pushed to the Hub.
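The pipeline above can be sketched in plain Python. This is a simplified stand-in (the hash choice, the stats, and the filtering threshold are all illustrative assumptions; the real pipeline runs the hashing/stats step through `datasets` with many workers):

```python
import hashlib

def file_hash(content: str) -> str:
    """Stable content hash used as the deduplication key."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()

def file_stats(content: str) -> dict:
    """Per-file stats used for filtering (illustrative choices)."""
    lines = content.splitlines()
    return {
        "n_lines": len(lines),
        "max_line_len": max((len(line) for line in lines), default=0),
    }

def dedup_and_filter(files, max_line_len=1000):
    """Hashing and stats are parallelizable per file; this pass,
    which keeps the first occurrence of each hash, is inherently
    sequential and runs in a single process, as in the issue."""
    seen, kept = set(), []
    for content in files:
        h = file_hash(content)
        if h in seen:
            continue  # exact duplicate, drop it
        seen.add(h)
        if file_stats(content)["max_line_len"] <= max_line_len:
            kept.append(content)
    return kept
```

For example, `dedup_and_filter(["a\nb", "a\nb", "x" * 2000])` drops the duplicate and the long-line file, keeping only one `"a\nb"`.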
Machine
The experiment was run on an m1 machine on GCP with 96 CPU cores and 1.3TB RAM.
Performance breakdown
- Loading the data: 3.5h (30s from cache)
  - 1h57min single-core loading (not sure what is going on here; corresponds to the second progress bar)
  - 1h10min multi-core JSON reading
  - 20min remaining time before and after the two main steps mentioned above
- Processing the data: 2h (20min from cache)
  - 20min getting ready for processing
  - 40min hashing and file stats (96 workers)
  - 58min deduplication filtering (single worker)
- Saving parquet files: 5h
  - Saving 1000 parquet files (16 workers)
- Pushing to the Hub: 37min
  - 34min `git add`
  - 3min `git push` (several hours with `Repository.git_push()`)
Conclusion
It appears that loading and saving the data are the main bottlenecks at this scale (8.5h), whereas processing (2h) and pushing the data to the Hub (0.5h) are relatively fast. To optimize performance at this scale, it would make sense to take such an end-to-end example and target the bottlenecks, which seem to be loading from and saving to disk. The processing itself runs relatively fast.
Notes
- A `map` operation on a 1TB dataset with 96 workers requires >1TB RAM.
- A `map` operation does not maintain 100% CPU utilization with 96 workers.
- Sometimes when the script crashes, all the data files have a corresponding `*.lock` file in the data folder (or multiple, e.g. `*.lock.lock`, when it happened several times). This causes the cache not to be triggered (which is significant at this scale) - I guess because these count as new data files.
- Parallelizing `to_parquet` decreased the saving time from 17h to 5h; however, adding more workers beyond this point had almost no effect. Not sure if this is: a) a bug in my parallelization logic, b) an I/O limit when loading data from disk into memory, or c) an I/O limit when writing from memory to disk.
- Using `Repository.git_push()` was much slower than using command-line `git-lfs` - 10-20MB/s vs. 300MB/s! The `Dataset.push_to_hub()` function is even slower, as it only uploads one file at a time at only a few MB/s, whereas `Repository.git_push()` pushes files in parallel (each at a similar speed).
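The shard-level parallelization of the save step can be sketched as follows. This is a simplified stand-in, not the actual pipeline code: it writes JSON shards with a thread pool instead of parquet files with processes, and all names are illustrative. The contiguous-split logic mirrors what sharding a dataset into N equal pieces looks like:

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor

def shard_bounds(n_rows: int, num_shards: int):
    """Contiguous (start, end) boundaries that split n_rows into
    num_shards near-equal pieces, spreading the remainder over
    the first shards."""
    q, r = divmod(n_rows, num_shards)
    bounds, start = [], 0
    for i in range(num_shards):
        end = start + q + (1 if i < r else 0)
        bounds.append((start, end))
        start = end
    return bounds

def save_shards(rows, out_dir, num_shards=4, max_workers=4):
    """Write each shard from a separate worker. Threads suffice
    here because the work is I/O-bound; the issue used separate
    processes, each saving one shard to parquet."""
    os.makedirs(out_dir, exist_ok=True)

    def write_one(args):
        idx, (start, end) = args
        path = os.path.join(out_dir, f"shard-{idx:05d}.json")
        with open(path, "w") as f:
            json.dump(rows[start:end], f)
        return path

    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        tasks = enumerate(shard_bounds(len(rows), num_shards))
        return list(ex.map(write_one, tasks))
```

With a fixed number of shards, adding workers only helps until disk I/O saturates, which is consistent with the plateau observed above.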
cc @lhoestq @julien-c @LysandreJik @SBrandeis
> using command line `git-lfs` - [...] 300MB/s!
Which server location did you upload from?
From GCP region us-central1-a.
The most surprising part to me is the saving time. I wonder if it could be due to compression (`ParquetWriter` uses SNAPPY compression by default; it can be turned off with `to_parquet(..., compression=None)`).
+1 to what @mariosasko mentioned. Also, @lvwerra, did you parallelize `to_parquet` using a similar approach as in #2747? (We used multiprocessing at the shard level.) I'm working on a similar PR to add `multi_proc` in `to_parquet`, which might give you a further speed-up.
Stas benchmarked his approach and mine in this gist for the lama dataset when we were working on adding `multi_proc` support for `to_json`.
@mariosasko I did not turn it off but I can try the next time - I have to run the pipeline again, anyway.
@bhavitvyamalik Yes, I also sharded the dataset and used multiprocessing to save each shard. I'll have a closer look at your approach, too.
Is there a way to read from the cache files directly as a dataset in its own right?