Performance of `datasets` at 1TB scale
What is this?
While processing a large dataset, I monitored the performance of the `datasets` library to see if there are any bottlenecks. The insights from this analysis could guide decision-making on how to improve the performance of the library.
Dataset
The dataset is a 1.1TB extract from GitHub with 120M code files, stored as 5000 `.json.gz` files. The goal of the preprocessing is to remove duplicates and filter files based on their stats. While calculating the hashes for deduplication and the stats for filtering can be parallelized, the filtering itself is run with a single process. After processing, the files are pushed to the Hub.
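The pipeline above can be sketched in plain Python. This is a simplified stand-in (the hash choice, the stats, and the filtering threshold are all illustrative assumptions; the real pipeline runs the hashing/stats step through `datasets` with many workers):

```python
import hashlib

def file_hash(content: str) -> str:
    """Stable content hash used as the deduplication key."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()

def file_stats(content: str) -> dict:
    """Per-file stats used for filtering (illustrative choices)."""
    lines = content.splitlines()
    return {
        "n_lines": len(lines),
        "max_line_len": max((len(line) for line in lines), default=0),
    }

def dedup_and_filter(files, max_line_len=1000):
    """Hashing and stats are parallelizable per file; this pass,
    which keeps the first occurrence of each hash, is inherently
    sequential and runs in a single process, as in the issue."""
    seen, kept = set(), []
    for content in files:
        h = file_hash(content)
        if h in seen:
            continue  # exact duplicate, drop it
        seen.add(h)
        if file_stats(content)["max_line_len"] <= max_line_len:
            kept.append(content)
    return kept
```

For example, `dedup_and_filter(["a\nb", "a\nb", "x" * 2000])` drops the duplicate and the long-line file, keeping only one `"a\nb"`.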
Machine
The experiment was run on an m1 machine on GCP with 96 CPU cores and 1.3TB RAM.
Performance breakdown
- Loading the data: 3.5h (30s from cache)
  - 1h57min single-core loading (not sure what is going on here; corresponds to the second progress bar)
  - 1h10min multi-core JSON reading
  - 20min remaining time before and after the two main steps mentioned above
- Processing the data: 2h (20min from cache)
  - 20min getting ready for processing
  - 40min hashing and file stats (96 workers)
  - 58min deduplication filtering (single worker)
- Saving parquet files: 5h
  - Saving 1000 parquet files (16 workers)
- Pushing to the Hub: 37min
  - 34min `git add`
  - 3min `git push` (several hours with `Repository.git_push()`)
Conclusion
It appears that loading and saving the data are the main bottlenecks at this scale (8.5h), whereas processing (2h) and pushing the data to the Hub (0.5h) are relatively fast. To optimize performance at this scale, it would make sense to take such an end-to-end example and target the bottlenecks, which seem to be loading from and saving to disk. The processing itself runs relatively fast.
Notes
- A `map` operation on a 1TB dataset with 96 workers requires >1TB RAM.
- A `map` operation does not maintain 100% CPU utilization with 96 workers.
- Sometimes when the script crashes, all the data files have a corresponding `*.lock` file in the data folder (or multiple, e.g. `*.lock.lock`, when it happened several times). This causes the cache not to be triggered (which is significant at this scale) - I guess because these count as new data files.
- Parallelizing `to_parquet` decreased the saving time from 17h to 5h; however, adding more workers beyond this point had almost no effect. Not sure if this is: a) a bug in my parallelization logic, b) an I/O limit when loading data from disk into memory, or c) an I/O limit when writing from memory to disk.
- Using `Repository.git_push()` was much slower than using command-line `git-lfs` - 10-20MB/s vs. 300MB/s! The `Dataset.push_to_hub()` function is even slower, as it only uploads one file at a time at only a few MB/s, whereas `Repository.git_push()` pushes files in parallel (each at a similar speed).
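The shard-level parallelization of the save step can be sketched as follows. This is a simplified stand-in, not the actual pipeline code: it writes JSON shards with a thread pool instead of parquet files with processes, and all names are illustrative. The contiguous-split logic mirrors what sharding a dataset into N equal pieces looks like:

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor

def shard_bounds(n_rows: int, num_shards: int):
    """Contiguous (start, end) boundaries that split n_rows into
    num_shards near-equal pieces, spreading the remainder over
    the first shards."""
    q, r = divmod(n_rows, num_shards)
    bounds, start = [], 0
    for i in range(num_shards):
        end = start + q + (1 if i < r else 0)
        bounds.append((start, end))
        start = end
    return bounds

def save_shards(rows, out_dir, num_shards=4, max_workers=4):
    """Write each shard from a separate worker. Threads suffice
    here because the work is I/O-bound; the issue used separate
    processes, each saving one shard to parquet."""
    os.makedirs(out_dir, exist_ok=True)

    def write_one(args):
        idx, (start, end) = args
        path = os.path.join(out_dir, f"shard-{idx:05d}.json")
        with open(path, "w") as f:
            json.dump(rows[start:end], f)
        return path

    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        tasks = enumerate(shard_bounds(len(rows), num_shards))
        return list(ex.map(write_one, tasks))
```

With a fixed number of shards, adding workers only helps until disk I/O saturates, which is consistent with the plateau observed above.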
cc @lhoestq @julien-c @LysandreJik @SBrandeis
> using command line `git-lfs` - [...] 300MB/s!
Which server location did you upload from?
From GCP region us-central1-a.
The most surprising part to me is the saving time. I wonder if it could be due to compression (`ParquetWriter` uses SNAPPY compression by default; it can be turned off with `to_parquet(..., compression=None)`).
+1 to what @mariosasko mentioned. Also, @lvwerra, did you parallelize `to_parquet` using a similar approach as in #2747? (We used multiprocessing at the shard level.) I'm working on a similar PR to add `multi_proc` in `to_parquet`, which might give you a further speed-up.
Stas benchmarked his approach and mine in this gist for the lama dataset when we were working on adding `multi_proc` support for `to_json`.
@mariosasko I did not turn it off but I can try the next time - I have to run the pipeline again, anyway.
@bhavitvyamalik Yes, I also sharded the dataset and used multiprocessing to save each shard. I'll have a closer look at your approach, too.
Is there a way to read from the cache files directly as a dataset in its own right?