llm-foundry
Fix convert_dataset_hf.py hanging with excessive num_workers
Background: PyTorch's DataLoader hangs on several machines (locally, on a VM, and on Colab) because the num_workers argument is excessive. Generally, when using multiple processes, we want to scale with the number of CPUs; if we always force a minimum of 64 workers, many systems will hang because they simply do not have 64 CPUs.
This pull request does the following:
- Implement `num_workers` based on `psutil.cpu_count()`, scaling with the number of CPUs.
  - FIX: `min(64, dataset.hf_dataset.n_shards)` causes the DataLoader to hang on VMs that are not large, since it always forces a minimum of 64 workers.
  - FIX: PyTorch's DataLoader gives a UserWarning suggesting that `num_workers` should equal the number of CPUs on your machine: "excessive worker creation might get DataLoader running slow or even freeze".
- Implement a `num_workers` argument that sets the DataLoader's `num_workers`. If this argument is set to a number, we run with that number instead of `psutil.cpu_count()`.
- Enable macOS users to run this locally, the same way as it is implemented for Linux users (both are Unix operating systems).
  - FIX: `ValueError: prefetch_factor option could only be specified in multiprocessing.` Let `num_workers > 0` to enable multiprocessing; otherwise set `prefetch_factor` to `None`.
Usage example:
```bash
python data_prep/convert_dataset_hf.py \
  --dataset c4 --data_subset en \
  --out_root my-copy-c4 --splits train_small val_small \
  --concat_tokens 2048 \
  --tokenizer EleutherAI/gpt-neox-20b \
  --eos_text '<|endoftext|>' \
  --num_workers 8
```
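The optional `--num_workers` flag shown above could be wired up with `argparse` roughly as follows. This is a sketch, not the script's actual parser, which defines many more arguments; the `default=None` is what lets the script fall back to the CPU-count heuristic when the flag is omitted.

```python
import argparse

def parse_args(argv=None) -> argparse.Namespace:
    # Minimal sketch of the --num_workers flag; the real
    # convert_dataset_hf.py parser has many more options.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--num_workers', type=int, default=None,
        help='DataLoader workers; defaults to scaling with the CPU count')
    return parser.parse_args(argv)
```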
@codestar12 As part of this pull request, do you want me to spread the same implementation to the convert_finetuning_dataset.py script? And remove the build_dataloader from the convert_dataset_json.py file since it is not used in that specific conversion?
hey @casperbh96, this all looks great. Yeah, if you could add it to the convert_finetuning_dataset.py script and remove the build_dataloader from convert_dataset_json.py, I'll approve the changes.
@codestar12 Thanks. I have now implemented what we agreed to. Additionally, I have updated the tests to include the num_workers argument.
Note that in convert_finetuning_dataset.py, the implementation is the same for Linux but slightly different for Mac, since there is an incompatibility between multiprocessing and the data loaders in that script. Before this PR you could not run the script on Mac at all, so I consider it an improvement even though multiprocessing is disabled there.