
Fix convert_dataset_hf.py hanging with excessive num_workers


Background: PyTorch's DataLoader hangs on several machines (local, VM, Colab) because the num_workers argument is excessive. When using multiple processes, we generally want to scale with the number of CPUs; if we always force a minimum of 64 workers, many systems will hang because they simply do not have 64 CPUs.

This pull request does the following:

  1. Derive num_workers from psutil.cpu_count() so that it scales with the number of CPUs (see the sketch after this list).
    • FIX: min(64, dataset.hf_dataset.n_shards) causes the DataLoader to hang on smaller VMs, since for typical datasets with at least 64 shards it always yields 64 workers, regardless of the machine's CPU count.
    • FIX: PyTorch's DataLoader emits a UserWarning when num_workers exceeds the suggested maximum for the machine (its CPU count): "excessive worker creation might get DataLoader running slow or even freeze".
  2. Add a num_workers argument that overrides the DataLoader's num_workers. If this argument is set to a number, we use that number instead of psutil.cpu_count().
  3. Enable macOS users to run this locally, the same way as it is implemented for Linux users (both are Unix operating systems).
    • FIX: ValueError: prefetch_factor option could only be specified in multiprocessing. let num_workers > 0 to enable multiprocessing, otherwise set prefetch_factor to None.
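
A minimal sketch of the worker/prefetch selection described above, assuming psutil is installed and a PyTorch version where prefetch_factor may be None when num_workers is 0; the build_dataloader name mirrors the script, but the body is illustrative rather than the exact PR diff:

import psutil
from torch.utils.data import DataLoader


def build_dataloader(dataset, batch_size, num_workers=None) -> DataLoader:
    """Build a DataLoader whose worker count scales with the machine.

    num_workers=None means "auto": use psutil.cpu_count() instead of
    forcing a fixed minimum such as 64 workers.
    """
    if num_workers is None:
        # Scale with the number of CPUs; cpu_count() can return None,
        # in which case fall back to single-process loading.
        num_workers = psutil.cpu_count() or 0

    # prefetch_factor is only valid with multiprocessing (num_workers > 0);
    # otherwise it must stay None, per the ValueError quoted above.
    prefetch_factor = (max(1, 2 * batch_size // num_workers)
                       if num_workers > 0 else None)

    return DataLoader(
        dataset=dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        prefetch_factor=prefetch_factor,
    )

With --num_workers 8 on the command line (as in the usage example below), the CLI value is passed straight through instead of the auto-detected count.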

Usage example:

python data_prep/convert_dataset_hf.py \
  --dataset c4 --data_subset en \
  --out_root my-copy-c4 --splits train_small val_small \
  --concat_tokens 2048 \
  --tokenizer EleutherAI/gpt-neox-20b \
  --eos_text '<|endoftext|>' \
  --num_workers 8

casper-hansen avatar Jun 02 '23 15:06 casper-hansen

@codestar12 As part of this pull request, do you want me to apply the same implementation to the convert_finetuning_dataset.py script? And remove the build_dataloader from convert_dataset_json.py, since it is not used in that specific conversion?

casper-hansen avatar Jun 02 '23 17:06 casper-hansen

Hey @casperbh96, this all looks great. Yeah, if you could add it to the convert_finetuning_dataset.py script and remove the build_dataloader from convert_dataset_json.py, I'll approve the changes.

codestar12 avatar Jun 07 '23 13:06 codestar12

@codestar12 Thanks. I have now implemented what we agreed to. Additionally, I have updated the tests to include the num_workers argument.

Note that in convert_finetuning_dataset.py, the implementation is the same for Linux but slightly different for macOS, since there is an incompatibility between multiprocessing and the data loaders in that script. Before this PR, the script could not run on macOS at all, so I consider it an improvement even though multiprocessing is disabled there.
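
For reference, the macOS fallback described above could look roughly like this (the helper name get_num_workers is hypothetical, not the actual code in convert_finetuning_dataset.py): on Darwin the script drops to num_workers=0 so it still runs, just without multiprocessing.

import platform
from typing import Optional

import psutil


def get_num_workers(requested: Optional[int] = None) -> int:
    """Pick a DataLoader worker count for the conversion script.

    requested is the optional --num_workers CLI value; None means "auto".
    On macOS this script's data loaders are incompatible with
    multiprocessing, so fall back to single-process loading there.
    """
    if platform.system() == 'Darwin':
        # Multiprocessing disabled on macOS; conversion still works,
        # just slower because everything runs in one process.
        return 0
    if requested is not None:
        return requested
    # Linux and other systems: scale with the number of CPUs.
    return psutil.cpu_count() or 0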

casper-hansen avatar Jun 07 '23 18:06 casper-hansen