ERROR:composer.cli.launcher:Rank 2 crashed with exit code -7
I am using a g5.12xlarge instance on AWS with 96 GB of GPU memory across 4 GPUs. I am attempting to finetune a model on a custom dataset. To do this, I created a custom preprocessing function in the /llmfoundry/data/finetuning/tasks.py script and slightly adjusted the 7b_dolly_sft.yaml file.
I am using the provided Docker image, and when I execute the command composer train/train.py train/yamls/mpt/finetune/7b_dolly_sft.yaml, the training script starts up, downloads the model, loads my dataset, and then errors out at "Building trainer..." with:
ERROR:composer.cli.launcher:Rank 2 crashed with exit code -7.
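For reference, my preprocessing function follows the same pattern as the existing functions in tasks.py: it takes one raw example dict and returns a dict with prompt and response keys. A minimal sketch (the column names instruction and output are placeholders for my dataset, not the real field names):

def custom_preprocessing_function(inp: dict) -> dict:
    # Map one raw example to the prompt/response format the finetuning dataloader expects.
    # 'instruction' and 'output' are placeholder column names, not the actual schema.
    return {
        'prompt': inp['instruction'] + '\n',
        'response': inp['output'],
    }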
Update: I just followed your finetuning example, but got the same error again:
root@9823knkjb:/llm-foundry/scripts/train# composer train.py finetune_example/gpt2-arc-easy.yaml
WARNING: device_microbatch_size > device_batch_size, will be reduced from 8 -> 2.
Initializing model...
cfg.n_params=1.24e+08
Building train loader...
Using pad_token, but it is not set yet.
Importing preprocessing function via: from finetune_example.preprocessing import multiple_choice
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-ef475c8714b1bfea/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Building eval loader...
Building trainer...
ERROR:composer.cli.launcher:Rank 0 crashed with exit code -7.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 0 (PID 92) exited with code -7
Global rank 1 (PID 93) exited with code -7
----------Begin global rank 1 STDOUT----------
WARNING: device_microbatch_size > device_batch_size, will be reduced from 8 -> 2.
Initializing model...
cfg.n_params=1.24e+08
Building train loader...
Importing preprocessing function via: from finetune_example.preprocessing import multiple_choice
Building eval loader...
Building trainer...
----------End global rank 1 STDOUT----------
----------Begin global rank 1 STDERR----------
Downloading (…)lve/main/config.json: 100%|██████████| 665/665 [00:00<00:00, 7.04MB/s]
Downloading (…)olve/main/vocab.json: 100%|██████████| 1.04M/1.04M [00:00<00:00, 80.5MB/s]
Downloading (…)olve/main/merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 140MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 1.36M/1.36M [00:00<00:00, 138MB/s]
Using pad_token, but it is not set yet.
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-ef475c8714b1bfea/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-ef475c8714b1bfea/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-904ebcd47a9c66c8.arrow
----------End global rank 1 STDERR----------
Global rank 2 (PID 94) exited with code -7
----------Begin global rank 2 STDOUT----------
WARNING: device_microbatch_size > device_batch_size, will be reduced from 8 -> 2.
Initializing model...
cfg.n_params=1.24e+08
Building train loader...
Importing preprocessing function via: from finetune_example.preprocessing import multiple_choice
Building eval loader...
Building trainer...
----------End global rank 2 STDOUT----------
----------Begin global rank 2 STDERR----------
Downloading pytorch_model.bin: 100%|██████████| 548M/548M [00:01<00:00, 486MB/s]
Using pad_token, but it is not set yet.
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-ef475c8714b1bfea/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-ef475c8714b1bfea/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-904ebcd47a9c66c8.arrow
----------End global rank 2 STDERR----------
Global rank 3 (PID 95) exited with code -7
----------Begin global rank 3 STDOUT----------
WARNING: device_microbatch_size > device_batch_size, will be reduced from 8 -> 2.
Initializing model...
cfg.n_params=1.24e+08
Building train loader...
Importing preprocessing function via: from finetune_example.preprocessing import multiple_choice
Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-ef475c8714b1bfea/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-ef475c8714b1bfea/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.
Building eval loader...
Building trainer...
----------End global rank 3 STDOUT----------
----------Begin global rank 3 STDERR----------
Downloading (…)neration_config.json: 100%|██████████| 124/124 [00:00<00:00, 1.12MB/s]
Using pad_token, but it is not set yet.
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 10305.42it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 927.12it/s]
Generating train split: 0 examples [00:00, ? examples/s]
Map: 0%| | 0/100 [00:00<?, ? examples/s]
----------End global rank 3 STDERR----------
ERROR:composer.cli.launcher:Global rank 0 (PID 92) exited with code -7
We have not battle-tested MPT-7B finetuning on g5.12xlarge (A10) instances, as most of our internal benchmarking was done on A100s.
You might find thread #82 helpful; they were also using a g5.12xlarge instance. See also https://github.com/mosaicml/llm-foundry/issues/143#issuecomment-1569346266
It is possible that this is a shared memory issue.
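For context, exit code -7 corresponds to the process being killed by signal 7 (SIGBUS), which inside Docker is most commonly caused by the default 64 MB /dev/shm being too small for PyTorch dataloader workers. One quick way to check from inside the container (a minimal sketch, not llm-foundry code):

import shutil

# Dataloader workers pass batches through /dev/shm; Docker's default allocation is only 64 MB.
total, used, free = shutil.disk_usage('/dev/shm')
print(f'/dev/shm total: {total / 2**30:.2f} GiB, free: {free / 2**30:.2f} GiB')

If the reported size is tiny, relaunching the container with a larger --shm-size (or with --ipc=host) usually resolves this class of crash.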
I have the same problem, but I am using an ml.g4dn.12xlarge (4x Tesla T4). I'm running it on AWS SageMaker inside Docker (the recommended Docker image), and I am using the mpt-7b_dolly_sft.yaml config.
Closing this issue as stale. We have not tested on T4s or SageMaker. Please open a new issue if you are still encountering problems.