ERROR:composer.cli.launcher:Rank 2 crashed with exit code -7
I am using a g5.12xlarge instance on AWS with 96 GB of GPU memory across 4 GPUs. I am attempting to finetune a model on a custom dataset. To do this, I created a custom preprocessing function in the /llmfoundry/data/finetuning/tasks.py script and slightly adjusted the 7b_dolly_sft.yaml file.
I am using the provided Docker image, and when I execute the command composer train/train.py train/yamls/mpt/finetune/7b_dolly_sft.yaml, the training script starts up, downloads the model, loads my dataset, and then errors out at "Building trainer..." with:
ERROR:composer.cli.launcher:Rank 2 crashed with exit code -7.
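For reference, my preprocessing function follows the same pattern as the existing functions in tasks.py: it takes one raw example dict and returns a dict with prompt and response keys. A minimal sketch (the column names instruction and output are placeholders for my dataset, not the real field names):

def custom_preprocessing_function(inp: dict) -> dict:
    # Map one raw example to the prompt/response format the finetuning dataloader expects.
    # 'instruction' and 'output' are placeholder column names, not the actual schema.
    return {
        'prompt': inp['instruction'] + '\n',
        'response': inp['output'],
    }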
Update: I just followed your finetuning example, but got the same error again:
root@9823knkjb:/llm-foundry/scripts/train# composer train.py finetune_example/gpt2-arc-easy.yaml
WARNING: device_microbatch_size > device_batch_size, will be reduced from 8 -> 2.
Initializing model...
cfg.n_params=1.24e+08
Building train loader...
Using pad_token, but it is not set yet.
Importing preprocessing function via: from finetune_example.preprocessing import multiple_choice
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-ef475c8714b1bfea/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Building eval loader...
Building trainer...
ERROR:composer.cli.launcher:Rank 0 crashed with exit code -7.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 0 (PID 92) exited with code -7
Global rank 1 (PID 93) exited with code -7
----------Begin global rank 1 STDOUT----------
WARNING: device_microbatch_size > device_batch_size, will be reduced from 8 -> 2.
Initializing model...
cfg.n_params=1.24e+08
Building train loader...
Importing preprocessing function via: from finetune_example.preprocessing import multiple_choice
Building eval loader...
Building trainer...
----------End global rank 1 STDOUT----------
----------Begin global rank 1 STDERR----------
Downloading (…)lve/main/config.json: 100%|██████████| 665/665 [00:00<00:00, 7.04MB/s]
Downloading (…)olve/main/vocab.json: 100%|██████████| 1.04M/1.04M [00:00<00:00, 80.5MB/s]
Downloading (…)olve/main/merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 140MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 1.36M/1.36M [00:00<00:00, 138MB/s]
Using pad_token, but it is not set yet.
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-ef475c8714b1bfea/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-ef475c8714b1bfea/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-904ebcd47a9c66c8.arrow
----------End global rank 1 STDERR----------
Global rank 2 (PID 94) exited with code -7
----------Begin global rank 2 STDOUT----------
WARNING: device_microbatch_size > device_batch_size, will be reduced from 8 -> 2.
Initializing model...
cfg.n_params=1.24e+08
Building train loader...
Importing preprocessing function via: from finetune_example.preprocessing import multiple_choice
Building eval loader...
Building trainer...
----------End global rank 2 STDOUT----------
----------Begin global rank 2 STDERR----------
Downloading pytorch_model.bin: 100%|██████████| 548M/548M [00:01<00:00, 486MB/s]
Using pad_token, but it is not set yet.
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-ef475c8714b1bfea/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-ef475c8714b1bfea/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-904ebcd47a9c66c8.arrow
----------End global rank 2 STDERR----------
Global rank 3 (PID 95) exited with code -7
----------Begin global rank 3 STDOUT----------
WARNING: device_microbatch_size > device_batch_size, will be reduced from 8 -> 2.
Initializing model...
cfg.n_params=1.24e+08
Building train loader...
Importing preprocessing function via: from finetune_example.preprocessing import multiple_choice
Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-ef475c8714b1bfea/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-ef475c8714b1bfea/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.
Building eval loader...
Building trainer...
----------End global rank 3 STDOUT----------
----------Begin global rank 3 STDERR----------
Downloading (…)neration_config.json: 100%|██████████| 124/124 [00:00<00:00, 1.12MB/s]
Using pad_token, but it is not set yet.
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 10305.42it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 927.12it/s]
Generating train split: 0 examples [00:00, ? examples/s]
Map: 0%| | 0/100 [00:00<?, ? examples/s]
----------End global rank 3 STDERR----------
ERROR:composer.cli.launcher:Global rank 0 (PID 92) exited with code -7
We have not battle-tested MPT-7B finetuning on g5.12xlarge (A10) instances, as most of our internal benchmarking was done on A100s.
You might find thread #82 helpful; they were also using a g5.12xlarge instance. See also https://github.com/mosaicml/llm-foundry/issues/143#issuecomment-1569346266
It is possible that this is a shared memory issue.
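For context, exit code -7 corresponds to the process being killed by signal 7 (SIGBUS), which inside Docker is most commonly caused by the default 64 MB /dev/shm being too small for PyTorch dataloader workers. One quick way to check from inside the container (a minimal sketch, not llm-foundry code):

import shutil

# Dataloader workers pass batches through /dev/shm; Docker's default allocation is only 64 MB.
total, used, free = shutil.disk_usage('/dev/shm')
print(f'/dev/shm total: {total / 2**30:.2f} GiB, free: {free / 2**30:.2f} GiB')

If the reported size is tiny, relaunching the container with a larger --shm-size (or with --ipc=host) usually resolves this class of crash.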
I have the same problem, but I am using an ml.g4dn.12xlarge (4x Tesla T4). I'm running it on AWS SageMaker inside Docker (the recommended Docker image), and I am using the mpt-7b_dolly_sft.yaml config.
Closing this issue as stale. We have not tested on T4s or SageMaker. Please open a new issue if you are still encountering problems.