[BUG] terminate called after throwing an instance of 'std::bad_alloc'
Describe the bug
When I run RLHF code with trlx using DeepSpeed across two nodes, I hit a strange error: "terminate called after throwing an instance of 'std::bad_alloc'". Neither system memory nor GPU memory is anywhere near exhausted. Running on a single machine works fine, but the error occurs as soon as two nodes are used. The problem only appears when I run inside a Docker container; without a container it does not occur. In addition, I use an Anaconda environment.
ds_report output
(trlx_env) root@9a3cd98dd64f:/data/work/trlx_rlhf/sft# deepspeed --hostfile=../../hostfile train_gptj_summarize.py
[2023-04-03 10:49:33,397] [INFO] [runner.py:454:main] Using IP address of 10.0.128.5 for node localhost
[2023-04-03 10:49:33,398] [INFO] [multinode_runner.py:65:get_cmd] Running on the following workers: localhost,deepspeed-18
[2023-04-03 10:49:33,398] [INFO] [runner.py:548:main] cmd = pdsh -S -f 1024 -w localhost,deepspeed-18 export PYTHONPATH=/data/work/trlx_rlhf/sft; cd /data/work/trlx_rlhf/sft; /root/mambaforge/envs/trlx_env/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF0sICJkZWVwc3BlZWQtMTgiOiBbMF19 --node_rank=%n --master_addr=10.0.128.5 --master_port=29500 train_gptj_summarize.py
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0], 'deepspeed-18': [0]}
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=1, node_rank=1
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0], 'deepspeed-18': [1]})
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:162:main] dist_world_size=2
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0], 'deepspeed-18': [0]}
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=1, node_rank=0
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0], 'deepspeed-18': [1]})
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:162:main] dist_world_size=2
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
deepspeed-18: Tokenizer loaded!
localhost: Tokenizer loaded!
deepspeed-18: Model loaded!
deepspeed-18: Downloading and preparing dataset parquet/openai_summarize_tldr to /root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-bed27f7b4c8f201f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...
localhost: Model loaded!
localhost: Found cached dataset parquet (/root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-0275f923823d6c0b/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
localhost: Found cached dataset parquet (/root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-0275f923823d6c0b/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
localhost: Dataset loaded!
localhost: [2023-04-03 10:50:46,311] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Downloading data files: 100%|██████████| 3/3 [00:00<00:00, 10941.66it/s]
Extracting data files: 100%|██████████| 3/3 [00:00<00:00, 1896.44it/s]
deepspeed-18: Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-bed27f7b4c8f201f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.
Found cached dataset parquet (/root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-bed27f7b4c8f201f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
deepspeed-18: Dataset loaded!
Downloading builder script: 100%|██████████| 6.27k/6.27k [00:00<00:00, 26.9kB/s]
deepspeed-18: terminate called after throwing an instance of 'std::bad_alloc'
deepspeed-18: what(): std::bad_alloc
deepspeed-18: [2023-04-03 10:51:15,307] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1231493
deepspeed-18: [2023-04-03 10:51:15,308] [ERROR] [launch.py:324:sigkill_handler] ['/root/mambaforge/envs/trlx_env/bin/python', '-u', 'train_gptj_summarize.py', '--local_rank=0'] exits with return code = -6
pdsh@9a3cd98dd64f: deepspeed-18: ssh exited with exit code 250
Hostfile
localhost slots=1
deepspeed-18 slots=1
Launcher context
deepspeed --hostfile=../../hostfile train_gptj_summarize.py
Docker context
The problem occurs when I run inside a Docker container, but not when I run without a container.
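For reference, here is a rough sketch of the kind of container launch such a two-node run implies. The image name, mount path, and shared-memory size below are illustrative assumptions, not the actual values from this report; multi-node NCCL inside containers generally needs the nodes (and the chosen master_port) to be reachable from each other and enough shared memory inside the container.

# Hypothetical container launch on each node; image name, mount, and sizes are placeholders.
# --network host lets pdsh/ssh and NCCL reach the other node and the master_port directly.
# --shm-size raises /dev/shm for NCCL and the dataloader; --ipc host is an alternative that
# shares the host's /dev/shm instead of allocating a separate one.
docker run -d --gpus all \
    --network host \
    --shm-size 16g \
    -v /data/work:/data/work \
    trlx_image sleep infinity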
Please see the recent DeepSpeed Chat release #3186: https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat
@shisi-cc, did the link above help? Can this issue be closed? Thanks!
Hi, I have encountered the same issue. I created a Docker container on each of two machines and ran DeepSpeed-Chat/training/step1_supervised_finetuning/muti_node/run_66b.sh, and hit the same error.
hostfile
node1 slots=8
node2 slots=8
ds_report_output
[2023-04-26 06:34:22,975] [INFO] [multinode_runner.py:70:get_cmd] Running on the following workers: node1,node2
[2023-04-26 06:34:22,975] [INFO] [runner.py:540:main] cmd = pdsh -S -f 1024 -w node1,node2 export NCCL_VERSION=2.12.10-1; export PYTHONPATH=/workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning; cd /workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning; /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJub2RlMSI6IFszLCA1XSwgIm5vZGUyIjogWzAsIDFdfQ== --node_rank=%n --master_addr=10.176.50.36 --master_port=32783 main.py --data_path 'Dahoas/rm-static' --data_split '2,4,4' --model_name_or_path '/workspace/models/opt-1.3b' --per_device_train_batch_size '1' --per_device_eval_batch_size '1' --max_seq_len '512' --learning_rate '9.65e-6' --weight_decay '0.1' --num_train_epochs '2' --gradient_accumulation_steps '1' --lr_scheduler_type 'cosine' --num_warmup_steps '0' --seed '1234' --zero_stage '3' --deepspeed --output_dir './output'
node1: [2023-04-26 06:34:26,299] [INFO] [launch.py:222:main] 0 NCCL_VERSION=2.12.10-1
node1: [2023-04-26 06:34:26,299] [INFO] [launch.py:229:main] WORLD INFO DICT: {'node1': [3, 5], 'node2': [0, 1]}
node1: [2023-04-26 06:34:26,299] [INFO] [launch.py:235:main] nnodes=2, num_local_procs=2, node_rank=0
node1: [2023-04-26 06:34:26,299] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'node1': [0, 1], 'node2': [2, 3]})
node1: [2023-04-26 06:34:26,299] [INFO] [launch.py:247:main] dist_world_size=4
node1: [2023-04-26 06:34:26,299] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=3,5
node2: [2023-04-26 06:34:28,198] [INFO] [launch.py:222:main] 1 NCCL_VERSION=2.12.10-1
node2: [2023-04-26 06:34:28,198] [INFO] [launch.py:229:main] WORLD INFO DICT: {'node1': [3, 5], 'node2': [0, 1]}
node2: [2023-04-26 06:34:28,198] [INFO] [launch.py:235:main] nnodes=2, num_local_procs=2, node_rank=1
node2: [2023-04-26 06:34:28,198] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'node1': [0, 1], 'node2': [2, 3]})
node2: [2023-04-26 06:34:28,199] [INFO] [launch.py:247:main] dist_world_size=4
node2: [2023-04-26 06:34:28,199] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1
node1: [2023-04-26 06:34:30,128] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
node1: Traceback (most recent call last):
node1: File "/workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 343, in
Launcher context
The container's shared-memory size (ShmSize) is 10G.
deepspeed --hostfile=hostfile \
    --master_port xxx --master_addr xxx \
    main.py ....
Both of my nodes can communicate with each other, and they are running inside docker containers. Have you found a solution to this issue yet?
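For anyone comparing setups, here is a minimal sketch of pre-launch checks run from inside each container. The interface name and the values shown are assumptions for illustration, not taken from this thread:

# Hypothetical sanity checks before launching; adjust names and values to your environment.
df -h /dev/shm                    # confirm the shared-memory mount really has the configured size (e.g. 10G)
ulimit -l                         # locked-memory limit inside the container
export NCCL_DEBUG=INFO            # make NCCL log transport/shared-memory errors in detail
export NCCL_SOCKET_IFNAME=eth0    # interface name is an assumption; use one reachable between the nodes
deepspeed --hostfile=hostfile --master_port xxx --master_addr xxx main.py ....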
Any updates? I encountered the same problem when fine-tuning Whisper with DeepSpeed on multiple nodes.