ModuleNotFoundError with Multi-node training using SLURM
I am trying to train models on multiple nodes with SLURM as the workload manager. The issue seems to be that the Python virtual environment is not available to all nodes. Please find more details below.
Job script:
#!/bin/bash
#SBATCH --time=10:00
#SBATCH --ntasks=2
#SBATCH --nodes=2
#SBATCH --cpus-per-task=48
#SBATCH --gres=gpu:4
#SBATCH --mem=0
export NPROC_PER_NODE=4
export OUTPUT_DIR=./output/
export NCCL_DEBUG=INFO
export HDF5_USE_FILE_LOCKING='FALSE'
export PARENT=`/bin/hostname -s`
export MPORT=13001
export CHILDREN=`scontrol show hostnames $SLURM_JOB_NODELIST | grep -v $PARENT`
export HOSTLIST="$PARENT $CHILDREN"
echo $HOSTLIST
export WORLD_SIZE=$SLURM_NTASKS
module load gcc arrow python/3.8.10 ffmpeg/4.3.2 cuda
source ~/venv/bin/activate
srun distributed_runner_ds.sh
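One check that might be worth adding just before the srun line (a sketch on my part; it assumes the home directory and the venv are shared across nodes):

# sanity check: one task per node, confirm which interpreter each node sees and whether the imports work
srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 bash -c \
  'hostname; which python; python -c "import pyarrow, transformers" && echo imports OK'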
Training script (distributed_runner_ds.sh):
#!/bin/bash
/bin/hostname -s
export NCCL_BLOCKING_WAIT=1
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0
# regenerates the hostfile on every run
function makehostfile() {
perl -e '$slots=split /,/, $ENV{"SLURM_STEP_GPUS"};
$slots=4 if $slots==0; # workaround 8 gpu machines
@nodes = split /\n/, qx[scontrol show hostnames $ENV{"SLURM_JOB_NODELIST"}];
print map { "$b$_ slots=$slots\n" } @nodes'
}
makehostfile > hostfile
deepspeed --num_gpus=$(($NPROC_PER_NODE * $SLURM_JOB_NUM_NODES)) --num_nodes=$SLURM_JOB_NUM_NODES --master_addr="$PARENT" --master_port="$MPORT" --hostfile hostfile train.py \
--model_name_or_path "EleutherAI/gpt-j-6b" \
--data_path mbzuai-distil/instruction \
--output_dir ./output/ \
--cache_dir ./cache \
--num_train_epochs 5 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 4 \
--gradient_checkpointing \
--report_to="none" \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 100 \
--deepspeed "ds_config2.json" \
--debugging True
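A side note on how this launcher gets invoked (my reading of the logs below, so treat it as an assumption): because the job script calls this script through srun with --ntasks=2, the deepspeed/pdsh launcher above is started once per node, which would explain why the pdsh command appears twice in the logs. Also, as far as I know --num_gpus is a per-node count, which may be why the world info below lists eight ranks per node even though each node has four GPUs. A minimal sketch of guarding the launch would be:

# sketch: only the first Slurm task drives the pdsh launcher; the second node is reached via the hostfile
if [[ "${SLURM_PROCID:-0}" -ne 0 ]]; then
    exit 0
fi
deepspeed --num_gpus=$NPROC_PER_NODE --num_nodes=$SLURM_JOB_NUM_NODES \
    --master_addr="$PARENT" --master_port="$MPORT" --hostfile hostfile \
    train.py --model_name_or_path "EleutherAI/gpt-j-6b" --output_dir ./output/ # ...remaining arguments as above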
Hostfile:
ng30905 slots=4
ng31103 slots=4
Logs:
nohup: ignoring input
ng30905 ng31103
ng30905
ng31103
Num of node, 2
Num of GPU per node, 4
PROCID: 0
LOCALID: 0
Num of node, 2
Num of GPU per node, 4
PROCID: 1
LOCALID: 0
[2023-05-03 10:46:21,278] [INFO] [multinode_runner.py:67:get_cmd] Running on the following workers: ng30905,ng31103
[2023-05-03 10:46:21,279] [INFO] [runner.py:550:main] cmd = pdsh -S -f 1024 -w ng30905,ng31103 export NCCL_BLOCKING_WAIT=1; export NCCL_IB_DISABLE=1; export PYTHONPATH=/lustre07/scratch/awaheed/InstructTuning:/cvmfs/soft.computecanada.ca/easybuild/python/site-packages:/home/awaheed/venv/lib/python3.8/site-packages:/home/awaheed/venv/lib/python3.8/site-packages:/cvmfs/soft.computecanada.ca/custom/python/site-packages; export NCCL_DEBUG=INFO; export NCCL_SOCKET_IFNAME=eth0; cd /lustre07/scratch/awaheed/InstructTuning; /home/awaheed/venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJuZzMwOTA1IjogWzAsIDEsIDIsIDMsIDQsIDUsIDYsIDddLCAibmczMTEwMyI6IFswLCAxLCAyLCAzLCA0LCA1LCA2LCA3XX0= --node_rank=%n --master_addr=ng30905 --master_port=29500 train.py --model_name_or_path 'EleutherAI/gpt-j-6b' --data_path 'mbzuai-distil/instruction' --output_dir './output/' --cache_dir './cache' --num_train_epochs '5' --per_device_train_batch_size '8' --per_device_eval_batch_size '8' --gradient_accumulation_steps '4' --gradient_checkpointing --report_to=none --evaluation_strategy 'no' --save_strategy 'steps' --save_steps '1000' --learning_rate '2e-5' --weight_decay '0.' --warmup_ratio '0.03' --lr_scheduler_type 'cosine' --logging_steps '100' --deepspeed 'ds_config2.json' --debugging 'True'
[2023-05-03 10:46:21,466] [INFO] [multinode_runner.py:67:get_cmd] Running on the following workers: ng30905,ng31103
[2023-05-03 10:46:21,467] [INFO] [runner.py:550:main] cmd = pdsh -S -f 1024 -w ng30905,ng31103 export NCCL_BLOCKING_WAIT=1; export NCCL_IB_DISABLE=1; export PYTHONPATH=/lustre07/scratch/awaheed/InstructTuning:/cvmfs/soft.computecanada.ca/easybuild/python/site-packages:/home/awaheed/venv/lib/python3.8/site-packages:/home/awaheed/venv/lib/python3.8/site-packages:/cvmfs/soft.computecanada.ca/custom/python/site-packages; export NCCL_DEBUG=INFO; export NCCL_SOCKET_IFNAME=eth0; cd /lustre07/scratch/awaheed/InstructTuning; /home/awaheed/venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJuZzMwOTA1IjogWzAsIDEsIDIsIDMsIDQsIDUsIDYsIDddLCAibmczMTEwMyI6IFswLCAxLCAyLCAzLCA0LCA1LCA2LCA3XX0= --node_rank=%n --master_addr=ng30905 --master_port=29500 train.py --model_name_or_path 'EleutherAI/gpt-j-6b' --data_path 'mbzuai-distil/instruction' --output_dir './output/' --cache_dir './cache' --num_train_epochs '5' --per_device_train_batch_size '8' --per_device_eval_batch_size '8' --gradient_accumulation_steps '4' --gradient_checkpointing --report_to=none --evaluation_strategy 'no' --save_strategy 'steps' --save_steps '1000' --learning_rate '2e-5' --weight_decay '0.' --warmup_ratio '0.03' --lr_scheduler_type 'cosine' --logging_steps '100' --deepspeed 'ds_config2.json' --debugging 'True'
ng30905: [2023-05-03 10:46:23,766] [INFO] [launch.py:135:main] 0 NCCL_BLOCKING_WAIT=1
ng30905: [2023-05-03 10:46:23,766] [INFO] [launch.py:135:main] 0 NCCL_IB_DISABLE=1
ng30905: [2023-05-03 10:46:23,766] [INFO] [launch.py:135:main] 0 NCCL_DEBUG=INFO
ng30905: [2023-05-03 10:46:23,766] [INFO] [launch.py:135:main] 0 NCCL_SOCKET_IFNAME=eth0
ng30905: [2023-05-03 10:46:23,766] [INFO] [launch.py:142:main] WORLD INFO DICT: {'ng30905': [0, 1, 2, 3, 4, 5, 6, 7], 'ng31103': [0, 1, 2, 3, 4, 5, 6, 7]}
ng30905: [2023-05-03 10:46:23,766] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=8, node_rank=0
ng30905: [2023-05-03 10:46:23,766] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'ng30905': [0, 1, 2, 3, 4, 5, 6, 7], 'ng31103': [8, 9, 10, 11, 12, 13, 14, 15]})
ng30905: [2023-05-03 10:46:23,766] [INFO] [launch.py:162:main] dist_world_size=16
ng30905: [2023-05-03 10:46:23,766] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ng31103: [2023-05-03 10:46:23,909] [INFO] [launch.py:135:main] 1 NCCL_BLOCKING_WAIT=1
ng31103: [2023-05-03 10:46:23,909] [INFO] [launch.py:135:main] 1 NCCL_IB_DISABLE=1
ng31103: [2023-05-03 10:46:23,909] [INFO] [launch.py:135:main] 1 NCCL_DEBUG=INFO
ng31103: [2023-05-03 10:46:23,909] [INFO] [launch.py:135:main] 1 NCCL_SOCKET_IFNAME=eth0
ng31103: [2023-05-03 10:46:23,909] [INFO] [launch.py:142:main] WORLD INFO DICT: {'ng30905': [0, 1, 2, 3, 4, 5, 6, 7], 'ng31103': [0, 1, 2, 3, 4, 5, 6, 7]}
ng31103: [2023-05-03 10:46:23,909] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=8, node_rank=1
ng31103: [2023-05-03 10:46:23,909] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'ng30905': [0, 1, 2, 3, 4, 5, 6, 7], 'ng31103': [8, 9, 10, 11, 12, 13, 14, 15]})
ng31103: [2023-05-03 10:46:23,909] [INFO] [launch.py:162:main] dist_world_size=16
ng31103: [2023-05-03 10:46:23,909] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ng30905: [2023-05-03 10:46:23,986] [INFO] [launch.py:135:main] 0 NCCL_BLOCKING_WAIT=1
ng30905: [2023-05-03 10:46:23,986] [INFO] [launch.py:135:main] 0 NCCL_IB_DISABLE=1
ng30905: [2023-05-03 10:46:23,986] [INFO] [launch.py:135:main] 0 NCCL_DEBUG=INFO
ng30905: [2023-05-03 10:46:23,986] [INFO] [launch.py:135:main] 0 NCCL_SOCKET_IFNAME=eth0
ng30905: [2023-05-03 10:46:23,986] [INFO] [launch.py:142:main] WORLD INFO DICT: {'ng30905': [0, 1, 2, 3, 4, 5, 6, 7], 'ng31103': [0, 1, 2, 3, 4, 5, 6, 7]}
ng30905: [2023-05-03 10:46:23,986] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=8, node_rank=0
ng30905: [2023-05-03 10:46:23,987] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'ng30905': [0, 1, 2, 3, 4, 5, 6, 7], 'ng31103': [8, 9, 10, 11, 12, 13, 14, 15]})
ng30905: [2023-05-03 10:46:23,987] [INFO] [launch.py:162:main] dist_world_size=16
ng30905: [2023-05-03 10:46:23,987] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ng31103: [2023-05-03 10:46:24,072] [INFO] [launch.py:135:main] 1 NCCL_BLOCKING_WAIT=1
ng31103: [2023-05-03 10:46:24,072] [INFO] [launch.py:135:main] 1 NCCL_IB_DISABLE=1
ng31103: [2023-05-03 10:46:24,072] [INFO] [launch.py:135:main] 1 NCCL_DEBUG=INFO
ng31103: [2023-05-03 10:46:24,072] [INFO] [launch.py:135:main] 1 NCCL_SOCKET_IFNAME=eth0
ng31103: [2023-05-03 10:46:24,072] [INFO] [launch.py:142:main] WORLD INFO DICT: {'ng30905': [0, 1, 2, 3, 4, 5, 6, 7], 'ng31103': [0, 1, 2, 3, 4, 5, 6, 7]}
ng31103: [2023-05-03 10:46:24,072] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=8, node_rank=1
ng31103: [2023-05-03 10:46:24,072] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'ng30905': [0, 1, 2, 3, 4, 5, 6, 7], 'ng31103': [8, 9, 10, 11, 12, 13, 14, 15]})
ng31103: [2023-05-03 10:46:24,072] [INFO] [launch.py:162:main] dist_world_size=16
ng31103: [2023-05-03 10:46:24,072] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ng30905: Traceback (most recent call last):
ng30905: File "/home/awaheed/venv/lib/python3.8/site-packages/transformers/utils/import_utils.py", line 1146, in _get_module
ng30905: return importlib.import_module("." + module_name, self.__name__)
ng30905: File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/importlib/__init__.py", line 127, in import_module
ng30905: return _bootstrap._gcd_import(name[level:], package, level)
ng30905: File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
ng30905: File "<frozen importlib._bootstrap>", line 991, in _find_and_load
ng30905: File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
ng30905: File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
ng30905: File "<frozen importlib._bootstrap_external>", line 848, in exec_module
ng30905: File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ng30905: File "/home/awaheed/venv/lib/python3.8/site-packages/transformers/trainer.py", line 176, in <module>
ng30905: import datasets
ng30905: File "/home/awaheed/venv/lib/python3.8/site-packages/datasets/__init__.py", line 24, in <module>
ng30905: import pyarrow
ng30905: ModuleNotFoundError: No module named 'pyarrow'
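For what it is worth (my reading of the traceback, so treat it as an assumption): transformers and datasets resolve from the venv, but pyarrow on this cluster usually comes from the arrow module, and the pdsh-spawned worker shells only receive the variables listed in the cmd line above. A quick way to test that is to try the same import over ssh, the way the pdsh launcher reaches the other node:

# sketch: from the first node, run the failing import on the second node over a bare ssh session
ssh ng31103 '/home/awaheed/venv/bin/python -c "import pyarrow; print(pyarrow.__file__)"'

If that fails the same way, installing pyarrow directly into the venv (pip install pyarrow with the venv activated) or forwarding the module's path to the workers would be the two obvious directions.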
More Details:
- I have tried running deepspeed with --launcher="SLURM" (mentioned here: #3419), with the same outcome.
- ds_report is fine.
- It works with only one node.
- Tried with an interactive job session with two nodes; same outcome.
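Another avenue that might be worth trying (based on DeepSpeed's multi-node documentation, which I have not verified on this cluster): the pdsh launcher reads a .deepspeed_env file of NAME=VALUE lines from the current directory or the home directory and exports those variables on every worker, so cluster-specific paths could be carried across nodes that way:

# .deepspeed_env (sketch; /path/to/arrow/site-packages is a placeholder, not a real path,
# and I am not sure whether this replaces or extends the PYTHONPATH the launcher already forwards)
PYTHONPATH=/path/to/arrow/site-packages
HDF5_USE_FILE_LOCKING=FALSE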
CC: @loadams @tjruwase @RezaYazdaniAminabadi @HeyangQin @jeffra @ShadenSmith @samyam @molly-smith @arashashari @arashb Help is much appreciated. Thanks.
I'm facing a similar situation. I tried to fine-tune ChatGLM (a Chinese LLM) with DeepSpeed inside Slurm, using only one node with 4 GPUs (sbatch --gpus=4 xxx.sh). DeepSpeed seems to invoke main() in the Python script 4 times concurrently, so I always get a FileNotFoundError when initializing the tokenizer and model from the cache files, which are indeed there. I believe this error is caused by the concurrent invocations conflicting with each other: when I set CUDA_VISIBLE_DEVICES to a single GPU, all the FileNotFoundErrors are gone (leaving only an OOM error), and when I go up to 2 GPUs, the FileNotFoundError appears intermittently, depending on whether the invocations collide.
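If it is the concurrent invocations colliding on the cache, one way to check that hypothesis outside Slurm might be to load the tokenizer from several processes at once and see whether the same FileNotFoundError reproduces (a sketch; '<your-model-path>' is a placeholder, and ChatGLM presumably needs trust_remote_code=True):

# sketch: reproduce the suspected cache race by loading the tokenizer in 4 concurrent processes
for i in 1 2 3 4; do
  python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('<your-model-path>', trust_remote_code=True)" &
done
wait   # if some of the 4 loads fail with FileNotFoundError, the race is confirmed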
And here is the environment:
- OS: CentOS 7
- Python: 3.9.7
- Transformers: 4.27.1
- PyTorch: 1.13.1+cu117
- deepspeed: 0.8.1
- CUDA Support: True
Here is the error I encountered:
Here is the slurm script:

For me, multi-GPU on a single node works fine. I get that error when I try to train on multiple nodes where not all the nodes have access to the correct virtual environment. @loadams @tjruwase @RezaYazdaniAminabadi @HeyangQin
Please help @jeffra @ShadenSmith @samyam @molly-smith @arashashari @arashb