FasterTransformer
error with mpirun
Branch/Tag/Commit: 9b6d718b52f10f08a810c0885e070789e462102b
Docker Image Version: nvcr.io/nvidia/pytorch:22.09-py3
GPU name: V100
CUDA Driver: Driver Version: 510.73.08
Reproduced Steps
1. I use this script to convert my model (run twice, for 4-way and 8-way tensor parallelism):
python ../examples/pytorch/gpt/utils/huggingface_opt_convert.py \
-i opt-6.7b/ \
-o opt-6.7b/c-model/ \
-i_g 4 \
-processes 8 \
-weight_data_type fp16
python ../examples/pytorch/gpt/utils/huggingface_opt_convert.py \
-i opt-6.7b/ \
-o opt-6.7b/c-model/ \
-i_g 8 \
-processes 8 \
-weight_data_type fp16
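For intuition, the -i_g flag sets how many tensor-parallel shards each weight matrix is split into (one shard per GPU). A minimal sketch of such a column-wise split, as an illustration only and not the converter's actual code:

```python
import numpy as np

def split_for_tensor_parallel(weight, n_shards, axis):
    """Split one weight matrix into equal shards, one per tensor-parallel rank."""
    assert weight.shape[axis] % n_shards == 0, "dimension must divide evenly by shard count"
    return np.split(weight, n_shards, axis=axis)

# e.g. a 4096 x 16384 FFN weight split column-wise across 4 GPUs
shards = split_for_tensor_parallel(np.zeros((4096, 16384), dtype=np.float16), 4, axis=1)
print([s.shape for s in shards])
```

Which -i_g value you convert with must match the tensor_para_size you later run with, since each rank loads only its own shard.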
2. Then I run the multi-GPU example with mpirun:
mpirun -n 4 --allow-run-as-root python3 ../examples/pytorch/gpt/multi_gpu_gpt_example.py \
--tensor_para_size 2 \
--pipeline_para_size 2 \
--layer_num 32 \
--input_len 32 \
--head_num 32 \
--size_per_head 128 \
--weights_data_type "fp16" \
--max_seq_len 2048 \
--vocab_size 50272 \
--vocab_file ../models/gpt2-vocab.json \
--merges_file ../models/gpt2-merges.txt \
--ckpt_path="/home/aiscuser/FasterTransformer/build/opt-6.7b/c-model/4-gpu" > 4gpu.log 2>&1
3. The error is:
=================================================
Initializing tensor and pipeline parallel...
Traceback (most recent call last):
  File "../examples/pytorch/gpt/multi_gpu_gpt_example.py", line 364, in <module>
    main()
  File "/home/aiscuser/.local/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "../examples/pytorch/gpt/multi_gpu_gpt_example.py", line 219, in main
    comm.initialize_model_parallel(args.tensor_para_size, args.pipeline_para_size)
  File "/home/aiscuser/FasterTransformer/examples/pytorch/gpt/../../../examples/pytorch/gpt/utils/comm.py", line 86, in initialize_model_parallel
    dist.init_process_group(backend=backend)
  File "/home/aiscuser/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 744, in init_process_group
    default_pg = _new_process_group_helper(
  File "/home/aiscuser/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 837, in _new_process_group_helper
    raise RuntimeError(
RuntimeError: Distributed package doesn't have MPI built in. MPI is only included if you build PyTorch from source on a host that has MPI installed.
[the same traceback is printed by each of the other three ranks]
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[10072,1],0]
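Note that the traceback loads torch from /home/aiscuser/.local/lib/python3.8/site-packages, so a user-site install may be shadowing the MPI-enabled PyTorch that ships in the NGC container. A small diagnostic sketch, using torch.distributed's standard availability helpers, that reports where torch comes from and which backends it was built with:

```python
def distributed_backends():
    """Return where torch is loaded from and which distributed backends it has."""
    try:
        import torch
        import torch.distributed as dist
        return {
            "torch_path": torch.__file__,
            "mpi": dist.is_mpi_available(),
            "nccl": dist.is_nccl_available(),
            "gloo": dist.is_gloo_available(),
        }
    except ImportError:
        return {"torch_path": None}

print(distributed_backends())
```

If "mpi" comes back False while torch_path points into ~/.local, uninstalling the user-site torch (or launching with the container's interpreter) would be the first thing to try.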
I used the official image to build it.
Can you run any MPI program inside and outside the Docker container?
Thanks, I fixed the problem.
I encountered a similar problem; how did you fix it? RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
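That follow-up error is what appears when init_process_group never succeeded but a later collective still runs. One workaround sketch (not FasterTransformer's own code) when the installed torch lacks the MPI backend: translate Open MPI's launch environment variables into the RANK/WORLD_SIZE names that torch.distributed's env:// initialization reads, then initialize with the NCCL or Gloo backend instead. The helper name and defaults below are hypothetical.

```python
import os

# Hypothetical helper: map Open MPI's per-process launch variables to the
# names torch.distributed's env:// init method (used by nccl/gloo) expects,
# so processes launched by mpirun can still form a process group.
OMPI_TO_TORCH = {
    "OMPI_COMM_WORLD_RANK": "RANK",
    "OMPI_COMM_WORLD_SIZE": "WORLD_SIZE",
    "OMPI_COMM_WORLD_LOCAL_RANK": "LOCAL_RANK",
}

def ompi_to_torch_env(env):
    """Return a copy of env with torch's variables filled in from Open MPI's."""
    out = dict(env)
    for src, dst in OMPI_TO_TORCH.items():
        if src in env and dst not in out:
            out[dst] = env[src]
    # A rendezvous address is still required; rank 0's host in a real job.
    out.setdefault("MASTER_ADDR", "127.0.0.1")
    out.setdefault("MASTER_PORT", "29500")
    return out

# After exporting these variables, dist.init_process_group(backend="nccl")
# can be called in place of the unavailable MPI backend.
print(ompi_to_torch_env({"OMPI_COMM_WORLD_RANK": "3", "OMPI_COMM_WORLD_SIZE": "4"}))
```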
@lambda7xx how did you fix it?