AssertionError: Loading a checkpoint for MP=2 but world size is 1
Hello, I'm trying to run llama-2-13b-chat with this command: $ torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir llama-2-13b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 4
and I get this error:
Traceback (most recent call last):
File "/home/yyf/llama/example_chat_completion.py", line 73, in
Thanks for any help!
For the 13b model you need to specify --nproc_per_node 2 as per the readme file. However it then fails for me, but differently:
RuntimeError: CUDA error: invalid device ordinal
It works fine with 7b model.
Thanks, it works!
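For anyone landing here with one GPU: the example scripts infer the model-parallel size from the checkpoint (MP=2 for 13B), so --nproc_per_node has to match it, and each of those processes then needs its own GPU. A minimal sanity check before launching (a sketch that only assumes PyTorch is installed; the helper name is illustrative, not from the llama repo):

# Sketch: compare visible GPUs with the number of ranks torchrun will start.
import torch

def can_launch(n_ranks: int) -> bool:
    """True if there is a GPU for every process torchrun will start."""
    n_gpus = torch.cuda.device_count()
    print(f"visible GPUs: {n_gpus}, requested --nproc_per_node: {n_ranks}")
    # Each rank calls torch.cuda.set_device(local_rank), so any rank with
    # local_rank >= n_gpus dies with "CUDA error: invalid device ordinal".
    return n_ranks <= n_gpus

can_launch(2)  # the 13B checkpoint is split into 2 shards, so it wants 2 ranks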
I have a single Nvidia 4090. Does this seriously mean I can't use the 13B model on a single GPU? I never had problems with a 13B Llama 1 model before on a single GPU. Using --nproc_per_node 2 or --nproc_per_node 1 results in the following.
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
File "/home/toor/experiments/llama/example_chat_completion.py", line 73, in <module>
fire.Fire(main)
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/toor/experiments/llama/example_chat_completion.py", line 20, in main
generator = Llama.build(
File "/home/toor/experiments/llama/llama/generation.py", line 80, in build
assert model_parallel_size == len(
AssertionError: Loading a checkpoint for MP=2 but world size is 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 10884) of binary: /home/toor/experiments/llama/env/bin/python3
Traceback (most recent call last):
File "/home/toor/experiments/llama/env/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_chat_completion.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-07-19_12:12:01
host : toor-jammy-jellifish
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 10884)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(env) toor@toor-jammy-jellifish:~/experiments/llama$ torchrun --nproc_per_node 2 example_chat_completion.py \
--ckpt_dir llama-2-13b-chat/ \
--tokenizer_path tokenizer.model \
--max_seq_len 512 --max_batch_size 4
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
> initializing model parallel with size 2
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
File "/home/toor/experiments/llama/example_chat_completion.py", line 73, in <module>
fire.Fire(main)
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/toor/experiments/llama/example_chat_completion.py", line 20, in main
generator = Llama.build(
File "/home/toor/experiments/llama/llama/generation.py", line 69, in build
torch.cuda.set_device(local_rank)
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 350, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 11184 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 11185) of binary: /home/toor/experiments/llama/env/bin/python3
Traceback (most recent call last):
File "/home/toor/experiments/llama/env/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_chat_completion.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-07-19_12:15:20
host : toor-jammy-jellifish
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 11185)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
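The two tracebacks above show both constraints at once: with --nproc_per_node 1, Llama.build (generation.py line 80) asserts that the world size matches the number of .pth shards found in --ckpt_dir, and with --nproc_per_node 2, rank 1 calls torch.cuda.set_device(1) and fails because the 4090 is the only device. A rough way to see how many shards, and therefore how many ranks and GPUs, a download expects (a sketch; the directory is the one from this thread):

# Sketch: count the model-parallel shards in a checkpoint directory. The stock
# example scripts require --nproc_per_node to equal this number.
from pathlib import Path

def expected_world_size(ckpt_dir: str) -> int:
    shards = sorted(Path(ckpt_dir).glob("*.pth"))
    print(f"{len(shards)} shard(s): {[p.name for p in shards]}")
    return len(shards)

expected_world_size("llama-2-13b-chat/")  # prints 2 for the 13B download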
I also think my GPU can handle 13B. How can I try this?
I am wondering this as well
Thanks for sharing the issue. I have a server with 2 GPUs and tested the example as per your suggestion; I can confirm that it works:
torchrun --nproc_per_node 2 example_text_completion.py \
--ckpt_dir llama-2-13b/ \
--tokenizer_path tokenizer.model \
--max_seq_len 256 --max_batch_size 4
> initializing model parallel with size 2
> initializing ddp with size 1
> initializing pipeline with size 1
Loaded in 10.12 seconds
I believe the meaning of life is
> to find the happiness that comes from within.
I believe that if you want to be happy, you have to make a decision. Whether you consciously make that decision or not, you will be happy with some things and unhappy with others.
I believe that you can’t be happy unless you decide
==================================
etc...
How do I change the model-parallel world size? I have a machine with 6 GPUs and I want to try running the 70B model, which requires 8 GPUs.
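You cannot set it lower than the shard count: the 70B download ships as 8 model-parallel shards, so the stock torchrun scripts want 8 processes with one GPU each. One route that does fit 6 GPUs is the Hugging Face port, where accelerate can spread the layers across whatever devices are visible. A hedged sketch, not a recipe from this thread (assumes transformers and accelerate are installed and that you have access to the gated meta-llama/Llama-2-70b-hf repo):

# Sketch: let accelerate shard the HF 70B weights across all visible GPUs.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",  # chat variant: meta-llama/Llama-2-70b-chat-hf
    torch_dtype=torch.float16,    # roughly 140 GB of fp16 weights in total
    device_map="auto",            # accelerate places layers on every visible GPU
)
# If 6 GPUs do not hold the fp16 weights, load_in_8bit=True (bitsandbytes) or
# an offload_folder for CPU/disk offload are common fallbacks.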
@androiddrew did you manage to run it on 1 GPU? I got the same error while trying it on Colab.
@kechan I switched to the Hugging Face 13B and was able to run it on a single 4090 with 8-bit quantization. It worked fine.
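Roughly, the steps look like this (a hedged sketch; it assumes transformers, accelerate and bitsandbytes are installed and that you have access to the gated meta-llama/Llama-2-13b-chat-hf repo, which is the standard HF id rather than something given in this thread):

# Sketch: load the HF 13B chat model in 8-bit on a single 24 GB GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # bitsandbytes int8 weights, roughly 13 GB, fits a 4090
    device_map="auto",
)

prompt = "I believe the meaning of life is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

As far as I know, the conversion to HF format merges the two model-parallel .pth shards into ordinary HF weight files, which is why the result is no longer tied to two GPUs.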
@hzhangxyz Hi, were you able to run it on your 6-gpu machine?
Does the Hugging Face model combine the two .pth files of the 13B model into one .pth file?
Are there any updates on this?
Can you mention the steps, @kechan?
Hit the same issue.