AssertionError: Loading a checkpoint for MP=2 but world size is 1
Hello, I'm trying to run llama-2-13b-chat with this command: $ torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir llama-2-13b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 4
and I get this error:
Traceback (most recent call last):
File "/home/yyf/llama/example_chat_completion.py", line 73, in
Thanks for any help!
For the 13b model you need to specify --nproc_per_node 2 as per the readme file. However it then fails for me, but differently:
RuntimeError: CUDA error: invalid device ordinal
It works fine with 7b model.
Thanks, it works!
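For anyone landing here with one GPU: the example scripts infer the model-parallel size from the checkpoint (MP=2 for 13B), so --nproc_per_node has to match it, and each of those processes then needs its own GPU. A minimal sanity check before launching (a sketch that only assumes PyTorch is installed; the helper name is illustrative, not from the llama repo):

# Sketch: compare visible GPUs with the number of ranks torchrun will start.
import torch

def can_launch(n_ranks: int) -> bool:
    """True if there is a GPU for every process torchrun will start."""
    n_gpus = torch.cuda.device_count()
    print(f"visible GPUs: {n_gpus}, requested --nproc_per_node: {n_ranks}")
    # Each rank calls torch.cuda.set_device(local_rank), so any rank with
    # local_rank >= n_gpus dies with "CUDA error: invalid device ordinal".
    return n_ranks <= n_gpus

can_launch(2)  # the 13B checkpoint is split into 2 shards, so it wants 2 ranks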
I have a single Nvidia 4090. Does this seriously mean I can't use the 13B model on a single GPU? I never had problems with a 13B Llama 1 model before on a single GPU. Using --nproc_per_node 2 or --nproc_per_node 1 results in the following.
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
File "/home/toor/experiments/llama/example_chat_completion.py", line 73, in <module>
fire.Fire(main)
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/toor/experiments/llama/example_chat_completion.py", line 20, in main
generator = Llama.build(
File "/home/toor/experiments/llama/llama/generation.py", line 80, in build
assert model_parallel_size == len(
AssertionError: Loading a checkpoint for MP=2 but world size is 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 10884) of binary: /home/toor/experiments/llama/env/bin/python3
Traceback (most recent call last):
File "/home/toor/experiments/llama/env/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_chat_completion.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-07-19_12:12:01
host : toor-jammy-jellifish
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 10884)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(env) toor@toor-jammy-jellifish:~/experiments/llama$ torchrun --nproc_per_node 2 example_chat_completion.py \
--ckpt_dir llama-2-13b-chat/ \
--tokenizer_path tokenizer.model \
--max_seq_len 512 --max_batch_size 4
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
> initializing model parallel with size 2
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
File "/home/toor/experiments/llama/example_chat_completion.py", line 73, in <module>
fire.Fire(main)
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/toor/experiments/llama/example_chat_completion.py", line 20, in main
generator = Llama.build(
File "/home/toor/experiments/llama/llama/generation.py", line 69, in build
torch.cuda.set_device(local_rank)
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 350, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 11184 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 11185) of binary: /home/toor/experiments/llama/env/bin/python3
Traceback (most recent call last):
File "/home/toor/experiments/llama/env/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/toor/experiments/llama/env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_chat_completion.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-07-19_12:15:20
host : toor-jammy-jellifish
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 11185)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
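The two tracebacks above show both constraints at once: with --nproc_per_node 1, Llama.build (generation.py line 80) asserts that the world size matches the number of .pth shards found in --ckpt_dir, and with --nproc_per_node 2, rank 1 calls torch.cuda.set_device(1) and fails because the 4090 is the only device. A rough way to see how many shards, and therefore how many ranks and GPUs, a download expects (a sketch; the directory is the one from this thread):

# Sketch: count the model-parallel shards in a checkpoint directory. The stock
# example scripts require --nproc_per_node to equal this number.
from pathlib import Path

def expected_world_size(ckpt_dir: str) -> int:
    shards = sorted(Path(ckpt_dir).glob("*.pth"))
    print(f"{len(shards)} shard(s): {[p.name for p in shards]}")
    return len(shards)

expected_world_size("llama-2-13b-chat/")  # prints 2 for the 13B download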
I also think my GPU can handle 13B. How can I try this?
I am wondering this as well
Thanks for sharing the issue. I have a server with 2 GPUs and tested the example as per your suggestion; I can confirm that it works:
torchrun --nproc_per_node 2 example_text_completion.py \
--ckpt_dir llama-2-13b/ \
--tokenizer_path tokenizer.model \
--max_seq_len 256 --max_batch_size 4
> initializing model parallel with size 2
> initializing ddp with size 1
> initializing pipeline with size 1
Loaded in 10.12 seconds
I believe the meaning of life is
> to find the happiness that comes from within.
I believe that if you want to be happy, you have to make a decision. Whether you consciously make that decision or not, you will be happy with some things and unhappy with others.
I believe that you can’t be happy unless you decide
==================================
etc...
How do I change the model-parallel world size? I have a machine with 6 GPUs and I want to try running the 70B model, which requires 8 GPUs.
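You cannot set it lower than the shard count: the 70B download ships as 8 model-parallel shards, so the stock torchrun scripts want 8 processes with one GPU each. One route that does fit 6 GPUs is the Hugging Face port, where accelerate can spread the layers across whatever devices are visible. A hedged sketch, not a recipe from this thread (assumes transformers and accelerate are installed and that you have access to the gated meta-llama/Llama-2-70b-hf repo):

# Sketch: let accelerate shard the HF 70B weights across all visible GPUs.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",  # chat variant: meta-llama/Llama-2-70b-chat-hf
    torch_dtype=torch.float16,    # roughly 140 GB of fp16 weights in total
    device_map="auto",            # accelerate places layers on every visible GPU
)
# If 6 GPUs do not hold the fp16 weights, load_in_8bit=True (bitsandbytes) or
# an offload_folder for CPU/disk offload are common fallbacks.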
@androiddrew did you manage to run it on 1 GPU? I got the same error while trying it on Colab.
@kechan I switched to the Hugging Face 13B and was able to run it on a single 4090 with 8-bit quantization. It worked fine.
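Roughly, the steps look like this (a hedged sketch; it assumes transformers, accelerate and bitsandbytes are installed and that you have access to the gated meta-llama/Llama-2-13b-chat-hf repo, which is the standard HF id rather than something given in this thread):

# Sketch: load the HF 13B chat model in 8-bit on a single 24 GB GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # bitsandbytes int8 weights, roughly 13 GB, fits a 4090
    device_map="auto",
)

prompt = "I believe the meaning of life is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

As far as I know, the conversion to HF format merges the two model-parallel .pth shards into ordinary HF weight files, which is why the result is no longer tied to two GPUs.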
@hzhangxyz Hi, were you able to run it on your 6-gpu machine?
Does the Hugging Face model combine the two .pth files of the 13B model into one .pth file?
Are there any updates on this?
Can you mention the steps, @kechan?
Hit the same issue.