codellama
Inference on multi-GPU
Tried to run:
torchrun --nproc_per_node 1 codellama/example_instructions.py \
--ckpt_dir /home/ubuntu/model/ \
--tokenizer_path /home/ubuntu/model/tokenizer.model \
--max_seq_len 4512 --max_batch_size 4
I have a long prompt (about 4000 tokens) and 4 NVIDIA A10G GPUs, each with 300 W and 24 GB VRAM. However, I see only one GPU being used (in nvidia-smi). The error I get is:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 302.00 MiB (GPU 0; 22.19 GiB total capacity; 21.65 GiB already allocated; 175.50 MiB free; 21.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 33763) of binary:
The full trace log is:
torchrun --nproc_per_node 1 codellama/example_instructions.py \
> --ckpt_dir /home/ubuntu/model/ \
> --tokenizer_path /home/ubuntu/model/tokenizer.model \
> --max_seq_len 4512 --max_batch_size 4
initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Loaded in 7.07 seconds
Traceback (most recent call last):
File "/home/ubuntu/codellama/example_instructions.py", line 114, in <module>
fire.Fire(main)
File "/home/ubuntu/venv/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/ubuntu/venv/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/ubuntu/venv/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/ubuntu/codellama/example_instructions.py", line 97, in main
results = generator.chat_completion(
File "/home/ubuntu/codellama/llama/generation.py", line 335, in chat_completion
generation_tokens, generation_logprobs = self.generate(
File "/home/ubuntu/venv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/codellama/llama/generation.py", line 148, in generate
logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
File "/home/ubuntu/venv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/codellama/llama/model.py", line 288, in forward
h = layer(h, start_pos, freqs_cis, mask)
File "/home/ubuntu/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/codellama/llama/model.py", line 240, in forward
h = x + self.attention.forward(
File "/home/ubuntu/codellama/llama/model.py", line 181, in forward
scores = F.softmax(scores.float(), dim=-1).type_as(xq)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 302.00 MiB (GPU 0; 22.19 GiB total capacity; 21.65 GiB already allocated; 175.50 MiB free; 21.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 33763) of binary: /home/ubuntu/venv/bin/python
Traceback (most recent call last):
File "/home/ubuntu/venv/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/ubuntu/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/ubuntu/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/ubuntu/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
codellama/example_instructions.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-08-25_04:41:10
host : ip-172-31-92-135.ec2.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 33763)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
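For what it's worth, the allocator hint at the end of the error message can be tried by setting PYTORCH_CUDA_ALLOC_CONF before the launch; the 128 MiB split size below is only an illustrative guess, and dropping --max_batch_size to 1 is shown because the example preallocates its KV cache per batch slot, so a smaller batch also shrinks memory use:

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 torchrun --nproc_per_node 1 codellama/example_instructions.py \
    --ckpt_dir /home/ubuntu/model/ \
    --tokenizer_path /home/ubuntu/model/tokenizer.model \
    --max_seq_len 4512 --max_batch_size 1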
Same issue. I also want to run inference on multiple GPUs, like the Llama-2 (HF) version allows.
Is the model at /home/ubuntu/model/ the 7B, 13B or 34B version? You may need to adjust the --nproc_per_node parameter to 1, 2 and 4 respectively. (It is stated here in case you missed it.)
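For concreteness, a sketch of what the launch would look like for the 13B checkpoint (the 13B checkpoint directory is a hypothetical path; the other arguments are carried over from the command above):

torchrun --nproc_per_node 2 codellama/example_instructions.py \
    --ckpt_dir /home/ubuntu/model-13b/ \
    --tokenizer_path /home/ubuntu/model-13b/tokenizer.model \
    --max_seq_len 4512 --max_batch_size 4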
Thanks for the reminder.
Actually, I had noticed that point, but I still have a question: how do I run the 13B model on 8 or more GPUs? I tried setting the --nproc_per_node parameter to 8 when running the 13B model, but it failed.
Good to hear!
IIRC it is not a quick fix to change the model parallel configuration, as the code expects the exact names and number of layers indicated in the model files. But if all you want to do is run inference with the 13B model on an 8-GPU system, maybe you could launch 4 processes, each taking 2 GPUs (using something like CUDA_VISIBLE_DEVICES to assign them), and split the inputs into 4 chunks of (almost) equal size. The throughput should be similar.
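A rough bash sketch of that idea for the 13B model on 8 GPUs (the ports and the 13B checkpoint path are made up for illustration; splitting the prompts into per-process chunks would still have to be handled inside the example script, since it has no flag for that):

# launch 4 independent MP=2 jobs, each pinned to its own pair of GPUs
for i in 0 1 2 3; do
  CUDA_VISIBLE_DEVICES=$((2*i)),$((2*i+1)) \
  torchrun --master_port $((29500 + i)) --nproc_per_node 2 codellama/example_instructions.py \
    --ckpt_dir /home/ubuntu/model-13b/ \
    --tokenizer_path /home/ubuntu/model-13b/tokenizer.model \
    --max_seq_len 4512 --max_batch_size 4 &
done
wait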
The problem is that one input (prompt) cannot be split into multiple chunks. The method you mention does not spread the GPU memory usage of a single generation across devices, while the Hugging Face version of Llama-2 can. When you generate with codellama-13B, 8 × NVIDIA 3090 (24 GB each) cannot be fully used within one generation, but Llama-2-13B-hf and StarChat-15B can use all GPUs for every generation.
I see. I'm afraid I am not familiar with that kind of setup, but there is already a HuggingFace version of Code Llama, so you may try running that instead and see if it fits your use case:
https://huggingface.co/docs/transformers/main/model_doc/code_llama
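For reference, a minimal sketch of multi-GPU inference through transformers, assuming the Hub id codellama/CodeLlama-13b-Instruct-hf and an environment with accelerate installed; device_map="auto" spreads the weight shards across all visible GPUs:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-13b-Instruct-hf"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # shard the layers across all visible GPUs
)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))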
Can you please guide me on how to run the 13B and 34B models on Windows? I have a single GPU and am therefore only able to run the 7B model, whose model parallel value is 1. The 13B model requires MP = 2, but I have only 1 GPU on which I want to run inference. What changes should I make to the code, and in which file, so that I can run the 13B model?
The model parallel size (MP) is fixed:
- 7b: 1
- 13b: 2
- 34b: 4
Sadly, unless you change the llama loading code, you have to set the number of GPUs (nproc_per_node) equal to the MP size. If you want to load, say, codellama-13b on 8 GPUs, you have to change the loading code in llama/generation.py. Here is a rough example of loading codellama-7b on 2 GPUs with the DeepSpeed framework:
import deepspeed
import torch

model = Transformer(model_args)
checkpoint = torch.load(ckpt_dir + '/consolidated.00.pth', map_location="cpu")
model.load_state_dict(checkpoint, strict=False)  # the weights still have to be loaded into the model
deepspeed_generator, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=checkpoint,
    config={"fp16": {"enabled": True}},
)
model = Llama(deepspeed_generator, tokenizer)
PS: If anyone finds an open-source framework for this, please share it with me. Thanks in advance. (╥﹏╥)
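As an aside, and only as a hedged sketch: DeepSpeed's inference mode (deepspeed.init_inference) supports tensor parallelism, which is closer to spreading one generation over several GPUs. Whether it correctly shards this custom Transformer without an injection_policy is untested, so treat the snippet below as a starting point rather than a working recipe. It assumes the model has already been built and loaded as above and that the script is launched with deepspeed --num_gpus 2:

import deepspeed
import torch

# ask DeepSpeed to shard the already-loaded model across 2 GPUs
ds_engine = deepspeed.init_inference(
    model,                              # Transformer built and loaded as above
    mp_size=2,                          # number of GPUs to split the weights across
    dtype=torch.half,
    replace_with_kernel_inject=False,   # kernel injection targets HF architectures
)
model = Llama(ds_engine.module, tokenizer)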
Could you please give an example of the full code in llama/generation.py? I tried to change the code, but I get the same memory allocation error. I want to use codellama-7b on 2 GPUs.