codellama
Inference on multi-GPU
Tried to run:
torchrun --nproc_per_node 1 codellama/example_instructions.py \
--ckpt_dir /home/ubuntu/model/ \
--tokenizer_path /home/ubuntu/model/tokenizer.model \
--max_seq_len 4512 --max_batch_size 4
I have a long prompt (about 4000 tokens) and 4 NVIDIA A10G GPUs, each with 300 W and 24 GB VRAM. However, I see only one GPU being used (in nvidia-smi). The error I get is:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 302.00 MiB (GPU 0; 22.19 GiB total capacity; 21.65 GiB already allocated; 175.50 MiB free; 21.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 33763) of binary:
The full trace log is:
torchrun --nproc_per_node 1 codellama/example_instructions.py \
> --ckpt_dir /home/ubuntu/model/ \
> --tokenizer_path /home/ubuntu/model/tokenizer.model \
> --max_seq_len 4512 --max_batch_size 4
initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Loaded in 7.07 seconds
Traceback (most recent call last):
File "/home/ubuntu/codellama/example_instructions.py", line 114, in <module>
fire.Fire(main)
File "/home/ubuntu/venv/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/ubuntu/venv/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/ubuntu/venv/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/ubuntu/codellama/example_instructions.py", line 97, in main
results = generator.chat_completion(
File "/home/ubuntu/codellama/llama/generation.py", line 335, in chat_completion
generation_tokens, generation_logprobs = self.generate(
File "/home/ubuntu/venv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/codellama/llama/generation.py", line 148, in generate
logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
File "/home/ubuntu/venv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/codellama/llama/model.py", line 288, in forward
h = layer(h, start_pos, freqs_cis, mask)
File "/home/ubuntu/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/codellama/llama/model.py", line 240, in forward
h = x + self.attention.forward(
File "/home/ubuntu/codellama/llama/model.py", line 181, in forward
scores = F.softmax(scores.float(), dim=-1).type_as(xq)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 302.00 MiB (GPU 0; 22.19 GiB total capacity; 21.65 GiB already allocated; 175.50 MiB free; 21.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 33763) of binary: /home/ubuntu/venv/bin/python
Traceback (most recent call last):
File "/home/ubuntu/venv/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/ubuntu/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/ubuntu/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/ubuntu/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
codellama/example_instructions.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-08-25_04:41:10
host : ip-172-31-92-135.ec2.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 33763)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
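For what it's worth, the allocator hint at the end of the error message can be tried by setting PYTORCH_CUDA_ALLOC_CONF before the launch; the 128 MiB split size below is only an illustrative guess, and dropping --max_batch_size to 1 is shown because the example preallocates its KV cache per batch slot, so a smaller batch also shrinks memory use:

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 torchrun --nproc_per_node 1 codellama/example_instructions.py \
    --ckpt_dir /home/ubuntu/model/ \
    --tokenizer_path /home/ubuntu/model/tokenizer.model \
    --max_seq_len 4512 --max_batch_size 1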
Same issue. I also want to run inference on multiple GPUs, like the Llama-2 (HF) version allows.
Is the model at /home/ubuntu/model/ the 7B, 13B or 34B version? You may need to adjust the --nproc_per_node parameter to 1, 2 and 4 respectively. (It is stated here in case you missed it.)
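For concreteness, a sketch of what the launch would look like for the 13B checkpoint (the 13B checkpoint directory is a hypothetical path; the other arguments are carried over from the command above):

torchrun --nproc_per_node 2 codellama/example_instructions.py \
    --ckpt_dir /home/ubuntu/model-13b/ \
    --tokenizer_path /home/ubuntu/model-13b/tokenizer.model \
    --max_seq_len 4512 --max_batch_size 4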
Thanks for the reminder.
Actually, I had noticed that point, but I still have a question: how do I run the 13B model on 8 or more GPUs? I tried setting the --nproc_per_node parameter to 8 when running the 13B model, but it failed.
Good to hear!
IIRC it is not a quick fix to change the model parallel configuration, as the code expects the exact names and number of layers indicated in the model files. But if all you want to do is run inference with the 13B model on an 8-GPU system, maybe you could launch 4 processes, each taking 2 GPUs (using something like CUDA_VISIBLE_DEVICES to assign them), and split the inputs into 4 chunks of (almost) equal size. The throughput should be similar.
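A rough bash sketch of that idea for the 13B model on 8 GPUs (the ports and the 13B checkpoint path are made up for illustration; splitting the prompts into per-process chunks would still have to be handled inside the example script, since it has no flag for that):

# launch 4 independent MP=2 jobs, each pinned to its own pair of GPUs
for i in 0 1 2 3; do
  CUDA_VISIBLE_DEVICES=$((2*i)),$((2*i+1)) \
  torchrun --master_port $((29500 + i)) --nproc_per_node 2 codellama/example_instructions.py \
    --ckpt_dir /home/ubuntu/model-13b/ \
    --tokenizer_path /home/ubuntu/model-13b/tokenizer.model \
    --max_seq_len 4512 --max_batch_size 4 &
done
wait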
The problem is that one input (prompt) cannot be split into multiple chunks. The method you mention does not spread the GPU memory usage of a single generation across devices, while the Hugging Face version of Llama-2 can. When you generate with codellama-13B, 8 × NVIDIA 3090 (24 GB each) cannot be fully used within one generation, but Llama-2-13B-hf and StarChat-15B can use all GPUs for every generation.
I see. I'm afraid I am not familiar with that kind of setup, but there is already a HuggingFace version of Code Llama, so you may try running that instead and see if it fits your use case:
https://huggingface.co/docs/transformers/main/model_doc/code_llama
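For reference, a minimal sketch of multi-GPU inference through transformers, assuming the Hub id codellama/CodeLlama-13b-Instruct-hf and an environment with accelerate installed; device_map="auto" spreads the weight shards across all visible GPUs:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-13b-Instruct-hf"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # shard the layers across all visible GPUs
)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))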
Can you please guide me on how to run the 13B and 34B models on Windows? I have a single GPU and am therefore only able to run the 7B model, whose model parallel value is 1. The 13B model requires MP = 2, but I have only 1 GPU on which I want to run inference. What changes should I make to the code, and in which file, so that I can run the 13B model?
The model parallel size (MP) is fixed:
- 7b: 1
- 13b: 2
- 34b: 4
Sadly, unless you change the llama loading code, you have to set the number of GPUs (nproc_per_node) equal to the MP size. If you want to load, say, codellama-13b on 8 GPUs, you have to change the loading code in llama/generation.py. Here is a rough example of loading codellama-7b on 2 GPUs with the DeepSpeed framework:
import deepspeed
import torch

model = Transformer(model_args)
checkpoint = torch.load(ckpt_dir + '/consolidated.00.pth', map_location="cpu")
model.load_state_dict(checkpoint, strict=False)  # the weights still have to be loaded into the model
deepspeed_generator, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=checkpoint,
    config={"fp16": {"enabled": True}},
)
model = Llama(deepspeed_generator, tokenizer)
PS: If anyone finds an open-source framework for this, please share it with me. Thanks in advance. (╥﹏╥)
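As an aside, and only as a hedged sketch: DeepSpeed's inference mode (deepspeed.init_inference) supports tensor parallelism, which is closer to spreading one generation over several GPUs. Whether it correctly shards this custom Transformer without an injection_policy is untested, so treat the snippet below as a starting point rather than a working recipe. It assumes the model has already been built and loaded as above and that the script is launched with deepspeed --num_gpus 2:

import deepspeed
import torch

# ask DeepSpeed to shard the already-loaded model across 2 GPUs
ds_engine = deepspeed.init_inference(
    model,                              # Transformer built and loaded as above
    mp_size=2,                          # number of GPUs to split the weights across
    dtype=torch.half,
    replace_with_kernel_inject=False,   # kernel injection targets HF architectures
)
model = Llama(ds_engine.module, tokenizer)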
Could you please give an example of the full code in llama/generation.py? I tried to change the code, but I get the same memory allocation error. I want to use codellama-7b on 2 GPUs.