With an RTX 4070 12 GB it is giving me a CUDA out of memory error.
I am trying to understand what I am doing wrong here.
Is it true that even the smallest llama2 model is 13 GB (llama-2-7b/consolidated.00.pth), and that is the reason it is not working on my 12 GB 4070 Nvidia GPU?
Is there any workaround?
Here is the error I am receiving.
idea@myidea:~/dhruvil/git/llama$ torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 4
initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Traceback (most recent call last):
File "/home/idea/dhruvil/git/llama/example_text_completion.py", line 55, in <module>
fire.Fire(main)
File "/home/idea/.local/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/idea/.local/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/idea/.local/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/idea/dhruvil/git/llama/example_text_completion.py", line 18, in main
generator = Llama.build(
File "/home/idea/dhruvil/git/llama/llama/generation.py", line 96, in build
model = Transformer(model_args)
File "/home/idea/dhruvil/git/llama/llama/model.py", line 259, in __init__
self.layers.append(TransformerBlock(layer_id, params))
File "/home/idea/dhruvil/git/llama/llama/model.py", line 222, in __init__
self.feed_forward = FeedForward(
File "/home/idea/dhruvil/git/llama/llama/model.py", line 207, in __init__
self.w3 = ColumnParallelLinear(
File "/home/idea/.local/lib/python3.10/site-packages/fairscale/nn/model_parallel/layers.py", line 262, in __init__
self.weight = Parameter(torch.Tensor(self.output_size_per_partition, self.in_features))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 11.72 GiB total capacity; 10.93 GiB already allocated; 59.19 MiB free; 10.95 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 330097) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/home/idea/.local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_text_completion.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-07-20_16:08:32
host : myidea
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 330097)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Yes, I think the minimum VRAM for 7B is 16 GB.
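As a rough back-of-envelope check (a minimal sketch, assuming the weights are loaded in half precision the way the reference code does it), the 7B weights alone already come to about 12.5 GiB, which matches the ~13 GB consolidated.00.pth file and leaves essentially nothing for the KV cache and activations on a 12 GB card:

```python
# Rough estimate of weight memory for llama-2-7b loaded in fp16/bf16.
# The ~6.74e9 parameter count is the published size of the 7B model.
params = 6.74e9          # approximate parameter count
bytes_per_param = 2      # half precision
weight_gib = params * bytes_per_param / 2**30
print(f"weights alone: ~{weight_gib:.1f} GiB")  # ~12.6 GiB before KV cache and activations
```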
I think it should work. I tried with a Ryzen 3600X, 32 GB RAM, and a 1070 Ti 8 GB, and it works.
Did you try with 3 GPUs together, or individually?
For mine, it doesn't work individually.
Can you tell me how you made it work on one GPU?
Individually.
I think I have not done anything different.
I used Ubuntu on Windows WSL:

1. I installed CUDA toolkit 11.7.
2. I installed the requirements, but I used a different torch package: `pip3 install numpy --pre torch --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu117`
3. And I tested `torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 4`
Interesting!!
I am doing the same thing, but it still gives me this error. I am not sure how to debug this further. I installed the same torch package as yours: `pip3 install numpy --pre torch --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu117`
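For reference, a quick way to confirm which torch build is active and how much memory the GPU reports (just a minimal sanity check, nothing llama-specific):

```python
# Minimal environment check: which torch build is installed and what the GPU reports.
import torch

print(torch.__version__)              # should show the cu117 nightly build
print(torch.cuda.is_available())      # True if the 4070 is visible from WSL/Ubuntu
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, round(props.total_memory / 2**30, 2), "GiB")
```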
@.**:~/dhruvil/git/llama$ torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 4
initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
/home/idea/.local/lib/python3.10/site-packages/torch/__init__.py:615: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:451.)
_C._set_default_tensor_type(t)
Traceback (most recent call last):
File "/home/idea/dhruvil/git/llama/example_chat_completion.py", line 73,
in
fire.Fire(main)
File "/home/idea/.local/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/idea/.local/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/idea/.local/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/idea/dhruvil/git/llama/example_chat_completion.py", line 20, in main
generator = Llama.build(
File "/home/idea/dhruvil/git/llama/llama/generation.py", line 96, in build
model = Transformer(model_args)
File "/home/idea/dhruvil/git/llama/llama/model.py", line 259, in init
self.layers.append(TransformerBlock(layer_id, params))
File "/home/idea/dhruvil/git/llama/llama/model.py", line 222, in init
self.feed_forward = FeedForward(
File "/home/idea/dhruvil/git/llama/llama/model.py", line 207, in init
self.w3 = ColumnParallelLinear(
File "/home/idea/.local/lib/python3.10/site-packages/fairscale/nn/model_parallel/layers.py", line 262, in init
self.weight = Parameter(torch.Tensor(self.output_size_per_partition,
self.in_features))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 11.72 GiB of which 93.19 MiB is free. Including non-PyTorch memory, this process has 11.43 GiB memory in use. Of the allocated memory 10.77 GiB is allocated by PyTorch, and 1.61 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-07-20 18:45:19,855] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 330867) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/home/idea/.local/bin/torchrun", line 8, in
sys.exit(main())
File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in main
run(args)
File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 788, in run
elastic_launch(
File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_chat_completion.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-07-20_18:45:19
host : myidea
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 330867)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
@.**:~/dhruvil/git/llama$
This is what my nvidia-smi output looks like.
I have a 4070 with 12 GB.
@.**:~/dhruvil/git/llama$ nvidia-smi
Thu Jul 20 18:47:14 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03 Driver Version: 530.41.03 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4070 Off| 00000000:04:00.0 Off | N/A |
| 0% 40C P8 2W / 200W| 197MiB / 12282MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1834 G /usr/lib/xorg/Xorg 155MiB |
| 0 N/A N/A 1978 G /usr/bin/gnome-shell 11MiB |
| 0 N/A N/A 52927 G ...76579054,1620300079093577791,262144 25MiB |
| 0 N/A N/A 178873 G gnome-control-center 2MiB |
+---------------------------------------------------------------------------------------+
@.**:~/dhruvil/git/llama$
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.01              Driver Version: 536.67      CUDA Version: 12.2       |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1070 Ti    On  | 00000000:07:00.0  On |                  N/A |
|  0%   43C    P5              10W / 180W|    373MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
I am not an expert on this, but maybe a higher CUDA core count requires more memory; just sharing my thoughts. My card is pretty old now.
If you want to try llama with a CPU installation, you can install https://github.com/krychu/llama instead of https://github.com/facebookresearch/llama.
Complete process to install:
- download the original version of llama from https://github.com/facebookresearch/llama and extract it to a `llama-main` folder
- download the CPU version from https://github.com/krychu/llama, extract it, and replace the files in the `llama-main` folder
- run the `download.sh` script in a terminal, passing the URL provided when prompted, to start the download
- go to the `llama-main` folder
- create a Python 3 env: `python3 -m venv env` and activate it: `source env/bin/activate`
- install the CPU version of pytorch: `python3 -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu` # for the CPU version
- install llama's dependencies: `python3 -m pip install -e .`
- run, if you have downloaded llama-2-7b (see the rough estimate below for why the smaller batch size helps):
  `torchrun --nproc_per_node 1 example_text_completion.py \
      --ckpt_dir llama-2-7b/ \
      --tokenizer_path tokenizer.model \
      --max_seq_len 128 --max_batch_size 1` # (instead of 4)
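In case it helps to see why lowering `max_batch_size` and `max_seq_len` matters, here is a rough, hypothetical estimate of the KV cache the 7B model preallocates, assuming the published 7B configuration (32 layers, 32 heads, head dim 128) and an fp16 cache:

```python
# Back-of-envelope KV-cache size for llama-2-7b (assumed config: 32 layers,
# 32 attention heads, head dim 128, fp16 cache). The cache is preallocated
# for max_batch_size * max_seq_len tokens, so both knobs scale it directly.
def kv_cache_gib(max_batch_size: int, max_seq_len: int,
                 n_layers: int = 32, n_heads: int = 32,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    per_token = 2 * n_layers * n_heads * head_dim * bytes_per_elem  # K and V
    return max_batch_size * max_seq_len * per_token / 2**30

print(kv_cache_gib(4, 128))   # ~0.25 GiB
print(kv_cache_gib(1, 128))   # ~0.06 GiB
```

The cache itself is small next to the ~12.5 GiB of half-precision weights, so on a 12 GB card the weights are the real blocker; that is why the CPU route above is the practical workaround here.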
I tried with an RTX 2060 8 GB and 64 GB RAM and it doesn't work. I am impressed that you were able to deploy it on a local PC.
@dhruvildarji Were you able to solve the issue? I am trying to run it on an RTX 4070 12 GB on Ubuntu and have the same issue.