With an RTX 4070 12 GB it is giving me a CUDA out of memory error.
I am trying to understand what I am doing wrong here.
Is it true that even the smallest llama2 model is 13 GB (llama-2-7b/consolidated.00.pth), and that is the reason it is not working on my 12 GB 4070 Nvidia GPU?
Is there any workaround?
Here is the error I am receiving.
idea@myidea:~/dhruvil/git/llama$ torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 4
initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Traceback (most recent call last):
File "/home/idea/dhruvil/git/llama/example_text_completion.py", line 55, in <module>
fire.Fire(main)
File "/home/idea/.local/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/idea/.local/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/idea/.local/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/idea/dhruvil/git/llama/example_text_completion.py", line 18, in main
generator = Llama.build(
File "/home/idea/dhruvil/git/llama/llama/generation.py", line 96, in build
model = Transformer(model_args)
File "/home/idea/dhruvil/git/llama/llama/model.py", line 259, in __init__
self.layers.append(TransformerBlock(layer_id, params))
File "/home/idea/dhruvil/git/llama/llama/model.py", line 222, in __init__
self.feed_forward = FeedForward(
File "/home/idea/dhruvil/git/llama/llama/model.py", line 207, in __init__
self.w3 = ColumnParallelLinear(
File "/home/idea/.local/lib/python3.10/site-packages/fairscale/nn/model_parallel/layers.py", line 262, in __init__
self.weight = Parameter(torch.Tensor(self.output_size_per_partition, self.in_features))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 11.72 GiB total capacity; 10.93 GiB already allocated; 59.19 MiB free; 10.95 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 330097) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/home/idea/.local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_text_completion.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-07-20_16:08:32
host : myidea
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 330097)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Yes, I think the minimum VRAM for 7B is 16 GB.
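As a rough back-of-envelope check (a minimal sketch, assuming the weights are loaded in half precision the way the reference code does it), the 7B weights alone already come to about 12.5 GiB, which matches the ~13 GB consolidated.00.pth file and leaves essentially nothing for the KV cache and activations on a 12 GB card:

```python
# Rough estimate of weight memory for llama-2-7b loaded in fp16/bf16.
# The ~6.74e9 parameter count is the published size of the 7B model.
params = 6.74e9          # approximate parameter count
bytes_per_param = 2      # half precision
weight_gib = params * bytes_per_param / 2**30
print(f"weights alone: ~{weight_gib:.1f} GiB")  # ~12.6 GiB before KV cache and activations
```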
I think it should work. I tried with a Ryzen 3600X, 32 GB RAM, and a 1070 Ti 8 GB, and it works.
Did you try with 3 GPUs together, or individually?
For mine, it doesn't work individually.
Can you tell me how you made it work on one GPU?
Individually.
I think I have not done anything different.
I used Ubuntu on Windows WSL:

1. I installed CUDA toolkit 11.7.
2. I installed the requirements, but I used a different torch package: `pip3 install numpy --pre torch --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu117`
3. And I tested `torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 4`
Interesting!!
I am doing the same thing, but it still gives me this error. I am not sure how to debug this further. I installed the same torch package as yours: `pip3 install numpy --pre torch --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu117`
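For reference, a quick way to confirm which torch build is active and how much memory the GPU reports (just a minimal sanity check, nothing llama-specific):

```python
# Minimal environment check: which torch build is installed and what the GPU reports.
import torch

print(torch.__version__)              # should show the cu117 nightly build
print(torch.cuda.is_available())      # True if the 4070 is visible from WSL/Ubuntu
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, round(props.total_memory / 2**30, 2), "GiB")
```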
@.**:~/dhruvil/git/llama$ torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 4
initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
/home/idea/.local/lib/python3.10/site-packages/torch/__init__.py:615: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:451.)
_C._set_default_tensor_type(t)
Traceback (most recent call last):
File "/home/idea/dhruvil/git/llama/example_chat_completion.py", line 73,
in
fire.Fire(main)
File "/home/idea/.local/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/idea/.local/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/idea/.local/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/idea/dhruvil/git/llama/example_chat_completion.py", line 20, in main
generator = Llama.build(
File "/home/idea/dhruvil/git/llama/llama/generation.py", line 96, in build
model = Transformer(model_args)
File "/home/idea/dhruvil/git/llama/llama/model.py", line 259, in init
self.layers.append(TransformerBlock(layer_id, params))
File "/home/idea/dhruvil/git/llama/llama/model.py", line 222, in init
self.feed_forward = FeedForward(
File "/home/idea/dhruvil/git/llama/llama/model.py", line 207, in init
self.w3 = ColumnParallelLinear(
File "/home/idea/.local/lib/python3.10/site-packages/fairscale/nn/model_parallel/layers.py", line 262, in init
self.weight = Parameter(torch.Tensor(self.output_size_per_partition,
self.in_features))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 11.72 GiB of which 93.19 MiB is free. Including non-PyTorch memory, this process has 11.43 GiB memory in use. Of the allocated memory 10.77 GiB is allocated by PyTorch, and 1.61 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-07-20 18:45:19,855] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 330867) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/home/idea/.local/bin/torchrun", line 8, in
sys.exit(main())
File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in main
run(args)
File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 788, in run
elastic_launch(
File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/idea/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_chat_completion.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-07-20_18:45:19
host : myidea
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 330867)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
@.**:~/dhruvil/git/llama$
This is what my nvidia-smi output looks like.
I have a 4070 with 12 GB.
@.**:~/dhruvil/git/llama$ nvidia-smi
Thu Jul 20 18:47:14 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03 Driver Version: 530.41.03 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4070 Off| 00000000:04:00.0 Off | N/A |
| 0% 40C P8 2W / 200W| 197MiB / 12282MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1834 G /usr/lib/xorg/Xorg 155MiB |
| 0 N/A N/A 1978 G /usr/bin/gnome-shell 11MiB |
| 0 N/A N/A 52927 G ...76579054,1620300079093577791,262144 25MiB |
| 0 N/A N/A 178873 G gnome-control-center 2MiB |
+---------------------------------------------------------------------------------------+
@.**:~/dhruvil/git/llama$
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.01              Driver Version: 536.67      CUDA Version: 12.2       |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1070 Ti    On  | 00000000:07:00.0  On |                  N/A |
|  0%   43C    P5              10W / 180W|    373MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
I am not an expert on this, but maybe a higher CUDA core count requires more memory; just sharing my thoughts. My card is pretty old now.
If you want to try llama with a CPU installation, you can install https://github.com/krychu/llama instead of https://github.com/facebookresearch/llama.
Complete process to install:
- download the original version of llama from https://github.com/facebookresearch/llama and extract it to a `llama-main` folder
- download the CPU version from https://github.com/krychu/llama, extract it, and replace the files in the `llama-main` folder
- run the `download.sh` script in a terminal, passing the URL provided when prompted, to start the download
- go to the `llama-main` folder
- create a Python 3 env: `python3 -m venv env` and activate it: `source env/bin/activate`
- install the CPU version of pytorch: `python3 -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu` # for the CPU version
- install llama's dependencies: `python3 -m pip install -e .`
- run, if you have downloaded llama-2-7b (see the rough estimate below for why the smaller batch size helps):
  `torchrun --nproc_per_node 1 example_text_completion.py \
      --ckpt_dir llama-2-7b/ \
      --tokenizer_path tokenizer.model \
      --max_seq_len 128 --max_batch_size 1` # (instead of 4)
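In case it helps to see why lowering `max_batch_size` and `max_seq_len` matters, here is a rough, hypothetical estimate of the KV cache the 7B model preallocates, assuming the published 7B configuration (32 layers, 32 heads, head dim 128) and an fp16 cache:

```python
# Back-of-envelope KV-cache size for llama-2-7b (assumed config: 32 layers,
# 32 attention heads, head dim 128, fp16 cache). The cache is preallocated
# for max_batch_size * max_seq_len tokens, so both knobs scale it directly.
def kv_cache_gib(max_batch_size: int, max_seq_len: int,
                 n_layers: int = 32, n_heads: int = 32,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    per_token = 2 * n_layers * n_heads * head_dim * bytes_per_elem  # K and V
    return max_batch_size * max_seq_len * per_token / 2**30

print(kv_cache_gib(4, 128))   # ~0.25 GiB
print(kv_cache_gib(1, 128))   # ~0.06 GiB
```

The cache itself is small next to the ~12.5 GiB of half-precision weights, so on a 12 GB card the weights are the real blocker; that is why the CPU route above is the practical workaround here.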
I tried with an RTX 2060 8 GB and 64 GB RAM and it doesn't work. I am impressed that you were able to deploy it on a local PC.
@dhruvildarji Were you able to solve the issue? I am trying to run it on an RTX 4070 12 GB on Ubuntu and have the same issue.