I'm trying to run text-generation-webui on my computer. I'm pretty limited with 8 GB of RAM, but I have an RTX 3060 Ti to run it on. When running 7B without quantization it loads most of the way before running out of memory, and when running it in 4-bit mode I always get an error saying it is running out of RAM:
```
(textgen) PS C:\Users\Peter\llama\text-generation-webui> python server.py --auto-devices --gptq-bits 4
The following models are available:
- llama-7b
- opt-350m
Which one do you want to load? 1-2
1
Loading llama-7b...
Traceback (most recent call last):
  File "C:\Users\Peter\llama\text-generation-webui\server.py", line 241, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "C:\Users\Peter\llama\text-generation-webui\modules\models.py", line 101, in load_model
    model = load_quantized(model_name)
  File "C:\Users\Peter\llama\text-generation-webui\modules\GPTQ_loader.py", line 64, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, shared.args.gptq_pre_layer)
  File "C:\Users\Peter\llama\text-generation-webui\repositories\GPTQ-for-LLaMa\llama.py", line 232, in load_quant
    model = LlamaForCausalLM(config)
  File "C:\Users\Peter\miniconda3\envs\textgen\lib\site-packages\transformers\models\llama\modeling_llama.py", line 652, in __init__
    self.model = LlamaModel(config)
  File "C:\Users\Peter\miniconda3\envs\textgen\lib\site-packages\transformers\models\llama\modeling_llama.py", line 457, in __init__
    self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
  File "C:\Users\Peter\miniconda3\envs\textgen\lib\site-packages\transformers\models\llama\modeling_llama.py", line 457, in <listcomp>
    self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
  File "C:\Users\Peter\miniconda3\envs\textgen\lib\site-packages\transformers\models\llama\modeling_llama.py", line 274, in __init__
    self.mlp = LlamaMLP(
  File "C:\Users\Peter\miniconda3\envs\textgen\lib\site-packages\transformers\models\llama\modeling_llama.py", line 147, in __init__
    self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
  File "C:\Users\Peter\miniconda3\envs\textgen\lib\site-packages\torch\nn\modules\linear.py", line 96, in __init__
    self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 90177536 bytes.
```
It's very surprising, considering that's only around 90 MB. Thanks, any help is appreciated.
Can confirm. I have the same issue, but with 64GB RAM, 24GB VRAM and alpaca-30b-lora-int4.
`RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 238551040 bytes.`
For me, the working solution was to increase the page file (or swap file) to 64 GB. Most of it won't actually be used, but for some reason it's needed.
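If you want to confirm the enlarged page file is actually visible to the process before loading, a rough pre-flight check like the one below works. It is not part of the webui and needs `psutil` (`pip install psutil`):

```python
# Rough pre-flight check, not part of text-generation-webui.
import psutil

ram = psutil.virtual_memory()   # physical RAM
swap = psutil.swap_memory()     # page file on Windows, swap on Linux

print(f"RAM:       {ram.total / 2**30:.1f} GiB total, {ram.available / 2**30:.1f} GiB available")
print(f"Page/swap: {swap.total / 2**30:.1f} GiB total, {swap.free / 2**30:.1f} GiB free")

# Loading llama-7b in 4-bit still builds a full-precision skeleton on the CPU
# first, so RAM + page file together should comfortably exceed the ~13 GiB an
# fp16 7B model takes.
if ram.total + swap.total < 16 * 2**30:
    print("Warning: combined RAM + page file is probably too small for llama-7b.")
```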
Use swap memory. It requires a lot more memory than you think.
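As far as I can tell, the reason is that the GPTQ path first constructs an unquantized model skeleton on the CPU (that's the `torch.empty(...)` call in the traceback) and only then loads the 4-bit weights into it. Windows has to back every committed allocation with RAM or page file even if the memory is never touched, so the peak commit charge is roughly the size of the half-precision model, not the 4-bit checkpoint. Here is a minimal sketch of how that big allocation can be avoided with meta-device initialization; it uses `accelerate` and is not the webui's actual code path, and the model path is just an example:

```python
# Sketch only, not text-generation-webui's loader. Assumes `accelerate` and a
# transformers version with LLaMA support are installed.
from accelerate import init_empty_weights
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig.from_pretrained("models/llama-7b")  # example path

with init_empty_weights():
    # Parameters are created on the "meta" device: no CPU RAM or page file
    # is committed for them, so the allocation in the traceback never happens.
    model = LlamaForCausalLM(config)

# The quantized weights would then have to be materialized tensor by tensor,
# which is roughly what loaders that stream the checkpoint layer by layer do.
```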