
Error allocating RAM

PeterDaGrape opened this issue 1 year ago

I'm trying to run text-generation-webui on my computer. I'm pretty limited with 8 GB of RAM, but I have an RTX 3060 Ti to run it on. When running 7B without quantization it loads most of the way before running out of memory, and when running it in 4-bit mode I always get an error that it has run out of RAM:

```
(textgen) PS C:\Users\Peter\llama\text-generation-webui> python server.py --auto-devices --gptq-bits 4
The following models are available:

1. llama-7b
2. opt-350m

Which one do you want to load? 1-2

1

Loading llama-7b...
Traceback (most recent call last):
  File "C:\Users\Peter\llama\text-generation-webui\server.py", line 241, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "C:\Users\Peter\llama\text-generation-webui\modules\models.py", line 101, in load_model
    model = load_quantized(model_name)
  File "C:\Users\Peter\llama\text-generation-webui\modules\GPTQ_loader.py", line 64, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, shared.args.gptq_pre_layer)
  File "C:\Users\Peter\llama\text-generation-webui\repositories\GPTQ-for-LLaMa\llama.py", line 232, in load_quant
    model = LlamaForCausalLM(config)
  File "C:\Users\Peter\miniconda3\envs\textgen\lib\site-packages\transformers\models\llama\modeling_llama.py", line 652, in __init__
    self.model = LlamaModel(config)
  File "C:\Users\Peter\miniconda3\envs\textgen\lib\site-packages\transformers\models\llama\modeling_llama.py", line 457, in __init__
    self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
  File "C:\Users\Peter\miniconda3\envs\textgen\lib\site-packages\transformers\models\llama\modeling_llama.py", line 457, in <listcomp>
    self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
  File "C:\Users\Peter\miniconda3\envs\textgen\lib\site-packages\transformers\models\llama\modeling_llama.py", line 274, in __init__
    self.mlp = LlamaMLP(
  File "C:\Users\Peter\miniconda3\envs\textgen\lib\site-packages\transformers\models\llama\modeling_llama.py", line 147, in __init__
    self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
  File "C:\Users\Peter\miniconda3\envs\textgen\lib\site-packages\torch\nn\modules\linear.py", line 96, in __init__
    self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 90177536 bytes.
```

It's very surprising considering that's only around 90 MB. Thanks, any help is appreciated.

PeterDaGrape avatar Mar 22 '23 18:03 PeterDaGrape
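For what it's worth, the failing allocation size isn't random: 90,177,536 bytes is exactly one fp16 `up_proj` weight matrix for LLaMA-7B (4096 × 11008 × 2 bytes), and the 238,551,040 bytes reported below matches the same matrix for the 30B model (6656 × 17920 × 2 bytes). The traceback shows `load_quant` instantiating the full `LlamaForCausalLM` skeleton on the CPU before any 4-bit weights are swapped in, so the whole fp16 footprint has to be allocatable up front. A rough back-of-the-envelope sketch (plain Python; dimensions are the published LLaMA configs, smaller tensors such as norms are ignored, so treat it as a lower bound):

```python
# Rough estimate of the CPU memory needed just to instantiate the fp16 LLaMA
# skeleton that load_quant builds before the quantized weights replace it.
# Ignores biases, norms and the KV cache, so this is a lower bound.

configs = {
    "llama-7b":  dict(hidden=4096, intermediate=11008, layers=32, vocab=32000),
    "llama-30b": dict(hidden=6656, intermediate=17920, layers=60, vocab=32000),
}

BYTES_FP16 = 2

for name, c in configs.items():
    h, i, n, v = c["hidden"], c["intermediate"], c["layers"], c["vocab"]
    up_proj = h * i * BYTES_FP16                       # the tensor in the traceback
    per_layer = (4 * h * h + 3 * h * i) * BYTES_FP16   # q/k/v/o + gate/up/down projections
    total = n * per_layer + 2 * v * h * BYTES_FP16     # + embeddings and lm_head
    print(f"{name}: one up_proj = {up_proj:,} bytes, "
          f"full fp16 skeleton ≈ {total / 2**30:.1f} GiB")
```

That works out to roughly 12–13 GiB for 7B, which an 8 GB machine can only satisfy if the page file is large enough to back it, consistent with the workaround in the comments below.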

Can confirm. I have the same issue, but with 64 GB RAM, 24 GB VRAM and alpaca-30b-lora-int4. `RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 238551040 bytes.`

kirillsurkov avatar Mar 23 '23 17:03 kirillsurkov

For me, the working solution is to increase the page or swap file to 64 GB. It won't be used, but for some reason it's needed.

kirillsurkov avatar Mar 23 '23 19:03 kirillsurkov
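That lines up with the estimate above: Windows has to be able to commit the whole fp16 skeleton even though `torch.empty` never actually touches the pages, so physical RAM plus page file must cover it while real RAM usage stays low. A minimal sketch for checking the headroom before launching `server.py`, assuming `psutil` is installed in the textgen environment (on Windows its `swap_memory()` roughly corresponds to the page file):

```python
# Minimal sketch: check whether physical RAM plus page file can plausibly back
# the ~13 GiB fp16 skeleton before loading llama-7b. Assumes psutil is
# available (pip install psutil).
import psutil

NEEDED_GIB = 13  # rough fp16 skeleton size for llama-7b (see estimate above)

ram = psutil.virtual_memory().total / 2**30
page = psutil.swap_memory().total / 2**30

print(f"RAM: {ram:.1f} GiB, page file/swap: {page:.1f} GiB")
if ram + page < NEEDED_GIB:
    print("Likely to hit the DefaultCPUAllocator error: enlarge the page file "
          "(64 GB worked above) before loading the model.")
else:
    print("Commit limit looks large enough to build the model skeleton.")
```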

Use swap memory. It requires a lot more memory than you think.

qwopqwop200 avatar Apr 02 '23 03:04 qwopqwop200