DefaultCPUAllocator: not enough memory: you tried to allocate 22151168 bytes - llama-4bit
I get the following error when trying to load 30b-llama-4bit:
(textgen) D:\text-generation-webui>python server.py --auto-devices --no-stream --cai-chat --load-in-4bit --gpu-memory 21
Loading the extension "gallery"... Ok.
Loading llama-30b...
Loading model ...
Traceback (most recent call last):
File "D:\text-generation-webui\server.py", line 194, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "D:\text-generation-webui\modules\models.py", line 119, in load_model
model = load_quant(path_to_model, Path(f"models/{pt_model}"), 4)
File "D:\text-generation-webui\repositories\GPTQ-for-LLaMa\llama.py", line 241, in load_quant
model.load_state_dict(torch.load(checkpoint))
File "C:\Users\inaba\miniconda3\envs\textgen\lib\site-packages\torch\serialization.py", line 789, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "C:\Users\inaba\miniconda3\envs\textgen\lib\site-packages\torch\serialization.py", line 1131, in _load
result = unpickler.load()
File "C:\Users\inaba\miniconda3\envs\textgen\lib\site-packages\torch\serialization.py", line 1101, in persistent_load
load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
File "C:\Users\inaba\miniconda3\envs\textgen\lib\site-packages\torch\serialization.py", line 1079, in load_tensor
storage = zip_file.get_storage_from_record(name, numel, torch.UntypedStorage).storage().untyped()
RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 22151168 bytes.
I have 32 GB RAM, a 12700K, and a 4090. The python process peaks at 20 GB RAM usage and then crashes regardless of the --gpu-memory parameter. Also, I am not sure why it even says CPUAllocator; I want to load it with my GPU.
Same here, I have exactly the same problem.
Same issue here, both on Windows and WSL (manual Anaconda). I peak at 30 GB VRAM and then the process crashes.
It seems that it needs to allocate a lot of system RAM but doesn't actually use it. Just set a large swap/virtual memory and you can run LLaMA in 4-bit; the virtual memory will not actually be used.
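For what it's worth, the reason the error points at the CPU allocator even though the model is headed for the GPU is that torch.load() first deserializes the whole checkpoint into system RAM. A rough sketch of what load_quant in repositories/GPTQ-for-LLaMa/llama.py ends up doing (simplified, and the checkpoint filename below is only a placeholder):

import torch

def load_quant_sketch(model, checkpoint):
    # torch.load() materializes every tensor of the checkpoint in host RAM,
    # so it is the DefaultCPUAllocator that runs out of memory here, long
    # before anything is copied to the GPU.
    state_dict = torch.load(checkpoint)    # RAM / commit charge peaks here
    model.load_state_dict(state_dict)      # copies the weights into the model
    return model

# The model is only moved to the GPU after this returns, e.g.
#   model = load_quant_sketch(model, "models/llama-30b-4bit.pt").cuda()
# which is why --gpu-memory has no effect on this particular crash.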
same here
It turns out that loading the 30B 4-bit model needs over 90 GB of RAM committed; make sure your swap is big enough.
@awatuna I have 32 GB RAM and 32 GB swap, and I can load the 30B 4-bit model.
Increasing the swap size fixed the issue.
How do I increase my swap size? What commands are used?
For Windows you can do it under System Properties > Advanced > Performance > Settings, where you can specify the page file (swap) size for the disk you have the webui installed on, and then restart. For Linux, a typical command sequence is sketched below.
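A commonly used way to add a swap file on Linux (the 64G size and the /swapfile path are just examples; adjust them for your disk and free space):

sudo fallocate -l 64G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
free -h    # confirm the new swap shows up

If you want it to persist across reboots, add a matching /swapfile entry to /etc/fstab.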
@oobabooga Windows allocates swap for committed memory. I have 64 GB with 8 GB swap and it fails right away.
Set the page file to system-managed and free up space on the swap drive so it can grow, then load 30B 4-bit. The committed memory grew from under 4 GB (just rebooted) to a peak of over 97 GB before dropping again; the actual in-use memory stays low, but that huge commit needs swap allocated behind it or the load fails.
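If you want to watch this yourself, a small loop with psutil (an extra package, pip install psutil; run it in a second terminal while the model loads) prints RAM and swap/page-file usage; on Windows, the "Committed" counter in Task Manager under Performance > Memory shows the same spike:

import time
import psutil

# Print physical RAM and swap / page file usage every 2 seconds.
while True:
    vm = psutil.virtual_memory()
    sm = psutil.swap_memory()
    print(f"RAM {vm.used / 2**30:5.1f} / {vm.total / 2**30:5.1f} GiB   "
          f"swap {sm.used / 2**30:5.1f} / {sm.total / 2**30:5.1f} GiB")
    time.sleep(2)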