DefaultCPUAllocator: not enough memory: you tried to allocate 22151168 bytes - llama-4bit
I get the following error when trying to load 30b-llama-4bit:
(textgen) D:\text-generation-webui>python server.py --auto-devices --no-stream --cai-chat --load-in-4bit --gpu-memory 21
Loading the extension "gallery"... Ok.
Loading llama-30b...
Loading model ...
Traceback (most recent call last):
File "D:\text-generation-webui\server.py", line 194, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "D:\text-generation-webui\modules\models.py", line 119, in load_model
model = load_quant(path_to_model, Path(f"models/{pt_model}"), 4)
File "D:\text-generation-webui\repositories\GPTQ-for-LLaMa\llama.py", line 241, in load_quant
model.load_state_dict(torch.load(checkpoint))
File "C:\Users\inaba\miniconda3\envs\textgen\lib\site-packages\torch\serialization.py", line 789, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "C:\Users\inaba\miniconda3\envs\textgen\lib\site-packages\torch\serialization.py", line 1131, in _load
result = unpickler.load()
File "C:\Users\inaba\miniconda3\envs\textgen\lib\site-packages\torch\serialization.py", line 1101, in persistent_load
load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
File "C:\Users\inaba\miniconda3\envs\textgen\lib\site-packages\torch\serialization.py", line 1079, in load_tensor
storage = zip_file.get_storage_from_record(name, numel, torch.UntypedStorage).storage().untyped()
RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 22151168 bytes.
I have 32 GB RAM, a 12700K, and a 4090. The python process peaks at 20 GB RAM usage and then crashes regardless of the --gpu-memory parameter. Also, I am not sure why it even says CPUAllocator; I want to load it with my GPU.
Same here, I have exactly the same problem.
Same issue here, both on Windows and WSL (manual Anaconda). I peak at 30 GB VRAM and then the process crashes.
It seems that it needs to allocate a lot of system RAM but doesn't actually use it. Just set a large swap/virtual memory and you can run LLaMA in 4-bit; the virtual memory will not actually be used.
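For what it's worth, the reason the error points at the CPU allocator even though the model is headed for the GPU is that torch.load() first deserializes the whole checkpoint into system RAM. A rough sketch of what load_quant in repositories/GPTQ-for-LLaMa/llama.py ends up doing (simplified, and the checkpoint filename below is only a placeholder):

import torch

def load_quant_sketch(model, checkpoint):
    # torch.load() materializes every tensor of the checkpoint in host RAM,
    # so it is the DefaultCPUAllocator that runs out of memory here, long
    # before anything is copied to the GPU.
    state_dict = torch.load(checkpoint)    # RAM / commit charge peaks here
    model.load_state_dict(state_dict)      # copies the weights into the model
    return model

# The model is only moved to the GPU after this returns, e.g.
#   model = load_quant_sketch(model, "models/llama-30b-4bit.pt").cuda()
# which is why --gpu-memory has no effect on this particular crash.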
same here
It turns out that loading the 30B 4-bit model needs over 90 GB of RAM committed; make sure your swap is big enough.
@awatuna I have 32 GB RAM and 32 GB swap, and I can load the 30B 4-bit model.
Increasing the swap size fixed the issue.
How do I increase my swap size? What commands are used?
For Windows you can do it under System Properties > Advanced > Performance > Settings, where you can specify the page file (swap) size for the disk you have the webui installed on, and then restart. For Linux, a typical command sequence is sketched below.
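A commonly used way to add a swap file on Linux (the 64G size and the /swapfile path are just examples; adjust them for your disk and free space):

sudo fallocate -l 64G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
free -h    # confirm the new swap shows up

If you want it to persist across reboots, add a matching /swapfile entry to /etc/fstab.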
@oobabooga Windows allocates swap for committed memory. I have 64 GB with 8 GB swap and it fails right away.
Set the page file to system-managed and free up space on the swap drive so it can grow, then load 30B 4-bit. The committed memory grew from under 4 GB (just rebooted) to a peak of over 97 GB before dropping again; the actual in-use memory stays low, but that huge commit needs swap allocated behind it or the load fails.
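If you want to watch this yourself, a small loop with psutil (an extra package, pip install psutil; run it in a second terminal while the model loads) prints RAM and swap/page-file usage; on Windows, the "Committed" counter in Task Manager under Performance > Memory shows the same spike:

import time
import psutil

# Print physical RAM and swap / page file usage every 2 seconds.
while True:
    vm = psutil.virtual_memory()
    sm = psutil.swap_memory()
    print(f"RAM {vm.used / 2**30:5.1f} / {vm.total / 2**30:5.1f} GiB   "
          f"swap {sm.used / 2**30:5.1f} / {sm.total / 2**30:5.1f} GiB")
    time.sleep(2)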