text-generation-webui
Crash with llava 4bit and --auto-devices
Describe the bug
This is related to #1636. To work around VRAM usage on my 12 GB RTX 3060, I'm using the 4bit model with --gpu-memory 7 (since inference often wants more than 12 GB), which causes some layers to be loaded onto the CPU. As before, there's a crash as soon as it starts working on a response. Perhaps this model just doesn't work with CPU offloading; if so, please let me know and this bug can be closed.
The "Output generated" line at the end also shows an impossibly high number for my hardware at least:
Output generated in 4.77 seconds (308.11 tokens/s, 1471 tokens, context 377, seed 1573299585)
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
Launch with --extensions llava --gpu-memory 7 --auto-devices, then upload a picture and enter a prompt.
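For reference, the full launch command looks roughly like this; a sketch assuming the standard server.py entry point and the model directory name from the logs below, with --wbits/--groupsize included in case they are not auto-detected:

```sh
python server.py \
  --model wojtab_llava-13b-v0-4bit-128g \
  --wbits 4 --groupsize 128 \
  --extensions llava \
  --gpu-memory 7 \
  --auto-devices
```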
Screenshot
No response
Logs
Found the following quantized model: /z820/ds2/models/wojtab_llava-13b-v0-4bit-128g/llava-13b-v0-4bit-128g.safetensors
Loading model ...
Done.
Using the following device map for the quantized model: {'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 0, 'model.layers.15': 0, 'model.layers.16': 0, 'model.layers.17': 0, 'model.layers.18': 0, 'model.layers.19': 0, 'model.layers.20': 0, 'model.layers.21': 0, 'model.layers.22': 0, 'model.layers.23': 0, 'model.layers.24': 0, 'model.layers.25': 0, 'model.layers.26': 0, 'model.layers.27': 0, 'model.layers.28': 0, 'model.layers.29': 0, 'model.layers.30': 0, 'model.layers.31': 0, 'model.layers.32': 0, 'model.layers.33': 0, 'model.layers.34': 0, 'model.layers.35': 0, 'model.layers.36': 0, 'model.layers.37': 0, 'model.layers.38': 'cpu', 'model.layers.39': 'cpu', 'model.norm': 'cpu', 'lm_head': 'cpu'}
Loaded the model in 11.06 seconds.
[...]
### Assistant:
--------------------
LLaVA - Embedded 1 image(s) in 2.27s
Traceback (most recent call last):
File "/root/text-generation-webui/modules/callbacks.py", line 66, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "/root/text-generation-webui/modules/text_generation.py", line 290, in generate_with_callback
shared.model.generate(**kwargs)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 1485, in generate
return self.sample(
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 2524, in sample
outputs = self(
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
outputs = self.model(
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
layer_outputs = decoder_layer(
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 196, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 160, in new_forward
args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 280, in pre_forward
set_module_tensor_to_device(module, name, self.execution_device, value=self.weights_map[name])
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/utils/offload.py", line 123, in __getitem__
return self.dataset[f"{self.prefix}{key}"]
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/utils/offload.py", line 170, in __getitem__
weight_info = self.index[key]
KeyError: 'model.layers.38.self_attn.q_proj.wf1'
Output generated in 4.77 seconds (308.11 tokens/s, 1471 tokens, context 377, seed 1573299585)
System Info
Ubuntu 22.04
RTX 3060
Looks like with --gpu-memory 7100MiB it starts pushing some layers to the CPU.
I think part of the challenge is that with the llava extension active, the GPU already has roughly 1.8-2.0 GB of VRAM in use, and the --gpu-memory figure comes on top of that.
The threshold is 7100: that value pushes a few layers to the CPU, while 7200 doesn't. However, with 7200 (and above) it overruns the 12 GB of VRAM on many prompts.
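One way to pick the value is to check how much VRAM the webui and the llava extension already hold before generation; a sketch using nvidia-smi (not mentioned in the thread), from which --gpu-memory can be set to roughly the total minus what is already in use minus some headroom for activations:

```sh
# Report per-GPU memory usage; run after the webui and llava extension have loaded.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```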
I don't think GPTQ works on the CPU; the same thing happens with vicuna-13b-4bit.
OK - confirmed: --pre_layer allows CPU offload to work with GPTQ.
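For reference, the working configuration looks roughly like this; a sketch where the --pre_layer value (the number of transformer layers kept on the GPU, with the rest run on the CPU) is illustrative and needs tuning for 12 GB:

```sh
python server.py \
  --model wojtab_llava-13b-v0-4bit-128g \
  --wbits 4 --groupsize 128 \
  --extensions llava \
  --pre_layer 30
```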
I found a couple of other related things:
--auto-devices seems to be unconditionally enabled: I can omit it, and if there isn't enough GPU VRAM it will still send some layers to the CPU. At the moment I'm confused as to why this path crashes while --pre_layer works.
When the llava extension is loaded, it uses about 1.6 GB of VRAM by itself, and the --gpu-memory parameter appears to ignore this.
I think you actually can offload it to the CPU with --pre_layer. --auto-devices and --gpu-memory are for transformers; transformers offloading probably breaks GPTQ-for-LLaMa. As for the initial 1.6 GB of VRAM - yep, llava ignores that switch; it only applies to the LLM, but you can offload the supporting models to the CPU in settings.json.
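For completeness, a sketch of what offloading the supporting models (the CLIP encoder and projector) via settings.json could look like. The key names here are an assumption based on the llava extension's README and may differ between versions, so verify them before use:

```sh
# Assumed key names -- check the llava extension README for the exact parameters.
cat > settings-llava.json <<'EOF'
{
    "llava-clip_device": "cpu",
    "llava-projector_device": "cpu"
}
EOF
python server.py --settings settings-llava.json --extensions llava --wbits 4 --groupsize 128 --pre_layer 30
```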
OK - thanks for the explanation and the pointer to the LLaVA README for more info. Closing, as it looks like this isn't a bug; I'll assume for now that the documentation mentions, or will mention, that --auto-devices and --gpu-memory don't apply here.