text-generation-webui
Crash with llava 4bit and --auto-devices
Describe the bug
This is related to #1636. To work around VRAM usage on my 12 GB RTX 3060, I'm using the 4bit model with --gpu-memory 7 (since inference often wants more than 12 GB), which causes some layers to be loaded onto the CPU. As before, there's a crash as soon as it starts working on a response. Perhaps this model just doesn't work with CPU offloading; if so, please let me know and this bug can be closed.
The "Output generated" line at the end also shows an impossibly high number for my hardware at least:
Output generated in 4.77 seconds (308.11 tokens/s, 1471 tokens, context 377, seed 1573299585)
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
Launch with --extensions llava --gpu-memory 7 --auto-devices, then upload a picture and enter a prompt.
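For reference, the full launch command looks roughly like this; a sketch assuming the standard server.py entry point and the model directory name from the logs below, with --wbits/--groupsize included in case they are not auto-detected:

```sh
python server.py \
  --model wojtab_llava-13b-v0-4bit-128g \
  --wbits 4 --groupsize 128 \
  --extensions llava \
  --gpu-memory 7 \
  --auto-devices
```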
Screenshot
No response
Logs
Found the following quantized model: /z820/ds2/models/wojtab_llava-13b-v0-4bit-128g/llava-13b-v0-4bit-128g.safetensors
Loading model ...
Done.
Using the following device map for the quantized model: {'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 0, 'model.layers.15': 0, 'model.layers.16': 0, 'model.layers.17': 0, 'model.layers.18': 0, 'model.layers.19': 0, 'model.layers.20': 0, 'model.layers.21': 0, 'model.layers.22': 0, 'model.layers.23': 0, 'model.layers.24': 0, 'model.layers.25': 0, 'model.layers.26': 0, 'model.layers.27': 0, 'model.layers.28': 0, 'model.layers.29': 0, 'model.layers.30': 0, 'model.layers.31': 0, 'model.layers.32': 0, 'model.layers.33': 0, 'model.layers.34': 0, 'model.layers.35': 0, 'model.layers.36': 0, 'model.layers.37': 0, 'model.layers.38': 'cpu', 'model.layers.39': 'cpu', 'model.norm': 'cpu', 'lm_head': 'cpu'}
Loaded the model in 11.06 seconds.
[...]
### Assistant:
--------------------
LLaVA - Embedded 1 image(s) in 2.27s
Traceback (most recent call last):
File "/root/text-generation-webui/modules/callbacks.py", line 66, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "/root/text-generation-webui/modules/text_generation.py", line 290, in generate_with_callback
shared.model.generate(**kwargs)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 1485, in generate
return self.sample(
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 2524, in sample
outputs = self(
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
outputs = self.model(
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
layer_outputs = decoder_layer(
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 196, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 160, in new_forward
args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 280, in pre_forward
set_module_tensor_to_device(module, name, self.execution_device, value=self.weights_map[name])
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/utils/offload.py", line 123, in __getitem__
return self.dataset[f"{self.prefix}{key}"]
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/utils/offload.py", line 170, in __getitem__
weight_info = self.index[key]
KeyError: 'model.layers.38.self_attn.q_proj.wf1'
Output generated in 4.77 seconds (308.11 tokens/s, 1471 tokens, context 377, seed 1573299585)
System Info
Ubuntu 22.04
RTX 3060
Looks like with --gpu-memory 7100MiB it starts pushing some layers to the CPU.
I think part of the challenge is that with the llava extension active, the GPU already has roughly 1.8-2.0 GB of VRAM in use, and the --gpu-memory figure comes on top of that.
The threshold is 7100: that value pushes a few layers to the CPU, while 7200 doesn't. However, with 7200 (and above) it overruns the 12 GB of VRAM on many prompts.
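One way to pick the value is to check how much VRAM the webui and the llava extension already hold before generation; a sketch using nvidia-smi (not mentioned in the thread), from which --gpu-memory can be set to roughly the total minus what is already in use minus some headroom for activations:

```sh
# Report per-GPU memory usage; run after the webui and llava extension have loaded.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```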
I don't think GPTQ works on the CPU; the same thing happens with vicuna-13b-4bit.
OK - confirmed: --pre_layer allows CPU offload to work with GPTQ.
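For reference, the working configuration looks roughly like this; a sketch where the --pre_layer value (the number of transformer layers kept on the GPU, with the rest run on the CPU) is illustrative and needs tuning for 12 GB:

```sh
python server.py \
  --model wojtab_llava-13b-v0-4bit-128g \
  --wbits 4 --groupsize 128 \
  --extensions llava \
  --pre_layer 30
```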
I found a couple of other related things:
--auto-devices seems to be unconditionally enabled: I can omit it, and if there isn't enough GPU VRAM it will still send some layers to the CPU. At the moment I'm confused as to why this path crashes while --pre_layer works.
When the llava extension is loaded, it uses about 1.6 GB of VRAM by itself, and the --gpu-memory parameter appears to ignore this.
I think you actually can offload it to the CPU with --pre_layer. --auto-devices and --gpu-memory are for transformers; transformers offloading probably breaks GPTQ-for-LLaMa. As for the initial 1.6 GB of VRAM - yep, llava ignores that switch; it only applies to the LLM, but you can offload the supporting models to the CPU in settings.json.
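For completeness, a sketch of what offloading the supporting models (the CLIP encoder and projector) via settings.json could look like. The key names here are an assumption based on the llava extension's README and may differ between versions, so verify them before use:

```sh
# Assumed key names -- check the llava extension README for the exact parameters.
cat > settings-llava.json <<'EOF'
{
    "llava-clip_device": "cpu",
    "llava-projector_device": "cpu"
}
EOF
python server.py --settings settings-llava.json --extensions llava --wbits 4 --groupsize 128 --pre_layer 30
```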
OK - thanks for the explanation and the pointer to the LLaVA README for more info. Closing, as it looks like this isn't a bug; I'll assume for now that the documentation mentions, or will mention, that --auto-devices and --gpu-memory don't apply here.