text-generation-webui
4bit LoRA "--monkey-patch" breaks "--gpu-memory" Model Splitting for Multi-GPU
Describe the bug
Trying to apply kuleshov/llama-65b-4bit to Neko-Institute-of-Science/LLaMA-65B-4bit-128g: with --monkey-patch, the --gpu-memory setting seems to be ignored, so the model is not split across GPUs.
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
python server.py --wbits 4 --gpu-memory 15 19 --listen --model llama-65b --lora alpaca-65b-4bit --verbose --chat --groupsize 128 --no-fused_mlp --monkey-patch
Screenshot
No response
Logs
(textgen1) user@hostname:~/Documents/ooba/1/text-generation-webui$ python server.py --wbits 4 --gpu-memory 15 19 --listen --model llama-65b --lora alpaca-65b-4bit --verbose --chat --groupsize 128 --no-fused_mlp --monkey-patch
bin /home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
Loading llama-65b...
Warning: applying the monkey patch for using LoRAs in 4-bit mode.
It may cause undefined behavior outside its intended scope.
Loading Model ...
The safetensors archive passed at models/llama-65b-4bit.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
Traceback (most recent call last):
File "/home/user/Documents/ooba/1/text-generation-webui/server.py", line 905, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/home/user/Documents/ooba/1/text-generation-webui/modules/models.py", line 110, in load_model
model, tokenizer = load_model_llama(model_name)
File "/home/user/Documents/ooba/1/text-generation-webui/modules/monkey_patch_gptq_lora.py", line 23, in load_model_llama
model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path, groupsize=shared.args.groupsize, is_v1_model=False)
File "/home/user/Documents/ooba/1/text-generation-webui/repositories/alpaca_lora_4bit/autograd_4bit.py", line 202, in load_llama_model_4bit_low_ram
model = accelerate.load_checkpoint_and_dispatch(
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/accelerate/big_modeling.py", line 479, in load_checkpoint_and_dispatch
load_checkpoint_in_model(
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 924, in load_checkpoint_in_model
checkpoint = load_state_dict(checkpoint_file, device_map=device_map)
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 822, in load_state_dict
tensors[key] = f.get_tensor(key)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 23.69 GiB total capacity; 23.01 GiB already allocated; 4.81 MiB free; 23.04 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
System Info
user@hostname
-------------
OS: Linux Mint 21.1 x86_64
Kernel: 5.15.0-69-generic
Uptime: 3 hours, 12 mins
Packages: 2444 (dpkg)
Shell: bash 5.1.16
Resolution: 2560x1440
DE: Cinnamon 5.6.8
WM: Mutter (Muffin)
WM Theme: Mint-Y-Dark-Aqua (Mint-Y)
Theme: Mint-Y-Dark-Aqua [GTK2/3]
Icons: Mint-Y-Dark-Aqua [GTK2/3]
Terminal: gnome-terminal
CPU: AMD Ryzen 5 5600X (12) @ 3.700GHz
GPU: NVIDIA GeForce RTX 3090
GPU: NVIDIA GeForce RTX 3090
Memory: 9801MiB / 80344MiB
You need to use different functions than the ones the monkey patch calls in order to do offloading. This is why I did not like that approach.
if shared.args.gpu_memory or torch.cuda.device_count() > 1:
    model, tokenizer = load_llama_model_4bit_low_ram_and_offload(str(path_to_model), str(pt_path), lora_path=None, groupsize=shared.args.groupsize, seqlen=2048, max_memory=calculate_device_mem(), is_v1_model=shared.args.v1)
else:
I only did offloading to CPU since I don't have multiple GPUs, but perhaps the device map might also have to be passed into the function, and that would be an upstream change... it does seem to do exactly what the device map in gptq_loader does, though.
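For reference, accelerate expects max_memory as a dict keyed by GPU index (plus an optional "cpu" entry) with string sizes such as "15GiB". Below is a minimal sketch of a helper in that spirit, assuming shared.args.gpu_memory holds the values passed to --gpu-memory; the name build_max_memory and the CPU cap are illustrative, not the repository's actual calculate_device_mem:

# Illustrative only: turn webui-style --gpu-memory values into the max_memory
# dict that accelerate's device-map inference understands.
def build_max_memory(gpu_memory, cpu_memory='64GiB'):
    max_memory = {}
    for i, limit in enumerate(gpu_memory):
        limit = str(limit)
        # Append a unit if the user passed a bare number like "15".
        max_memory[i] = limit if limit.lower().endswith(('gib', 'mib')) else f'{limit}GiB'
    # Anything that does not fit on the GPUs may be offloaded to CPU RAM.
    max_memory['cpu'] = cpu_memory
    return max_memory

# build_max_memory(['15', '19']) -> {0: '15GiB', 1: '19GiB', 'cpu': '64GiB'}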
OK, so I managed to get the model to LOAD by using your suggestion: just change monkey_patch_gptq_lora.py as indicated below.
def load_model_llama(model_name):
    config_path = str(Path(f'{shared.args.model_dir}/{model_name}'))
    model_path = str(find_quantized_model_file(model_name))
>   max_memory = {i: f"{shared.args.gpu_memory[i]}GiB" for i in range(len(shared.args.gpu_memory))}
>   if shared.args.gpu_memory or torch.cuda.device_count() > 1:
>       model, tokenizer = load_llama_model_4bit_low_ram_and_offload(
>           config_path,
>           model_path,
>           lora_path=str(Path(f'{shared.args.lora_dir}/{shared.args.lora}')),
>           groupsize=shared.args.groupsize,
>           seqlen=2048,
>           max_memory=max_memory,
>           is_v1_model=False,
>       )
>   else:
>       model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path, groupsize=shared.args.groupsize, is_v1_model=False)
    for n, m in model.named_modules():
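A side note on the snippet above (my observation, not part of the original change): max_memory is built from shared.args.gpu_memory before the check, so launching with --monkey-patch but without --gpu-memory would raise a TypeError on that line. A guarded sketch, with the caveat that I have not verified whether the offload loader accepts max_memory=None:

# Only build max_memory when --gpu-memory was actually supplied.
max_memory = None
if shared.args.gpu_memory:
    # One entry per GPU index, in the format accelerate expects, e.g. {0: '15GiB', 1: '19GiB'}.
    max_memory = {i: f'{mem}GiB' for i, mem in enumerate(shared.args.gpu_memory)}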
And while it's nice to LOAD the model across two GPUs, I can't do much with it, as I'm still met with the following error when I try to generate anything:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper_CUDA_gather)
I guess loading it is a start, right?... heh
I thought that not splitting "LlamaDecoderLayer" was enough, is it not? I only did offloading to CPU with this.
If by not splitting "LlamaDecoderLayer" you mean modifying autograd_4bit.py, changing this:
print('Dispatching model ...')
> device_map = accelerate.infer_auto_device_map(model, max_memory=max_memory, no_split_module_classes=["LlamaDecoderLayer"])
model = accelerate.dispatch_model(model, device_map=device_map, offload_buffers=True, main_device=0)
to this:
print('Dispatching model ...')
> device_map = accelerate.infer_auto_device_map(model, max_memory=max_memory)
model = accelerate.dispatch_model(model, device_map=device_map, offload_buffers=True, main_device=0)
Then no... no it does not change things... it still errors in text_generation.py
Traceback (most recent call last):
File "/home/user/Documents/ooba/1/text-generation-webui/modules/callbacks.py", line 66, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "/home/user/Documents/ooba/1/text-generation-webui/modules/text_generation.py", line 252, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/peft/peft_model.py", line 716, in generate
outputs = self.base_model.generate(**kwargs)
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/peft/peft_model.py", line 716, in generate
outputs = self.base_model.generate(**kwargs)
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/transformers/generation/utils.py", line 1508, in generate
return self.sample(
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/transformers/generation/utils.py", line 2547, in sample
outputs = self(
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 688, in forward
outputs = self.model(
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 578, in forward
layer_outputs = decoder_layer(
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 307, in forward
hidden_states = residual + hidden_states
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
Output generated in 2.61 seconds (0.00 tokens/s, 0 tokens, context 216, seed 158459730)
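For context on the no_split_module_classes change above: accelerate's device-map inference will happily place the submodules of one block on different devices unless that block's class is listed in no_split_module_classes, and a residual add inside a split block is exactly the kind of op that then sees cuda:0 and cuda:1 at once. A toy sketch of the difference (hypothetical module and made-up memory limits, not the repository's code; the exact placements depend on the accelerate version):

import accelerate
import torch.nn as nn

class ToyBlock(nn.Module):
    # Stand-in for a decoder layer: two sub-layers plus a residual add.
    def __init__(self, dim=1024):
        super().__init__()
        self.attn = nn.Linear(dim, dim)
        self.mlp = nn.Linear(dim, dim)

    def forward(self, x):
        # If attn and mlp end up on different devices, this add raises
        # "Expected all tensors to be on the same device".
        return x + self.mlp(self.attn(x))

model = nn.Sequential(*[ToyBlock() for _ in range(8)])
max_memory = {0: '30MB', 1: '30MB', 'cpu': '1GiB'}  # made-up limits to force a split

# Free to split a block's submodules across devices (the second variant above).
print(accelerate.infer_auto_device_map(model, max_memory=max_memory))

# Each ToyBlock stays whole on a single device (the original behaviour).
print(accelerate.infer_auto_device_map(model, max_memory=max_memory,
                                       no_split_module_classes=['ToyBlock']))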
Yeah, I don't have multi-GPU yet to try it. It does split between CPU and GPU for me successfully like that. RAM balloons a little while generating, but it does offload. I'll have to read up on the accelerate documentation and see what's wrong.
Does my implementation fail as well? https://github.com/Ph0rk0z/text-generation-webui-testing/commit/a2c9bb0e1cbc668402740e82320a23d6c72b1f1d
And have you updated accelerate?
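In case it helps with debugging either implementation, here is a quick, unofficial sketch for checking where dispatch actually put each layer; a module whose parameters report more than one device is the one tripping the RuntimeError above (offloaded weights will typically show up as 'meta' or 'cpu'):

from collections import defaultdict

def summarize_devices(model, depth=3):
    # Group parameters by their module path prefix (e.g. model.layers.42)
    # and list the devices they live on.
    placement = defaultdict(set)
    for name, param in model.named_parameters():
        key = '.'.join(name.split('.')[:depth])
        placement[key].add(str(param.device))
    for module, devices in sorted(placement.items()):
        flag = '  <-- split across devices' if len(devices) > 1 else ''
        print(f'{module}: {sorted(devices)}{flag}')

# e.g. summarize_devices(shared.model) right after load_model() returns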
This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.
I haven't yet tried any of the workarounds, but I recently hit this too: with --monkey-patch, --auto-devices and --pre_layer no longer seem to work, at least for splitting across GPUs. A large model loads solely onto GPU 0 and runs out of memory.
The monkey patch doesn't work with --pre_layer. It never did. You have to specify --gpu-memory.
It worked when specifying GPU memory, thanks.