text-generation-webui
4bit LoRA "--monkey-patch" breaks "--gpu-memory" Model Splitting for Multi-GPU
Describe the bug
Trying to apply kuleshov/llama-65b-4bit to Neko-Institute-of-Science/LLaMA-65B-4bit-128g: with --monkey-patch, the --gpu-memory setting seems to be ignored, so the model is not split across GPUs.
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
python server.py --wbits 4 --gpu-memory 15 19 --listen --model llama-65b --lora alpaca-65b-4bit --verbose --chat --groupsize 128 --no-fused_mlp --monkey-patch
Screenshot
No response
Logs
(textgen1) user@hostname:~/Documents/ooba/1/text-generation-webui$ python server.py --wbits 4 --gpu-memory 15 19 --listen --model llama-65b --lora alpaca-65b-4bit --verbose --chat --groupsize 128 --no-fused_mlp --monkey-patch
bin /home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
Loading llama-65b...
Warning: applying the monkey patch for using LoRAs in 4-bit mode.
It may cause undefined behavior outside its intended scope.
Loading Model ...
The safetensors archive passed at models/llama-65b-4bit.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
Traceback (most recent call last):
File "/home/user/Documents/ooba/1/text-generation-webui/server.py", line 905, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/home/user/Documents/ooba/1/text-generation-webui/modules/models.py", line 110, in load_model
model, tokenizer = load_model_llama(model_name)
File "/home/user/Documents/ooba/1/text-generation-webui/modules/monkey_patch_gptq_lora.py", line 23, in load_model_llama
model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path, groupsize=shared.args.groupsize, is_v1_model=False)
File "/home/user/Documents/ooba/1/text-generation-webui/repositories/alpaca_lora_4bit/autograd_4bit.py", line 202, in load_llama_model_4bit_low_ram
model = accelerate.load_checkpoint_and_dispatch(
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/accelerate/big_modeling.py", line 479, in load_checkpoint_and_dispatch
load_checkpoint_in_model(
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 924, in load_checkpoint_in_model
checkpoint = load_state_dict(checkpoint_file, device_map=device_map)
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 822, in load_state_dict
tensors[key] = f.get_tensor(key)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 23.69 GiB total capacity; 23.01 GiB already allocated; 4.81 MiB free; 23.04 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
System Info
user@hostname
-------------
OS: Linux Mint 21.1 x86_64
Kernel: 5.15.0-69-generic
Uptime: 3 hours, 12 mins
Packages: 2444 (dpkg)
Shell: bash 5.1.16
Resolution: 2560x1440
DE: Cinnamon 5.6.8
WM: Mutter (Muffin)
WM Theme: Mint-Y-Dark-Aqua (Mint-Y)
Theme: Mint-Y-Dark-Aqua [GTK2/3]
Icons: Mint-Y-Dark-Aqua [GTK2/3]
Terminal: gnome-terminal
CPU: AMD Ryzen 5 5600X (12) @ 3.700GHz
GPU: NVIDIA GeForce RTX 3090
GPU: NVIDIA GeForce RTX 3090
Memory: 9801MiB / 80344MiB
You need to use different functions than the ones the monkey patch calls in order to do offloading. This is why I did not like that approach.
if shared.args.gpu_memory or torch.cuda.device_count() > 1:
    model, tokenizer = load_llama_model_4bit_low_ram_and_offload(str(path_to_model), str(pt_path), lora_path=None, groupsize=shared.args.groupsize, seqlen=2048, max_memory=calculate_device_mem(), is_v1_model=shared.args.v1)
else:
I only did offloading to CPU since I don't have multiple GPUs, but perhaps the device map might also have to be passed into the function, and that would be an upstream change... it does seem to do exactly what the device map in gptq_loader does, though.
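For reference, accelerate expects max_memory as a dict keyed by GPU index (plus an optional "cpu" entry) with string sizes such as "15GiB". Below is a minimal sketch of a helper in that spirit, assuming shared.args.gpu_memory holds the values passed to --gpu-memory; the name build_max_memory and the CPU cap are illustrative, not the repository's actual calculate_device_mem:

# Illustrative only: turn webui-style --gpu-memory values into the max_memory
# dict that accelerate's device-map inference understands.
def build_max_memory(gpu_memory, cpu_memory='64GiB'):
    max_memory = {}
    for i, limit in enumerate(gpu_memory):
        limit = str(limit)
        # Append a unit if the user passed a bare number like "15".
        max_memory[i] = limit if limit.lower().endswith(('gib', 'mib')) else f'{limit}GiB'
    # Anything that does not fit on the GPUs may be offloaded to CPU RAM.
    max_memory['cpu'] = cpu_memory
    return max_memory

# build_max_memory(['15', '19']) -> {0: '15GiB', 1: '19GiB', 'cpu': '64GiB'}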
OK, so I managed to get the model to LOAD by using your suggestion: just change monkey_patch_gptq_lora.py as indicated below.
def load_model_llama(model_name):
    config_path = str(Path(f'{shared.args.model_dir}/{model_name}'))
    model_path = str(find_quantized_model_file(model_name))
>   max_memory = {i: f"{shared.args.gpu_memory[i]}GiB" for i in range(len(shared.args.gpu_memory))}
>   if shared.args.gpu_memory or torch.cuda.device_count() > 1:
>       model, tokenizer = load_llama_model_4bit_low_ram_and_offload(
>           config_path,
>           model_path,
>           lora_path=str(Path(f'{shared.args.lora_dir}/{shared.args.lora}')),
>           groupsize=shared.args.groupsize,
>           seqlen=2048,
>           max_memory=max_memory,
>           is_v1_model=False,
>       )
>   else:
>       model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path, groupsize=shared.args.groupsize, is_v1_model=False)
    for n, m in model.named_modules():
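A side note on the snippet above (my observation, not part of the original change): max_memory is built from shared.args.gpu_memory before the check, so launching with --monkey-patch but without --gpu-memory would raise a TypeError on that line. A guarded sketch, with the caveat that I have not verified whether the offload loader accepts max_memory=None:

# Only build max_memory when --gpu-memory was actually supplied.
max_memory = None
if shared.args.gpu_memory:
    # One entry per GPU index, in the format accelerate expects, e.g. {0: '15GiB', 1: '19GiB'}.
    max_memory = {i: f'{mem}GiB' for i, mem in enumerate(shared.args.gpu_memory)}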
And while it's nice to LOAD the model across two GPUs, I can't do much with it, as I'm still met with the following error when I try to generate anything:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper_CUDA_gather)
I guess loading it is a start, right?... heh
I thought that not splitting "LlamaDecoderLayer" was enough, is it not? I only did offloading to CPU with this.
If by not splitting "LlamaDecoderLayer" you mean modifying autograd_4bit.py, changing this:
print('Dispatching model ...')
> device_map = accelerate.infer_auto_device_map(model, max_memory=max_memory, no_split_module_classes=["LlamaDecoderLayer"])
model = accelerate.dispatch_model(model, device_map=device_map, offload_buffers=True, main_device=0)
to this:
print('Dispatching model ...')
> device_map = accelerate.infer_auto_device_map(model, max_memory=max_memory)
model = accelerate.dispatch_model(model, device_map=device_map, offload_buffers=True, main_device=0)
Then no... no it does not change things... it still errors in text_generation.py
Traceback (most recent call last):
File "/home/user/Documents/ooba/1/text-generation-webui/modules/callbacks.py", line 66, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "/home/user/Documents/ooba/1/text-generation-webui/modules/text_generation.py", line 252, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/peft/peft_model.py", line 716, in generate
outputs = self.base_model.generate(**kwargs)
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/peft/peft_model.py", line 716, in generate
outputs = self.base_model.generate(**kwargs)
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/transformers/generation/utils.py", line 1508, in generate
return self.sample(
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/transformers/generation/utils.py", line 2547, in sample
outputs = self(
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 688, in forward
outputs = self.model(
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 578, in forward
layer_outputs = decoder_layer(
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/anaconda3/envs/textgen1/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 307, in forward
hidden_states = residual + hidden_states
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
Output generated in 2.61 seconds (0.00 tokens/s, 0 tokens, context 216, seed 158459730)
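For context on the no_split_module_classes change above: accelerate's device-map inference will happily place the submodules of one block on different devices unless that block's class is listed in no_split_module_classes, and a residual add inside a split block is exactly the kind of op that then sees cuda:0 and cuda:1 at once. A toy sketch of the difference (hypothetical module and made-up memory limits, not the repository's code; the exact placements depend on the accelerate version):

import accelerate
import torch.nn as nn

class ToyBlock(nn.Module):
    # Stand-in for a decoder layer: two sub-layers plus a residual add.
    def __init__(self, dim=1024):
        super().__init__()
        self.attn = nn.Linear(dim, dim)
        self.mlp = nn.Linear(dim, dim)

    def forward(self, x):
        # If attn and mlp end up on different devices, this add raises
        # "Expected all tensors to be on the same device".
        return x + self.mlp(self.attn(x))

model = nn.Sequential(*[ToyBlock() for _ in range(8)])
max_memory = {0: '30MB', 1: '30MB', 'cpu': '1GiB'}  # made-up limits to force a split

# Free to split a block's submodules across devices (the second variant above).
print(accelerate.infer_auto_device_map(model, max_memory=max_memory))

# Each ToyBlock stays whole on a single device (the original behaviour).
print(accelerate.infer_auto_device_map(model, max_memory=max_memory,
                                       no_split_module_classes=['ToyBlock']))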
Yeah, I don't have multi-GPU yet to try it. It does split between CPU and GPU for me successfully like that. RAM balloons a little while generating, but it does offload. I'll have to read up on the accelerate documentation and see what's wrong.
Does my implementation fail as well? https://github.com/Ph0rk0z/text-generation-webui-testing/commit/a2c9bb0e1cbc668402740e82320a23d6c72b1f1d
And have you updated accelerate?
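In case it helps with debugging either implementation, here is a quick, unofficial sketch for checking where dispatch actually put each layer; a module whose parameters report more than one device is the one tripping the RuntimeError above (offloaded weights will typically show up as 'meta' or 'cpu'):

from collections import defaultdict

def summarize_devices(model, depth=3):
    # Group parameters by their module path prefix (e.g. model.layers.42)
    # and list the devices they live on.
    placement = defaultdict(set)
    for name, param in model.named_parameters():
        key = '.'.join(name.split('.')[:depth])
        placement[key].add(str(param.device))
    for module, devices in sorted(placement.items()):
        flag = '  <-- split across devices' if len(devices) > 1 else ''
        print(f'{module}: {sorted(devices)}{flag}')

# e.g. summarize_devices(shared.model) right after load_model() returns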
This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.
I haven't yet tried any of the workarounds, but I recently hit this too: with --monkey-patch, --auto-devices and --pre_layer no longer seem to work, at least for splitting across GPUs. A large model loads solely onto GPU 0 and runs out of memory.
The monkey patch doesn't work with --pre_layer. It never did. You have to specify --gpu-memory.
It worked when specifying GPU memory, thanks.