
Unable to load LoRA for large models

sgsdxzy opened this issue 2 years ago

Describe the bug

Spec: 2080Ti 22G *3, trying to run llama-30B + the alpaca-30b LoRA from https://huggingface.co/baseten/alpaca-30b. I cannot get the LoRA to load without a VRAM OOM. I can run llama-30B in 8-bit or fp16, or llama-65B in 8-bit, just fine.

Is there an existing issue for this?

  • [X] I have searched the existing issues

Reproduction

Attempt 1, trying to load in int8:

python server.py --model llama-30b --load-in-8bit --lora alpaca-30b --gpu-memory 16 16 16
Loading llama-30b...

CUDA SETUP: CUDA runtime path found: /home/sgsdxzy/mambaforge/envs/cu118/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/sgsdxzy/.local/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 61/61 [00:37<00:00,  1.64it/s]
Loaded the model in 38.63 seconds.
alpaca-30b
Adding the LoRA alpaca-30b to the model...
Traceback (most recent call last):
  File "/home/sgsdxzy/Programs/text-generation-webui/server.py", line 247, in <module>
    add_lora_to_model(shared.lora_name)
  File "/home/sgsdxzy/Programs/text-generation-webui/modules/LoRA.py", line 25, in add_lora_to_model
    shared.model = PeftModel.from_pretrained(shared.model, Path(f"loras/{lora_name}"), **params)
  File "/home/sgsdxzy/.local/lib/python3.10/site-packages/peft/peft_model.py", line 177, in from_pretrained
    model = dispatch_model(model, device_map=device_map)
  File "/home/sgsdxzy/.local/lib/python3.10/site-packages/accelerate/big_modeling.py", line 370, in dispatch_model
    attach_align_device_hook_on_blocks(
  File "/home/sgsdxzy/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 495, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
  File "/home/sgsdxzy/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 495, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
  File "/home/sgsdxzy/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 495, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
  [Previous line repeated 2 more times]
  File "/home/sgsdxzy/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 471, in attach_align_device_hook_on_blocks
    add_hook_to_module(module, hook)
  File "/home/sgsdxzy/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 155, in add_hook_to_module
    module = hook.init_hook(module)
  File "/home/sgsdxzy/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 244, in init_hook
    set_module_tensor_to_device(module, name, self.execution_device)
  File "/home/sgsdxzy/.local/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 147, in set_module_tensor_to_device
    new_value = old_value.to(device)
  File "/home/sgsdxzy/.local/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 199, in to
    super().to(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 114.00 MiB (GPU 0; 21.67 GiB total capacity; 20.74 GiB already allocated; 55.75 MiB free; 21.03 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Attempt 2, trying to load in int8, but keeping GPU 0 as empty as possible:

python server.py --model llama-30b --lora alpaca-30b --gpu-memory  1 20 20

Still OOM on GPU 0.

Attempt 3, trying to load in fp16:

python server.py --model llama-30b --lora alpaca-30b --gpu-memory 21 21 21

And got a similar OOM.

Screenshot

No response

Logs

See Reproduction.

System Info

pytorch: 2.0.0 
cuda: 11.8
system: Linux
GPU: 2080Ti 22G * 3

sgsdxzy Mar 23 '23 16:03

I think it's likely because of this code:

    params = {}
    params['device_map'] = {'': 0}
    #params['dtype'] = shared.model.dtype
    shared.model = PeftModel.from_pretrained(shared.model, Path(f"loras/{lora_name}"), **params)

See how it overrides the device_map that holds the information about which GPU each part of the model lives on. So it tries to place the whole model on the first GPU, and it fails because the model is too big.
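
To make that concrete, here is a rough illustration of the two device maps involved (module names and device indices are illustrative, and the real hf_device_map for llama-30b is much longer):

    # What accelerate builds when llama-30b is split across three GPUs
    # (illustrative entries only):
    original_map = {
        'model.embed_tokens': 0,
        'model.layers.0': 0,
        # ...
        'model.layers.59': 2,
        'model.norm': 2,
        'lm_head': 2,
    }

    # What the snippet above passes to PeftModel.from_pretrained instead.
    # The empty-string key refers to the root module, so dispatch_model tries
    # to place the entire 30B model on GPU 0, which cannot hold it.
    forced_map = {'': 0}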

ortegaalfredo Mar 23 '23 16:03

@sgsdxzy I have managed to load a LoRA in 16-bit mode with CPU offloading using --gpu-memory 10000MiB, but I haven't tested multi-GPU setups. If you can find a modification to LoRA.py that makes this work, let me know.

@ortegaalfredo the code you pasted was updated yesterday: https://github.com/oobabooga/text-generation-webui/blob/main/modules/LoRA.py

oobabooga Mar 23 '23 16:03

@oobabooga Now that @ortegaalfredo has pinned down the problem, this is easy to fix by replicating the original device map:

    params['device_map'] = {"base_model.model."+k: v for k, v in shared.model.hf_device_map.items()}

and it works now!
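
For reference, the patched add_lora_to_model might look roughly like this (the function body below is reconstructed from the traceback and the snippets above, so treat it as a sketch rather than the repo's exact code):

    from pathlib import Path

    from peft import PeftModel

    import modules.shared as shared

    def add_lora_to_model(lora_name):
        print(f"Adding the LoRA {lora_name} to the model...")
        params = {}
        if hasattr(shared.model, 'hf_device_map'):
            # Reuse the device map accelerate built when the base model was loaded,
            # adding the prefix under which PeftModel wraps the base model.
            params['device_map'] = {
                "base_model.model." + k: v
                for k, v in shared.model.hf_device_map.items()
            }
        else:
            # Single-device fallback: keep the old behaviour of placing everything on GPU 0.
            params['device_map'] = {'': 0}

        shared.model = PeftModel.from_pretrained(
            shared.model, Path(f"loras/{lora_name}"), **params
        )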

sgsdxzy Mar 23 '23 16:03

I confirm that @sgsdxzy's patch now successfully loads alpaca-lora-30b on 2x3090 GPUs using int8 quantization.

ortegaalfredo Mar 23 '23 18:03

I have incorporated @sgsdxzy's device map here (thanks again for sharing the code):

https://github.com/oobabooga/text-generation-webui/commit/9bf6ecf9e2de9b72c3fa62e0e6f5b5e9041825b1

@sgsdxzy @ortegaalfredo can you test if things are working as expected on your end now?

oobabooga Mar 23 '23 19:03

I'm on 8747c74339cf1e7f1d45f4aa1dcc090e9eba94a3, and it now loads the LoRA and the 30B model on 2x3090 with no problem.

ortegaalfredo Mar 24 '23 02:03

I am wondering if the model.half() call is still necessary, as it can take several minutes for large models.

sgsdxzy Mar 24 '23 03:03

I got that from here https://github.com/tloen/alpaca-lora/blob/main/generate.py#L93

Without model.half(), I would get errors about "expected half but got float" while trying to generate text.

oobabooga Mar 24 '23 03:03

I think that if the model is already in half precision, there's probably a way to load the LoRA directly in the desired dtype without re-converting the entire model from half to half again; a rough sketch of the idea is below. I will experiment with it tonight. Anyway, that's beyond the scope of this issue, and this issue is fixed by https://github.com/oobabooga/text-generation-webui/commit/8747c74339cf1e7f1d45f4aa1dcc090e9eba94a3
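
Something along these lines might work, casting only the freshly injected adapter weights instead of calling model.half() on the whole model (an untested sketch; it assumes the base model is already fp16 and relies on peft naming its injected adapter submodules with a lora_ prefix):

    import torch

    def cast_lora_to_half(model):
        """Cast only the LoRA adapter modules to fp16 instead of calling model.half()."""
        for name, module in model.named_modules():
            # peft injects its adapter layers as submodules named lora_A / lora_B
            if 'lora_' in name:
                module.to(torch.float16)
        return model

    # Hypothetical usage right after PeftModel.from_pretrained:
    # shared.model = cast_lora_to_half(shared.model)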

sgsdxzy Mar 24 '23 03:03