Unable to load LoRA for large models
Describe the bug
Spec: 3x 2080Ti 22G. I am trying to run llama-30B + the alpaca-30b LoRA from https://huggingface.co/baseten/alpaca-30b, but I cannot get the LoRA to load without a VRAM OOM. I can run llama-30B in 8-bit or fp16, or llama-65B in 8-bit, just fine.
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
Attempt 1, trying to load in int8:
python server.py --model llama-30b --load-in-8bit --lora alpaca-30b --gpu-memory 16 16 16
Loading llama-30b...
CUDA SETUP: CUDA runtime path found: /home/sgsdxzy/mambaforge/envs/cu118/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/sgsdxzy/.local/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 61/61 [00:37<00:00, 1.64it/s]
Loaded the model in 38.63 seconds.
alpaca-30b
Adding the LoRA alpaca-30b to the model...
Traceback (most recent call last):
File "/home/sgsdxzy/Programs/text-generation-webui/server.py", line 247, in <module>
add_lora_to_model(shared.lora_name)
File "/home/sgsdxzy/Programs/text-generation-webui/modules/LoRA.py", line 25, in add_lora_to_model
shared.model = PeftModel.from_pretrained(shared.model, Path(f"loras/{lora_name}"), **params)
File "/home/sgsdxzy/.local/lib/python3.10/site-packages/peft/peft_model.py", line 177, in from_pretrained
model = dispatch_model(model, device_map=device_map)
File "/home/sgsdxzy/.local/lib/python3.10/site-packages/accelerate/big_modeling.py", line 370, in dispatch_model
attach_align_device_hook_on_blocks(
File "/home/sgsdxzy/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 495, in attach_align_device_hook_on_blocks
attach_align_device_hook_on_blocks(
File "/home/sgsdxzy/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 495, in attach_align_device_hook_on_blocks
attach_align_device_hook_on_blocks(
File "/home/sgsdxzy/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 495, in attach_align_device_hook_on_blocks
attach_align_device_hook_on_blocks(
[Previous line repeated 2 more times]
File "/home/sgsdxzy/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 471, in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
File "/home/sgsdxzy/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 155, in add_hook_to_module
module = hook.init_hook(module)
File "/home/sgsdxzy/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 244, in init_hook
set_module_tensor_to_device(module, name, self.execution_device)
File "/home/sgsdxzy/.local/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 147, in set_module_tensor_to_device
new_value = old_value.to(device)
File "/home/sgsdxzy/.local/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 199, in to
super().to(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 114.00 MiB (GPU 0; 21.67 GiB total capacity; 20.74 GiB already allocated; 55.75 MiB free; 21.03 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Attempt 2, trying to load in int8 again, but keeping GPU 0 as empty as possible:
python server.py --model llama-30b --lora alpaca-30b --gpu-memory 1 20 20
Still OOM on GPU 0.
Attempt 3, trying to load in fp16:
python server.py --model llama-30b --lora alpaca-30b --gpu-memory 21 21 21
This gives a similar OOM.
Screenshot
No response
Logs
See Reproduction.
System Info
pytorch: 2.0.0
cuda: 11.8
system: Linux
GPU: 2080Ti 22G * 3
I think it's likely because of this code:
params = {}
params['device_map'] = {'': 0}
#params['dtype'] = shared.model.dtype
shared.model = PeftModel.from_pretrained(shared.model, Path(f"loras/{lora_name}"), **params)
Note how it overwrites the 'device_map' that holds the information about which GPU each module should live on. As a result, it tries to place the whole model on the first GPU, and it fails because the model is too big.
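For illustration, a sharded device map built by accelerate assigns individual modules to devices, while the hardcoded map collapses everything onto GPU 0 (the module names below are made up, not the actual llama-30b layout):

# Hypothetical example of what shared.model.hf_device_map might look like for a
# model sharded across three GPUs by accelerate (illustrative module names):
sharded_map = {
    'model.embed_tokens': 0,
    'model.layers.0': 0,
    'model.layers.30': 1,
    'model.layers.59': 2,
    'model.norm': 2,
    'lm_head': 2,
}
# The hardcoded map in LoRA.py instead assigns the entire model to GPU 0:
hardcoded_map = {'': 0}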
@sgsdxzy I have managed to load a LoRA in 16-bit mode with CPU offloading using --gpu-memory 10000MiB, but I haven't tested multi-GPU setups. If you can find a modification to LoRA.py that makes this work, let me know.
@ortegaalfredo the code you pasted was updated yesterday: https://github.com/oobabooga/text-generation-webui/blob/main/modules/LoRA.py
@oobabooga Now that @ortegaalfredo has pinned down the problem, this is easy to fix by replicating the original device map:
params['device_map'] = {"base_model.model."+k: v for k, v in shared.model.hf_device_map.items()}
and it works now!
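For anyone landing here later, here is a minimal sketch of the patched loading path with that remapping. It is simplified and not necessarily the exact code that ended up in LoRA.py; it assumes shared is the webui's modules.shared and that shared.model exposes an hf_device_map when accelerate sharded it across GPUs.

from pathlib import Path
from peft import PeftModel
from modules import shared

def add_lora_to_model(lora_name):
    params = {}
    if hasattr(shared.model, 'hf_device_map'):
        # Reuse the per-module placement that accelerate built when the base
        # model was loaded, prefixing the keys with "base_model.model." so
        # they match the module names inside the PeftModel wrapper.
        params['device_map'] = {
            'base_model.model.' + k: v for k, v in shared.model.hf_device_map.items()
        }
    else:
        # Single-device fallback: the previous behaviour of putting the LoRA on GPU 0.
        params['device_map'] = {'': 0}
    shared.model = PeftModel.from_pretrained(
        shared.model, Path(f"loras/{lora_name}"), **params
    )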
I confirm that @sgsdxzy's patch now successfully loads alpaca-lora-30b on 2x 3090 GPUs using int8 quantization.
I have incorporated @sgsdxzy's device map here (thanks again for sharing the code):
https://github.com/oobabooga/text-generation-webui/commit/9bf6ecf9e2de9b72c3fa62e0e6f5b5e9041825b1
@sgsdxzy @ortegaalfredo can you test if things are working as expected on your end now?
I'm on 8747c74339cf1e7f1d45f4aa1dcc090e9eba94a3, and it now loads the LoRA and the 30b model on 2x 3090 with no problem.
I am wondering if the model.half() call is still necessary, as it can take several minutes for large models.
I got that from here https://github.com/tloen/alpaca-lora/blob/main/generate.py#L93
Without model.half(), I would get errors about "expected half but got float" while trying to generate text.
I think that if the model is already in half precision, there's probably a way to load the LoRA directly in the desired dtype, without re-converting the entire model from half to half again. I will experiment with it tonight. Anyway, that's beyond the scope of this issue, and this issue is fixed by https://github.com/oobabooga/text-generation-webui/commit/8747c74339cf1e7f1d45f4aa1dcc090e9eba94a3
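One possible way to skip the redundant conversion would be a guard along these lines. This is a hypothetical, untested sketch: it assumes shared.args carries the --load-in-8bit flag and that shared.model exposes a dtype attribute, as Hugging Face models generally do.

import torch
from modules import shared

# Only downcast when the base model is not already fp16 and we are not in
# 8-bit mode, avoiding a redundant half -> half pass over a 30B model.
if not shared.args.load_in_8bit and shared.model.dtype != torch.float16:
    shared.model = shared.model.half()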