alpaca-lora
Finetuned Model Inference error: AttributeError: 'NoneType' object has no attribute 'device'
Update: for anyone experiencing this issue, see the workaround I posted in https://github.com/tloen/alpaca-lora/issues/14#issuecomment-1471263165
I tried out the finetune script locally and it looks like there was no problem with it. However, when trying to run inference, I'm getting AttributeError: 'NoneType' object has no attribute 'device'
from bitsandbytes. I've checked, and it looks like an issue related to splitting the model between CPU and GPU, but I'm not sure which part of this repo is causing that. Any ideas?
Relevant issue in bitsandbytes: https://github.com/TimDettmers/bitsandbytes/issues/40
Which version of bitsandbytes are you on?
@devilismyfriend
0.37.0, the latest release.
I have the same issue. I only get it when I try to run inference with my local fine-tune; the downloaded one doesn't have the problem. I am on the latest bitsandbytes commit, built from source.
Maybe try allocating the foundation model on the CPU with device_map={'': 'cpu'}? That might save some VRAM for the LoRA model.
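For reference, a minimal sketch of what that suggestion could look like, assuming the base model is loaded with transformers' LlamaForCausalLM roughly as in generate.py; the checkpoint name is illustrative, and load_in_8bit is left out here because bitsandbytes' 8-bit path needs a GPU:

import torch
from transformers import LlamaForCausalLM

# Keep every module of the foundation model on the CPU;
# the empty-string key in an accelerate device_map matches the whole model.
model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",  # illustrative checkpoint name
    torch_dtype=torch.float16,
    device_map={"": "cpu"},
)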
Changing device_map to cpu did not help for me, still getting the same stack trace.
It looks like the downloaded model is using the {'base_model': 0} device map, which loads everything on the GPU.
Local finetune device map looks like:
{'base_model.model.model.embed_tokens': 0, 'base_model.model.model.layers.0': 0, 'base_model.model.model.layers.1': 0, 'base_model.model.model.layers.2': 0, 'base_model.model.model.layers.3': 0, 'base_model.model.model.layers.4': 0, 'base_model.model.model.layers.5': 0, 'base_model.model.model.layers.6': 0, 'base_model.model.model.layers.7': 0, 'base_model.model.model.layers.8': 0, 'base_model.model.model.layers.9': 0, 'base_model.model.model.layers.10': 0, 'base_model.model.model.layers.11': 0, 'base_model.model.model.layers.12': 0, 'base_model.model.model.layers.13': 0, 'base_model.model.model.layers.14': 0, 'base_model.model.model.layers.15': 0, 'base_model.model.model.layers.16': 0, 'base_model.model.model.layers.17': 0, 'base_model.model.model.layers.18': 0, 'base_model.model.model.layers.19': 0, 'base_model.model.model.layers.20': 0, 'base_model.model.model.layers.21': 0, 'base_model.model.model.layers.22': 0, 'base_model.model.model.layers.23': 0, 'base_model.model.model.layers.24': 0, 'base_model.model.model.layers.25': 0, 'base_model.model.model.layers.26': 0, 'base_model.model.model.layers.27': 'cpu', 'base_model.model.model.layers.28': 'cpu', 'base_model.model.model.layers.29': 'cpu', 'base_model.model.model.layers.30': 'cpu', 'base_model.model.model.layers.31': 'cpu', 'base_model.model.model.layers.32': 'cpu', 'base_model.model.model.layers.33': 'cpu', 'base_model.model.model.layers.34': 'cpu', 'base_model.model.model.layers.35': 'cpu', 'base_model.model.model.layers.36': 'cpu', 'base_model.model.model.layers.37': 'cpu', 'base_model.model.model.layers.38': 'cpu', 'base_model.model.model.layers.39': 'cpu', 'base_model.model.model.norm': 'cpu', 'base_model.model.lm_head': 'cpu'}
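If you want to check where your own load ended up, a quick sketch (assuming model is the object returned by PeftModel.from_pretrained; hf_device_map is only set when accelerate applied a device_map):

# Print the device placement accelerate decided on, if any.
if hasattr(model, "hf_device_map"):
    print(model.hf_device_map)
else:
    # Fall back to listing the devices the parameters actually live on.
    print({p.device for p in model.parameters()})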
@ItsLogic
Right now I am forcing device_map to use only the GPU, i.e. adding device_map={'': 0} to PeftModel.from_pretrained, which worked.
It looks like the issue is that PEFT's load will auto-apply a device_map if none is specified, which places some of the model weights on the CPU. This is unfortunately not compatible with bitsandbytes. Forcing PEFT to use only the GPU is the workaround I found.
Right now I am forcing device_map to use only the GPU, i.e. adding device_map={'': 0} to PeftModel.from_pretrained, which worked.
This seems to work for me as well. Cheers, now I can use my 13B LoRA.
Right now I am forcing device_map to use only the GPU, i.e. adding device_map={'': 0} to PeftModel.from_pretrained, which worked.
I had the same problem with the stock generate.py, and this fixed it for me as well. I can confirm it works on an RTX 3060 with 12 GB (9.9 GB in use), but nvtop reports only 30% GPU usage, so there's a bottleneck somewhere.
Also, uncommenting and executing the original test code failed on the last sample with an OOM error. Using the Gradio UI, I get about 1 GB of extra memory used after each request, so I'd say it's a leak. I added import gc; gc.collect() to generate and that seems to fix it, but long responses can still trigger OOM. Limiting tokens to 128 did help.
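A minimal sketch of that workaround, assuming evaluate() is the generation callback wired into the Gradio UI in generate.py (assumed name); the torch.cuda.empty_cache() call is an extra step beyond the plain gc.collect() mentioned above:

import gc
import torch

def evaluate_with_cleanup(*args, **kwargs):
    try:
        return evaluate(*args, **kwargs)  # original generation function (assumed name)
    finally:
        gc.collect()                      # drop Python-side references from the last request
        if torch.cuda.is_available():
            torch.cuda.empty_cache()      # release cached GPU memory held by PyTorch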
So to clarify, the change I had to apply was in generate.py:
model = PeftModel.from_pretrained(
    model, "tloen/alpaca-lora-7b",
    torch_dtype=torch.float16
)
change this to:
model = PeftModel.from_pretrained(
    model, "tloen/alpaca-lora-7b",
    torch_dtype=torch.float16,
    device_map={'': 0},  # force the whole LoRA-wrapped model onto GPU 0
)
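For what it's worth, the empty-string key in an accelerate device_map matches every module, so device_map={'': 0} places the entire PEFT-wrapped model on GPU 0 instead of letting accelerate split it between GPU and CPU.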
This may be fixed by this PEFT PR
This may be fixed by a recent PR on accelerate that supports weight quantization for the dispatch_model function. Related PR: https://github.com/huggingface/accelerate/pull/1237 - can you try to use the main branch of accelerate by installing it from source?
pip install git+https://github.com/huggingface/accelerate
https://github.com/huggingface/peft/issues/115#issuecomment-1504411743