alpaca-lora
I am using generate.py and inference on GPU is very slow, with utilization at most 40%.
I am running on an NVIDIA T4 (16 GB). Is there any way to make the model run faster and with higher GPU utilization?
load_8bit = True
Thanks
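
For reference, this is roughly how I measure generation speed, so the slowdown can be quantified in tokens/sec rather than just GPU utilization (a minimal sketch; the prompt and generation settings are placeholders, and `model` / `tokenizer` are the objects generate.py already builds):

import time
import torch

# Time a single generate() call and report tokens/sec.
inputs = tokenizer("Tell me about alpacas.", return_tensors="pt").to("cuda")
torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.time() - start
new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")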
I am also encountering the same problem. I also tried to use full 8-bit precision with this:
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM
from transformers.utils.bitsandbytes import replace_8bit_linear

# Load the base model in 8-bit, attach the LoRA weights, then try to convert the remaining linear layers to 8-bit
model = LlamaForCausalLM.from_pretrained(base_model, load_in_8bit=load_8bit, torch_dtype=torch.float16, device_map={'': 0})
model = PeftModel.from_pretrained(model, lora_weights, torch_dtype=torch.float16, device_map={'': 0})
model = replace_8bit_linear(model, threshold=6.0, modules_to_not_convert=None, current_key_name=None)
model.eval()
Now it throws an error:
AttributeError: 'NoneType' object has no attribute 'device'
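
If it helps narrow this down: my understanding (an assumption, not verified against the transformers source) is that from_pretrained(..., load_in_8bit=True) already applies the 8-bit conversion while the weights are still being materialized, so a second replace_8bit_linear pass over the fully loaded model may hit modules whose weights it can no longer relocate. A quick check that the layers were already converted on load:

import bitsandbytes as bnb
import torch

# Count linear-layer types; with load_in_8bit=True most should already be Linear8bitLt
counts = {}
for m in model.modules():
    if isinstance(m, (torch.nn.Linear, bnb.nn.Linear8bitLt)):
        name = type(m).__name__
        counts[name] = counts.get(name, 0) + 1
print(counts)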
Using --load_8bit False is half as fast.
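
For anyone comparing the two paths: as far as I can tell (treat this as an assumption about generate.py, not confirmed), --load_8bit False amounts to a plain fp16 load, roughly:

# fp16 load without 8-bit quantization (sketch; base_model and lora_weights as above)
model = LlamaForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16, device_map={'': 0})
model = PeftModel.from_pretrained(model, lora_weights, torch_dtype=torch.float16, device_map={'': 0})
model.eval()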