
Inference with generate.py on GPU is very slow and utilization is at most 40%


I am using generate.py, and inference on the GPU is very slow, with utilization peaking at about 40%. I am running on an Nvidia T4 (16 GB). Is there any way to make the model run faster with higher GPU utilization?

load_8bit = True
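
For reference, a minimal sketch for measuring decode throughput, assuming model and tokenizer are already loaded the way generate.py loads them; the helper name is made up for illustration:

import time
import torch

# Hypothetical helper: measures generation throughput in new tokens per second
def tokens_per_second(model, tokenizer, prompt, max_new_tokens=128):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()  # wait for the GPU before stopping the clock
    elapsed = time.time() - start
    new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
    return new_tokens / elapsed

Comparing this number with load_8bit True vs. False makes the slowdown concrete.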

Thanks

bhupendrathore commented May 09 '23 13:05

I am encountering the same problem. I also tried to force full 8-bit precision with this:

import torch
from transformers import LlamaForCausalLM
from peft import PeftModel
from transformers.utils.bitsandbytes import replace_8bit_linear

model = LlamaForCausalLM.from_pretrained(
    base_model, load_in_8bit=load_8bit, torch_dtype=torch.float16, device_map={"": 0}
)
model = PeftModel.from_pretrained(model, lora_weights, torch_dtype=torch.float16, device_map={"": 0})
# Manually re-apply the int8 linear-layer replacement to the loaded model
model = replace_8bit_linear(model, threshold=6.0, modules_to_not_convert=None, current_key_name=None)
model.eval()

Now it throws an error:

AttributeError: 'NoneType' object has no attribute 'device'
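
A likely cause: from_pretrained with load_in_8bit=True already performs the int8 linear-layer replacement internally, so re-running replace_8bit_linear over the already-converted model may walk modules whose weights are not where it expects. A minimal sketch of the supported path, with the manual call dropped (base_model and lora_weights as defined above):

import torch
from transformers import LlamaForCausalLM
from peft import PeftModel

# load_in_8bit=True already triggers the int8 replacement inside from_pretrained
model = LlamaForCausalLM.from_pretrained(
    base_model, load_in_8bit=True, torch_dtype=torch.float16, device_map={"": 0}
)
model = PeftModel.from_pretrained(model, lora_weights, torch_dtype=torch.float16, device_map={"": 0})
model.eval()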

sulabh-salesken commented May 09 '23 13:05

Running with --load_8bit False is roughly twice as fast.
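
For comparison, a sketch of the fp16 path, roughly what generate.py does when load_8bit is False (base_model and lora_weights as above); the 7B weights take about 14 GB, which still fits on a 16 GB T4:

import torch
from transformers import LlamaForCausalLM
from peft import PeftModel

# Plain fp16 load: no bitsandbytes int8 matmul, hence the faster decode
model = LlamaForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map={"": 0}
)
model = PeftModel.from_pretrained(model, lora_weights, torch_dtype=torch.float16)
model.half()
model.eval()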

hyyfengshang commented Jun 08 '23 11:06