Baichuan-13B: inference is slower after loading with 8-bit quantization

Open NLPerxue opened this issue 1 year ago • 6 comments

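I load the model as shown below, but inference becomes about 1.5x slower and the model's output quality drops noticeably. What is the cause?

    # Imports needed by the snippet; model_path is the local checkpoint path.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        load_in_4bit=False,
        load_in_8bit=True,  # 8-bit quantized loading
        device_map='auto',
        max_memory={i: '24000MB' for i in range(torch.cuda.device_count())},
        torch_dtype=torch.float32,
        trust_remote_code=True
    ).eval()
    model.generation_config = GenerationConfig.from_pretrained(model_path)
    # print(model.generation_config)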
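To put a number on the slowdown, it may help to time both configurations the same way. The following is a minimal sketch, not part of the original report: `tokens_per_second` and its `generate` argument are hypothetical names, and `generate` is assumed to be a zero-argument callable that runs one generation and returns only the newly generated token ids.

```python
import time
from typing import Callable, Sequence


def tokens_per_second(generate: Callable[[], Sequence[int]], runs: int = 3) -> float:
    """Average number of new tokens produced per second over `runs` calls.

    `generate` is a hypothetical zero-argument callable that performs one
    generation and returns the newly generated token ids.
    """
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += len(generate())
    elapsed = time.perf_counter() - start
    # Guard against a zero interval for trivially fast callables.
    return total_tokens / max(elapsed, 1e-9)
```

To compare, one could wrap `model.generate` for the 8-bit model and for an unquantized load, using the same prompt and the same `max_new_tokens`, and pass each wrapper to this helper. For a causal LM the returned sequence normally includes the prompt ids, so the wrapper should slice them off before returning. On GPU, calling `torch.cuda.synchronize()` inside the wrapper before it returns keeps the timing honest.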