Quantization of T5 failed: the int8 model costs more inference time and memory.
System Info
A100-80G
cuda12.1
bitsandbytes 0.43.2.dev0
diffusers 0.29.1
lion-pytorch 0.2.2
torch 2.0.1
torch-tb-profiler 0.4.3
torchvision 0.16.1+cu121
xformers 0.0.22
transformers 4.31.0
Reproduction
load code
import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

model_name = 'google/flan-t5-xxl'
quantization_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=5.1)

# fp16 baseline:
# model = AutoModelForSeq2SeqLM.from_pretrained(
#     model_name, cache_dir=cache_dir, torch_dtype=torch.float16,
# ).to(device).eval()

# int8 via bitsandbytes:
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    # torch_dtype=torch.float16,
    quantization_config=quantization_config,
)
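For reference, a quick sanity check (my own sketch, not part of the original report; `Linear8bitLt` is the bitsandbytes linear module class and `get_memory_footprint` is the standard transformers helper) to confirm which layers were converted and how much memory the weights take:

import bitsandbytes as bnb

# Count the linear layers that were actually replaced by int8 modules,
# then report the weight memory footprint of the loaded model.
n_int8 = sum(1 for m in model.modules() if isinstance(m, bnb.nn.Linear8bitLt))
print(f'int8 linear layers: {n_int8}')
print(f'weight memory footprint: {model.get_memory_footprint() / 1024**2:.0f} MiB')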
inference code
text_tokens_and_mask = self.tokenizer(
    texts,
    max_length=self.model_max_length,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors='pt'
)
self.prof.step()
with torch.no_grad():
    text_encoder_embs = self.model(
        input_ids=text_tokens_and_mask['input_ids'].to(self.device),
        attention_mask=text_tokens_and_mask['attention_mask'].to(self.device),
    )['last_hidden_state'].detach()
return text_encoder_embs, text_tokens_and_mask['attention_mask'].to(self.device)
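To make the latency numbers below reproducible, here is a minimal timing sketch (my own addition; `encoder`, `input_ids`, and `attention_mask` are placeholders for whatever `self.model` and the tokenized inputs refer to in the snippet above). CUDA events keep host-side overhead out of the measurement:

import torch

def time_encoder(encoder, input_ids, attention_mask, n_warmup=3, n_runs=10):
    # Warm-up iterations so one-time setup cost is excluded from the timing.
    for _ in range(n_warmup):
        with torch.no_grad():
            encoder(input_ids=input_ids, attention_mask=attention_mask)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_runs):
        with torch.no_grad():
            encoder(input_ids=input_ids, attention_mask=attention_mask)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / n_runs  # milliseconds per forward pass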
Expected behavior
Loaded in int8 with BitsAndBytesConfig: 17443 MiB, T5 encoder cost time: 96 ms
Loaded in float16: 11759 MiB, T5 encoder cost time: 21 ms
The int8 model costs more inference time and memory than the fp16 model. The torch profiler trace shows that the quantized model is using the int8 matmul kernels.
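For reference, a profile like the one mentioned above can be collected with torch.profiler (a minimal sketch; `encoder`, `input_ids`, and `attention_mask` are again placeholders for the objects from the reproduction code):

import torch
from torch.profiler import profile, ProfilerActivity

# Profile one forward pass and list the kernels with the most CUDA time.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        encoder(input_ids=input_ids, attention_mask=attention_mask)
print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=15))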
Hi, did you solve the issue?
cc @matthewdouglas
Hi, please try with bitsandbytes >= 0.45.0. That release improved the performance of int8 quantization. Feel free to open a new issue if upgrading does not adequately resolve the problem.