Quantization of T5 failed: the int8 model costs more inference time and memory.
System Info
A100-80G
cuda12.1
bitsandbytes 0.43.2.dev0
diffusers 0.29.1
lion-pytorch 0.2.2
torch 2.0.1
torch-tb-profiler 0.4.3
torchvision 0.16.1+cu121
xformers 0.0.22
transformers 4.31.0
Reproduction
load code
import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

model_name = 'google/flan-t5-xxl'
quantization_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=5.1)

# fp16 baseline:
# model = AutoModelForSeq2SeqLM.from_pretrained(
#     model_name, cache_dir=cache_dir, torch_dtype=torch.float16,
# ).to(device).eval()

# int8 via bitsandbytes:
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    # torch_dtype=torch.float16,
    quantization_config=quantization_config,
)
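For reference, a quick sanity check (my own sketch, not part of the original report; `Linear8bitLt` is the bitsandbytes linear module class and `get_memory_footprint` is the standard transformers helper) to confirm which layers were converted and how much memory the weights take:

import bitsandbytes as bnb

# Count the linear layers that were actually replaced by int8 modules,
# then report the weight memory footprint of the loaded model.
n_int8 = sum(1 for m in model.modules() if isinstance(m, bnb.nn.Linear8bitLt))
print(f'int8 linear layers: {n_int8}')
print(f'weight memory footprint: {model.get_memory_footprint() / 1024**2:.0f} MiB')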
inference code
text_tokens_and_mask = self.tokenizer(
    texts,
    max_length=self.model_max_length,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors='pt'
)
self.prof.step()
with torch.no_grad():
    text_encoder_embs = self.model(
        input_ids=text_tokens_and_mask['input_ids'].to(self.device),
        attention_mask=text_tokens_and_mask['attention_mask'].to(self.device),
    )['last_hidden_state'].detach()
return text_encoder_embs, text_tokens_and_mask['attention_mask'].to(self.device)
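To make the latency numbers below reproducible, here is a minimal timing sketch (my own addition; `encoder`, `input_ids`, and `attention_mask` are placeholders for whatever `self.model` and the tokenized inputs refer to in the snippet above). CUDA events keep host-side overhead out of the measurement:

import torch

def time_encoder(encoder, input_ids, attention_mask, n_warmup=3, n_runs=10):
    # Warm-up iterations so one-time setup cost is excluded from the timing.
    for _ in range(n_warmup):
        with torch.no_grad():
            encoder(input_ids=input_ids, attention_mask=attention_mask)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_runs):
        with torch.no_grad():
            encoder(input_ids=input_ids, attention_mask=attention_mask)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / n_runs  # milliseconds per forward pass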
Expected behavior
Loaded in int8 with BitsAndBytesConfig: 17443 MiB, T5 encoder cost time: 96 ms
Loaded in float16: 11759 MiB, T5 encoder cost time: 21 ms
The int8 model costs more inference time and memory than the fp16 model. The torch profiler trace shows that the quantized model is using the int8 matmul kernels.
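For reference, a profile like the one mentioned above can be collected with torch.profiler (a minimal sketch; `encoder`, `input_ids`, and `attention_mask` are again placeholders for the objects from the reproduction code):

import torch
from torch.profiler import profile, ProfilerActivity

# Profile one forward pass and list the kernels with the most CUDA time.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        encoder(input_ids=input_ids, attention_mask=attention_mask)
print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=15))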
Hi, did you solve the issue?
cc @matthewdouglas
Hi, please try with bitsandbytes >= 0.45.0. That release improved the performance of int8 quantization. Feel free to open a new issue if upgrading does not adequately resolve the problem.