
Quantization of T5 failed: the int8 model costs more inference time and memory.

Open Worromots opened this issue 1 year ago • 2 comments

System Info

A100-80G, CUDA 12.1

bitsandbytes 0.43.2.dev0
diffusers 0.29.1
lion-pytorch 0.2.2
torch 2.0.1
torch-tb-profiler 0.4.3
torchvision 0.16.1+cu121
xformers 0.0.22
transformers 4.31.0

Reproduction

load code

    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

    model_name = 'google/flan-t5-xxl'
    quantization_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=5.1)

    # fp16 baseline:
    # model = AutoModelForSeq2SeqLM.from_pretrained(
    #     model_name, cache_dir=cache_dir, torch_dtype=torch.float16,
    # ).to(device).eval()
    model = AutoModelForSeq2SeqLM.from_pretrained(
        model_name,
        cache_dir=cache_dir,
        # torch_dtype=torch.float16,
        quantization_config=quantization_config,
    )

inference code

    text_tokens_and_mask = self.tokenizer(
        texts,
        max_length=self.model_max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        add_special_tokens=True,
        return_tensors='pt'
    )

    text_tokens_and_mask['input_ids'] = text_tokens_and_mask['input_ids']
    text_tokens_and_mask['attention_mask'] = text_tokens_and_mask['attention_mask']

    self.prof.step()
    with torch.no_grad():
        text_encoder_embs = self.model(
            input_ids=text_tokens_and_mask['input_ids'].to(self.device),
            attention_mask=text_tokens_and_mask['attention_mask'].to(self.device),
        )['last_hidden_state'].detach()
    return text_encoder_embs, text_tokens_and_mask['attention_mask'].to(self.device)
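When comparing the two encoder variants, it helps to warm up the model and synchronize the device before reading timestamps, since CUDA kernel launches are asynchronous and unsynchronized wall-clock times can be misleading. A minimal timing helper (hypothetical name, not part of the snippet above) might look like:

```python
import time

def time_fn(fn, warmup=3, iters=10, sync=None):
    """Return the mean wall-clock time of fn() in milliseconds.

    Pass sync=torch.cuda.synchronize when timing GPU work; without it,
    asynchronous CUDA launches make the measured times misleading.
    """
    for _ in range(warmup):  # warm up caches / autotuning before measuring
        fn()
    if sync:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync:
        sync()
    return (time.perf_counter() - start) / iters * 1000.0
```

For example, `time_fn(lambda: encode(texts), sync=torch.cuda.synchronize)` would time the encoder call above.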

Expected behavior

Loaded in int8 with BitsAndBytesConfig: 17443 MiB, T5 encoder time: 96 ms
Loaded in float16: 11759 MiB, T5 encoder time: 21 ms

The int8 model costs more inference time and memory than the fp16 model.

The torch profiler trace shows the quantized model is using the int matmul kernel.
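For context on why int8 can be slower here: LLM.int8() uses a mixed-precision decomposition in which columns containing an activation whose magnitude exceeds `llm_int8_threshold` are routed through an fp16 matmul, while the remaining columns go through the int8 path, so two matmuls run instead of one. A pure-Python sketch of just the column split (illustrative only, not the actual bitsandbytes kernels):

```python
def decompose_columns(activations, threshold):
    """Split column indices of a 2D list of activations into outlier
    columns (any value with |x| >= threshold) and regular columns --
    the split LLM.int8() performs before dispatching the fp16 and
    int8 matmuls respectively. Illustrative sketch only.
    """
    cols = len(activations[0])
    outliers = [
        j for j in range(cols)
        if any(abs(row[j]) >= threshold for row in activations)
    ]
    regular = [j for j in range(cols) if j not in outliers]
    return outliers, regular
```

With the `llm_int8_threshold=5.1` from the reproduction, any column containing an activation of magnitude 5.1 or above would be kept in fp16.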

Worromots avatar Jun 30 '24 16:06 Worromots

Hi, did you solve the issue?

weibaozi avatar Oct 22 '24 07:10 weibaozi

cc @matthewdouglas

Titus-von-Koeller avatar Oct 23 '24 17:10 Titus-von-Koeller

Hi, please try with bitsandbytes >= 0.45.0. In that release we improved the performance of int8 quantization. Feel free to open a new issue if upgrading does not adequately resolve the problem.
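To confirm whether an installed build (such as the reporter's 0.43.2.dev0) predates that release, a simplified version comparison can be done in plain Python (this helper is hypothetical and ignores PEP 440 pre-release ordering):

```python
def version_at_least(installed, required):
    """Return True if installed >= required, comparing dotted version
    strings numerically. Non-numeric segments such as 'dev0' are
    ignored -- a simplification, not full PEP 440 handling.
    """
    def parse(v):
        return [int(p) for p in v.split(".") if p.isdigit()]
    return parse(installed) >= parse(required)
```

For example, `version_at_least("0.43.2.dev0", "0.45.0")` is False, indicating an upgrade (e.g. `pip install -U bitsandbytes`) is needed.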

matthewdouglas avatar Feb 28 '25 18:02 matthewdouglas