
[Good First Issue][NNCF]: Add INT8 weight compression conformance test for Tinyllama-1.1b PyTorch model

Open · alexsu52 opened this issue 4 months ago · 6 comments

Context

This issue proposes adding a test to the post-training compression conformance suite to verify that the weights of the Tinyllama-1.1b PyTorch model can be compressed to INT8 within a given time budget while preserving an acceptable level of model accuracy on whowhatbench.

INT8 weight compression is a popular approach to reducing LLM size by quantizing the weights from their original floating-point precision to INT8, leading to a smaller model footprint and potentially faster inference on target devices without a significant accuracy drop.

Here is a code snippet showing how to compress the weights of the Tinyllama-1.1b PyTorch model using NNCF:

import torch

import nncf
import transformers

MODEL_ID = "tinyllama/tinyllama-1.1b-step-50k-105b"

tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_ID)
model = transformers.AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16, device_map="cpu")

text = 'The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens.'
token = tokenizer(text, max_length=500, return_tensors="pt", truncation=True)
inputs = {"input_ids": token["input_ids"], "attention_mask": token["attention_mask"]}

compressed_model = nncf.compress_weights(model, dataset=nncf.Dataset([inputs]))
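
Note that INT8 weight compression itself is data-free (no calibration statistics are needed), hence the data_free part of the test name below; for the PyTorch backend, the dataset argument here appears to serve mainly to provide an example input with which NNCF traces the model graph.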

What needs to be done?

Add an INT8 weight compression test for the Tinyllama-1.1b PyTorch model to the post-training compression conformance suite so that the test can be run with the following command:

pytest tests/post_training/test_quantize_conformance.py::test_weight_compression -s --data=<path to data folder> -k [tinyllama_int8_data_free_backend_PT]
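
The -k expression filters by test id; the id tinyllama_int8_data_free_backend_PT presumably combines the reported_name from the model scope entry below with a backend suffix (PT for the PyTorch backend).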

The task steps:

  • Add the tinyllama_int8_data_free configuration to the conformance suite's model scope:

    {
        "reported_name": "tinyllama_int8_data_free",
        "model_id": "tinyllama/tinyllama-1.1b-step-50k-105b",
        "pipeline_cls": LMWeightCompression,
        "compression_params": {
            "mode": CompressWeightsMode.INT8_ASYM,
        },
        "backends": [BackendType.TORCH],
    },
  • Add PyTorch backend support to the LMWeightCompression class (a hedged sketch follows below).
  • Collect golden metric values ("golds") and add them to the reference file.
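
For the backend-support step, a minimal hypothetical sketch of the kind of branching involved, assuming the module's existing imports (torch, transformers, BackendType); the method name prepare_model and the attribute names are assumptions, not the actual NNCF test API:

def prepare_model(self) -> None:
    if self.backend == BackendType.TORCH:
        # load the HF checkpoint as a plain PyTorch model instead of
        # exporting it to an OpenVINO IR
        self.model_hf = transformers.AutoModelForCausalLM.from_pretrained(
            self.model_id, torch_dtype=torch.float16, device_map="cpu"
        )
        self.model = self.model_hf
    else:
        # the existing OpenVINO path stays as-is
        ...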

Example Pull Requests

https://github.com/openvinotoolkit/nncf/pull/2425

Resources

Contact points

@AlexanderDokuchaev, @alexsu52

Ticket

ref: 130788

alexsu52 · Feb 28 '24

Hi, is it possible to take this one?

RedShift51 · Feb 28 '24

Hello @RedShift51, the task is assigned to you.

Thank you for looking into this issue! Please let us know if you have any questions or require any help.

alexsu52 · Feb 29 '24

Hey, what metric value is acceptable for tinyllama/tinyllama-1.1b-step-50k-105b?

RedShift51 · Feb 29 '24

Hey,

The similarity metric between the float16 and the INT8 weight-compressed tinyllama-1.1b-step-50k-105b model on whowhatbench is: similarity: 0.9628345480671635

Code to reproduce:

import torch
import whowhatbench
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

import nncf

MODEL_ID = "tinyllama/tinyllama-1.1b-step-50k-105b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16, device_map="auto")

# collect reference outputs of the original (float16) model before compression
evaluator = whowhatbench.Evaluator(base_model=model, tokenizer=tokenizer)

text = "The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens."
token = tokenizer(text, max_length=500, return_tensors="pt", truncation=True)
inputs = {"input_ids": token["input_ids"].cuda(), "attention_mask": token["attention_mask"].cuda()}

# compression happens in place, so the evaluator above must be created first
compressed_model = nncf.compress_weights(model, dataset=nncf.Dataset([inputs]))

metrics_per_prompt, metrics = evaluator.score(compressed_model)

metric_of_interest = "similarity"
print(metric_of_interest, ": ", metrics["similarity"][0])

alexsu52 · Mar 01 '24

Hi, sorry for the delay, I have reproduced it on a CPU (screenshot attached):

import torch
import nncf
import transformers
import whowhatbench

MODEL_ID = "tinyllama/tinyllama-1.1b-step-50k-105b"

tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_ID)
model = transformers.AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16, device_map="cpu")

text = 'The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens.'
token = tokenizer(text, max_length=500, return_tensors="pt", truncation=True)
inputs = {"input_ids": token["input_ids"], "attention_mask": token["attention_mask"]}

compressed_model = nncf.compress_weights(model, dataset=nncf.Dataset([inputs]))


evaluator = whowhatbench.Evaluator(base_model=compressed_model, tokenizer=tokenizer)
metrics_per_prompt, metrics = evaluator.score(compressed_model)
print(metrics)
metric_of_interest = "similarity"
print(metric_of_interest, ": ", metrics["similarity"][0])

worst_examples = evaluator.worst_examples(top_k=5, metric=metric_of_interest)
print("Metric: ", metric_of_interest)

RedShift51 · Mar 07 '24

The main idea of whowhatbench is to compare the original_model and the compressed_model. But in your code you compared compressed_model with compressed_model, so as expected you got a similarity metric of 1.

# collect outputs of the original model
evaluator = whowhatbench.Evaluator(base_model=model, tokenizer=tokenizer)
# in-place weight compression of the model
compressed_model = nncf.compress_weights(model, dataset=nncf.Dataset([inputs]))
# collect outputs of the compressed model and calculate the similarity metric
metrics_per_prompt, metrics = evaluator.score(compressed_model)
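
Putting it together, a corrected end-to-end CPU version of your script would look roughly like this:

import torch
import transformers
import whowhatbench

import nncf

MODEL_ID = "tinyllama/tinyllama-1.1b-step-50k-105b"

tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_ID)
model = transformers.AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16, device_map="cpu")

# capture the original model's outputs before the in-place compression
evaluator = whowhatbench.Evaluator(base_model=model, tokenizer=tokenizer)

text = "The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens."
token = tokenizer(text, max_length=500, return_tensors="pt", truncation=True)
inputs = {"input_ids": token["input_ids"], "attention_mask": token["attention_mask"]}

compressed_model = nncf.compress_weights(model, dataset=nncf.Dataset([inputs]))

# score the compressed model against the original outputs
metrics_per_prompt, metrics = evaluator.score(compressed_model)
print("similarity:", metrics["similarity"][0])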

alexsu52 · Mar 08 '24

@RedShift51, are you going to continue working on this issue? Do you have any updates?

alexsu52 · Mar 26 '24

Removed assignment due to inactivity.

alexsu52 · Mar 28 '24

.take

ksj20 · Mar 28 '24

Thank you for looking into this issue! Please let us know if you have any questions or require any help.

github-actions[bot] · Mar 28 '24

@alexsu52 @ksj20 Any updates on this issue? If the assignee isn't going to work on this, I'd be down to take it.

AdiKsOnDev · Apr 08 '24

.take

AdiKsOnDev · Apr 08 '24

Thank you for looking into this issue! Please let us know if you have any questions or require any help.

github-actions[bot] · Apr 08 '24

@alexsu52 @AlexanderDokuchaev If I add the following code to LMWeightCompression.compress() and then run a benchmark right after using whowhatbench, how should I store the metrics? Also, please tell me if I am going in the right direction; this approach feels a bit odd so far.

class LMWeightCompression(BaseTestPipeline):
...

    def compress(self) -> None:
        if self.backend == BackendType.FP32:
            return
        elif self.backend == BackendType.TORCH:
            start_time = time.perf_counter()
            MODEL_ID = "tinyllama/tinyllama-1.1b-step-50k-105b"

            tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_ID)
            self.model = transformers.AutoModelForCausalLM.from_pretrained(
                MODEL_ID, torch_dtype=torch.float16, device_map="cpu"
            )

            text = "The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens."
            token = tokenizer(text, max_length=500, return_tensors="pt", truncation=True)
            inputs = {"input_ids": token["input_ids"], "attention_mask": token["attention_mask"]}

            # pass a (func, args) tuple so memory_usage profiles the call itself
            self.run_info.compression_memory_usage = memory_usage((self._compress_torch, (inputs,)), max_usage=True)
            self.run_info.time_compression = time.perf_counter() - start_time

            return

        print("Weight compression...")
        start_time = time.perf_counter()
        self.run_info.compression_memory_usage = memory_usage(self._compress, max_usage=True)
        self.run_info.time_compression = time.perf_counter() - start_time

    def _compress_torch(self, inputs):
        self.compressed_model = nncf.compress_weights(self.model, dataset=nncf.Dataset([inputs]))

...

AdiKsOnDev · Apr 09 '24

> @alexsu52 @AlexanderDokuchaev If I add the following code to LMWeightCompression.compress() and then run a benchmark right after using whowhatbench, how should I store the metrics? Also, please tell me if I am going in the right direction; this approach feels a bit odd so far.

@alexsu52 @AlexanderDokuchaev following up on the above^

AdiKsOnDev · Apr 10 '24

Hi @AdiKsOnDev

Add a _validate function to LMWeightCompression that will contain the call to the whowhatbench evaluator.

Example of a _validate function: https://github.com/openvinotoolkit/nncf/blob/develop/tests/post_training/pipelines/image_classification_timm.py#L127

Metrics should be stored in self.run_info: https://github.com/openvinotoolkit/nncf/blob/0b407de48da8d6e30fdb11325e8a1913139d5cfb/tests/post_training/pipelines/image_classification_timm.py#L170-L171
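
For illustration, a minimal sketch of how this could fit together; the attribute names (self.evaluator, self.tokenizer, self.model_hf) and the metric bookkeeping are assumptions modeled on the timm pipeline, not the actual API. The key point is to create the evaluator on the original model before the in-place compression:

# in compress(), before nncf.compress_weights() modifies the model in place
# (hypothetical attribute names):
self.evaluator = whowhatbench.Evaluator(base_model=self.model_hf, tokenizer=self.tokenizer)

# then, in _validate(), score the compressed model against those reference outputs:
def _validate(self) -> None:
    metrics_per_prompt, metrics = self.evaluator.score(self.compressed_model)
    # store the result in run_info, as the timm pipeline does
    self.run_info.metric_name = "Similarity"
    self.run_info.metric_value = float(metrics["similarity"][0])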

AlexanderDokuchaev · Apr 10 '24

> Hi @AdiKsOnDev
>
> Add a _validate function to LMWeightCompression that will contain the call to the whowhatbench evaluator.
>
> Example of a _validate function: https://github.com/openvinotoolkit/nncf/blob/develop/tests/post_training/pipelines/image_classification_timm.py#L127
>
> Metrics should be stored in self.run_info: https://github.com/openvinotoolkit/nncf/blob/0b407de48da8d6e30fdb11325e8a1913139d5cfb/tests/post_training/pipelines/image_classification_timm.py#L170-L171

OK, thanks for the directions

AdiKsOnDev · Apr 10 '24

@AlexanderDokuchaev _validate(self) already exists in LMWeightCompression (screenshot attached).

Git blame screenshot attached as well.

AdiKsOnDev · Apr 12 '24

@AlexanderDokuchaev I added the following code for INT8 support; do you want me to send a PR?

def compress(self) -> None:
    if self.backend == BackendType.FP32:
        return
    elif self.backend == BackendType.TORCH:
        start_time = time.perf_counter()

        tokenizer = transformers.AutoTokenizer.from_pretrained(self.model_id)
        self.model = transformers.AutoModelForCausalLM.from_pretrained(
            self.model_id, torch_dtype=torch.float16, device_map="cpu"
        )

        text = "The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens."
        token = tokenizer(text, max_length=500, return_tensors="pt", truncation=True)
        inputs = {"input_ids": token["input_ids"], "attention_mask": token["attention_mask"]}

        # pass a (func, args) tuple so memory_usage profiles the call itself
        self.run_info.compression_memory_usage = memory_usage((self._compress_torch, (inputs,)), max_usage=True)
        self.run_info.time_compression = time.perf_counter() - start_time

        return

    print("Weight compression...")
    start_time = time.perf_counter()
    self.run_info.compression_memory_usage = memory_usage(self._compress, max_usage=True)
    self.run_info.time_compression = time.perf_counter() - start_time

def _compress_torch(self, inputs):
    self.compressed_model = nncf.compress_weights(self.model, dataset=nncf.Dataset([inputs]))
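
(Note: memory_usage from memory_profiler accepts either a callable or a (func, args) tuple; the tuple form is used above so that the profiler executes _compress_torch itself rather than receiving its None return value.)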

AdiKsOnDev · Apr 12 '24