[Good First Issue][NNCF]: Add INT8 weight compression conformance test for Tinyllama-1.1b PyTorch model
Context
This issue proposes adding a test to the post-training compression conformance suite to verify that the weights of the Tinyllama-1.1b PyTorch model can be compressed to INT8 within a given time while preserving an acceptable level of model accuracy on whowhatbench.
INT8 weight compression is a popular approach to reducing LLM model size by quantizing the weights from their original floating-point precision to INT8, leading to smaller model footprints and potentially faster inference on target devices without a significant accuracy drop.
Here is a code snippet showing how to compress the weights of the Tinyllama-1.1b PyTorch model using NNCF:
import torch

import nncf
import transformers

MODEL_ID = "tinyllama/tinyllama-1.1b-step-50k-105b"
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_ID)
model = transformers.AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16, device_map="cpu")
text = 'The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens.'
token = tokenizer(text, max_length=500, return_tensors="pt", truncation=True)
inputs = {"input_ids": token["input_ids"], "attention_mask": token["attention_mask"]}
compressed_model = nncf.compress_weights(model, dataset=nncf.Dataset([inputs]))
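The scenario added below is called "data_free" because INT8 compression needs no statistics-gathering dataset; the nncf.Dataset in the snippet above mainly supplies example inputs for the PyTorch backend (that reading is an assumption on my part). A minimal variant with the mode spelled out to match the scenario's compression_params:

# Same call as above, with the mode made explicit. INT8_ASYM is also the default
# mode of nncf.compress_weights(), so this is equivalent to the snippet above.
compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT8_ASYM,
    dataset=nncf.Dataset([inputs]),
)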
What needs to be done?
Add an INT8 weight compression test for the Tinyllama-1.1b PyTorch model to the post-training compression conformance suite so that the test can be run with the following command:
pytest tests/post_training/test_quantize_conformance.py::test_weight_compression -s --data=<path to data folder> -k [tinyllama_int8_data_free_backend_PT]
The task steps:
- Study the post-training compression conformance suite
- Add the following test scenario to model_scope:
{
    "reported_name": "tinyllama_int8_data_free",
    "model_id": "tinyllama/tinyllama-1.1b-step-50k-105b",
    "pipeline_cls": LMWeightCompression,
    "compression_params": {
        "mode": CompressWeightsMode.INT8_ASYM,
    },
    "backends": [BackendType.TORCH],
},
- Add PyTorch backend support to the LMWeightCompression class.
- Collect golds and add them to the reference file
Example Pull Requests
https://github.com/openvinotoolkit/nncf/pull/2425
Resources
Contact points
@AlexanderDokuchaev, @alexsu52
Ticket
ref: 130788
Hi, is it possible to take this one?
Hello @RedShift51, the task is assigned to you.
Thank you for looking into this issue! Please let us know if you have any questions or require any help.
Hey, what metric value is okay for tinyllama/tinyllama-1.1b-step-50k-105b?
Hey,
Similarity metric between the float16 and the INT8 weight-compressed tinyllama-1.1b-step-50k-105b model on whowhatbench: similarity: 0.9628345480671635
Code to reproduce:
import torch
import whowhatbench
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
import nncf
MODEL_ID = "tinyllama/tinyllama-1.1b-step-50k-105b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16, device_map="auto")
evaluator = whowhatbench.Evaluator(base_model=model, tokenizer=tokenizer)
text = "The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens."
token = tokenizer(text, max_length=500, return_tensors="pt", truncation=True)
inputs = {"input_ids": token["input_ids"].cuda(), "attention_mask": token["attention_mask"].cuda()}
compressed_model = nncf.compress_weights(model, dataset=nncf.Dataset([inputs]))
metrics_per_prompt, metrics = evaluator.score(compressed_model)
metric_of_interest = "similarity"
print(metric_of_interest, ": ", metrics["similarity"][0])
Hi, sorry for the delay. I have reproduced it on a CPU:
import torch
import nncf
import transformers
import whowhatbench
MODEL_ID = "tinyllama/tinyllama-1.1b-step-50k-105b"
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_ID)
model = transformers.AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16, device_map="cpu")
text = 'The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens.'
token = tokenizer(text, max_length=500, return_tensors="pt", truncation=True)
inputs = {"input_ids": token["input_ids"], "attention_mask": token["attention_mask"]}
compressed_model = nncf.compress_weights(model, dataset=nncf.Dataset([inputs]))
evaluator = whowhatbench.Evaluator(base_model=compressed_model, tokenizer=tokenizer)
metrics_per_prompt, metrics = evaluator.score(compressed_model)
print(metrics)
metric_of_interest = "similarity"
print(metric_of_interest, ": ", metrics["similarity"][0])
worst_examples = evaluator.worst_examples(top_k=5, metric=metric_of_interest)
print("Metric: ", metric_of_interest)
The main idea of whowhatbench is to compare original_model and compressed_model. But you have compared compressed_model with compressed_model in your code, so as expected you get a similarity metric of 1.
# collect outputs of original_model
evaluator = whowhatbench.Evaluator(base_model=model, tokenizer=tokenizer)
# inplace weight model compression
compressed_model = nncf.compress_weights(model, dataset=nncf.Dataset([inputs]))
# collect outputs of compressed model and calculate the similarity metric.
metrics_per_prompt, metrics = evaluator.score(compressed_model)
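For completeness, the corrected end-to-end flow would look roughly like this (a sketch stitched together from the snippets above, not part of the conformance suite):

import torch
import transformers
import whowhatbench

import nncf

MODEL_ID = "tinyllama/tinyllama-1.1b-step-50k-105b"
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_ID)
model = transformers.AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16, device_map="cpu")

# Collect outputs of the original float16 model BEFORE compression.
evaluator = whowhatbench.Evaluator(base_model=model, tokenizer=tokenizer)

text = "The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens."
token = tokenizer(text, max_length=500, return_tensors="pt", truncation=True)
inputs = {"input_ids": token["input_ids"], "attention_mask": token["attention_mask"]}

# In-place weight compression: the returned object is the same (modified) model.
compressed_model = nncf.compress_weights(model, dataset=nncf.Dataset([inputs]))

# Collect outputs of the compressed model and compute the similarity metric.
metrics_per_prompt, metrics = evaluator.score(compressed_model)
print("similarity:", metrics["similarity"][0])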
@RedShift51, are you going to continue working on this issue? Do you have any updates?
Removed assignment due to inactivity.
.take
Thank you for looking into this issue! Please let us know if you have any questions or require any help.
@alexsu52 @ksj20 Any updates on this issue? If the assignee isn't going to work on this, I'd be down to take it.
.take
Thank you for looking into this issue! Please let us know if you have any questions or require any help.
@alexsu52 @AlexanderDokuchaev If I add the following code to LMWeightCompression.compress() and then run a benchmark right after using whowhatbench, how should I store the metrics?
Also, please tell me if I am going in the right direction; this approach feels a bit odd so far.
class LMWeightCompression(BaseTestPipeline):
    ...

    def compress(self) -> None:
        if self.backend == BackendType.FP32:
            return
        elif self.backend == BackendType.TORCH:
            start_time = time.perf_counter()
            MODEL_ID = "tinyllama/tinyllama-1.1b-step-50k-105b"
            tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_ID)
            self.model = transformers.AutoModelForCausalLM.from_pretrained(
                MODEL_ID, torch_dtype=torch.float16, device_map="cpu"
            )
            text = "The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens."
            token = tokenizer(text, max_length=500, return_tensors="pt", truncation=True)
            inputs = {"input_ids": token["input_ids"], "attention_mask": token["attention_mask"]}
            self.run_info.compression_memory_usage = memory_usage((self._compress_torch, (inputs,)), max_usage=True)
            self.run_info.time_compression = time.perf_counter() - start_time
            return

        print("Weight compression...")
        start_time = time.perf_counter()
        self.run_info.compression_memory_usage = memory_usage(self._compress, max_usage=True)
        self.run_info.time_compression = time.perf_counter() - start_time

    def _compress_torch(self, inputs):
        self.compressed_model = nncf.compress_weights(self.model, dataset=nncf.Dataset([inputs]))

    ...
@alexsu52 @AlexanderDokuchaev following up on the above^
Hi @AdiKsOnDev
Add a _validate function to LMWeightCompression that will contain the call of the evaluator from whowhatbench.
Example of a _validate function: https://github.com/openvinotoolkit/nncf/blob/develop/tests/post_training/pipelines/image_classification_timm.py#L127
Metrics should be stored in self.run_info:
https://github.com/openvinotoolkit/nncf/blob/0b407de48da8d6e30fdb11325e8a1913139d5cfb/tests/post_training/pipelines/image_classification_timm.py#L170-L171
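A minimal sketch of what such a _validate could look like (the run_info fields follow the timm example referenced above; loading a fresh float16 copy for the gold outputs and the exact whowhatbench call are assumptions, not the final implementation):

def _validate(self) -> None:
    # Load a fresh float16 copy of the model to collect the reference ("gold") outputs,
    # since nncf.compress_weights() modifies self.model in place.
    tokenizer = transformers.AutoTokenizer.from_pretrained(self.model_id)
    float_model = transformers.AutoModelForCausalLM.from_pretrained(
        self.model_id, torch_dtype=torch.float16, device_map="cpu"
    )
    evaluator = whowhatbench.Evaluator(base_model=float_model, tokenizer=tokenizer)

    # Score the compressed model against the float16 reference outputs.
    _, all_metrics = evaluator.score(self.compressed_model)
    similarity = all_metrics["similarity"][0]

    # Store the metric in run_info so the suite can compare it with the golds
    # (see image_classification_timm.py#L170-L171 above).
    self.run_info.metric_name = "Similarity"
    self.run_info.metric_value = round(similarity, 5)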
OK, thanks for the directions
@AlexanderDokuchaev _validate(self) already exists in LMWeightCompression (Git Blame).
@AlexanderDokuchaev I added the following code for INT8 support, do you want me to send a PR?
def compress(self) -> None:
    if self.backend == BackendType.FP32:
        return
    elif self.backend == BackendType.TORCH:
        start_time = time.perf_counter()
        tokenizer = transformers.AutoTokenizer.from_pretrained(self.model_id)
        self.model = transformers.AutoModelForCausalLM.from_pretrained(
            self.model_id, torch_dtype=torch.float16, device_map="cpu"
        )
        text = "The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens."
        token = tokenizer(text, max_length=500, return_tensors="pt", truncation=True)
        inputs = {"input_ids": token["input_ids"], "attention_mask": token["attention_mask"]}
        self.run_info.compression_memory_usage = memory_usage((self._compress_torch, (inputs,)), max_usage=True)
        self.run_info.time_compression = time.perf_counter() - start_time
        return

    print("Weight compression...")
    start_time = time.perf_counter()
    self.run_info.compression_memory_usage = memory_usage(self._compress, max_usage=True)
    self.run_info.time_compression = time.perf_counter() - start_time

def _compress_torch(self, inputs):
    self.compressed_model = nncf.compress_weights(self.model, dataset=nncf.Dataset([inputs]))