intel-extension-for-transformers
Load Quantized model
I'm using the following code to load the model in 4-bit:
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
import torch
from datetime import datetime
# Hugging Face model_id or local model
model_name = "microsoft/phi-2"
prompt = "Once upon a time, there existed a little girl,"
print(datetime.now(), "Start loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
print(datetime.now(), "End loading tokenizer...")
model = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True, use_llm_runtime=False, trust_remote_code=True)
print(datetime.now(), "End loading model...")
print(datetime.now(), "Start generating...")
outputs = model.generate(input_ids, streamer=streamer, max_new_tokens=300)
print(datetime.now(), "Done generating...")
But it's not cached, and the quantization process runs every time.
Is there a way to save/cache the quantized model and load it?
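Ideally, something along these lines would work (just a sketch of what I'm hoping for, not verified: it assumes save_pretrained persists the 4-bit weights and that from_pretrained can reload them, and quantized_model_dir is only an illustrative path):

from intel_extension_for_transformers.transformers import AutoModelForCausalLM

quantized_model_dir = "./phi-2-4bit"  # illustrative local path

# Quantize once and save the result (assumption: save_pretrained keeps the 4-bit weights)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", load_in_4bit=True, use_llm_runtime=False, trust_remote_code=True)
model.save_pretrained(quantized_model_dir)

# On later runs, load the already-quantized model instead of re-quantizing
model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir, trust_remote_code=True)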
@murilocurti,
Which version or commit of intel_extension_for_transformers are you using?
@zhenwei-intel package version 1.2.3.dev163 on Windows
@murilocurti
I think this has been fixed in the just-released v1.3: https://github.com/intel/intel-extension-for-transformers/releases/tag/v1.3
Can you reinstall it and try again?
Related: Is LLM Runtime / Neural Speed support planned for Phi-2?
We will support it soon.
@murilocurti The latest Neural Speed already supports Phi-2; you can try it now.
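A minimal sketch of how to try it, based on the snippet above (note: whether the flag is named use_llm_runtime or use_neural_speed depends on the installed version, so treat the flag name as an assumption and check the release notes):

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
streamer = TextStreamer(tokenizer)

# Assumption: enabling the runtime flag routes 4-bit generation through Neural Speed
model = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True, use_llm_runtime=True, trust_remote_code=True)

input_ids = tokenizer("Once upon a time, there existed a little girl,", return_tensors="pt").input_ids
outputs = model.generate(input_ids, streamer=streamer, max_new_tokens=300)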
I will try it and report back here. Thanks!