intel-extension-for-transformers
Load Quantized model
I'm using the following code to load the model in 4-bit:
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
import torch
from datetime import datetime
# Hugging Face model_id or local model
model_name = "microsoft/phi-2"
prompt = "Once upon a time, there existed a little girl,"
print(datetime.now(), "Start loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
print(datetime.now(), "End loading tokenizer...")
model = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True, use_llm_runtime=False, trust_remote_code=True)
print(datetime.now(), "End loading model...")
print(datetime.now(), "Start generating...")
outputs = model.generate(input_ids, streamer=streamer, max_new_tokens=300)
print(datetime.now(), "Done generating...")
But it's not cached, and the quantization process runs every time.
Is there a way to save/cache the quantized model and load it?
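Ideally, something along these lines would work (just a sketch of what I'm hoping for, not verified: it assumes save_pretrained persists the 4-bit weights and that from_pretrained can reload them, and quantized_model_dir is only an illustrative path):

from intel_extension_for_transformers.transformers import AutoModelForCausalLM

quantized_model_dir = "./phi-2-4bit"  # illustrative local path

# Quantize once and save the result (assumption: save_pretrained keeps the 4-bit weights)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", load_in_4bit=True, use_llm_runtime=False, trust_remote_code=True)
model.save_pretrained(quantized_model_dir)

# On later runs, load the already-quantized model instead of re-quantizing
model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir, trust_remote_code=True)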
@murilocurti,
Which version or commit of intel_extension_for_transformers are you using?
@zhenwei-intel package version 1.2.3.dev163 on Windows
@murilocurti
I think this has been fixed in the just-released v1.3: https://github.com/intel/intel-extension-for-transformers/releases/tag/v1.3
Can you reinstall it and try again?
Related: Is LLM Runtime / Neural Speed support planned for Phi-2?
We will support it soon.
@murilocurti The latest Neural Speed already supports Phi-2; you can try it now.
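A minimal sketch of how to try it, based on the snippet above (note: whether the flag is named use_llm_runtime or use_neural_speed depends on the installed version, so treat the flag name as an assumption and check the release notes):

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
streamer = TextStreamer(tokenizer)

# Assumption: enabling the runtime flag routes 4-bit generation through Neural Speed
model = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True, use_llm_runtime=True, trust_remote_code=True)

input_ids = tokenizer("Once upon a time, there existed a little girl,", return_tensors="pt").input_ids
outputs = model.generate(input_ids, streamer=streamer, max_new_tokens=300)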
I will try it and report back here. Thanks!