
Intel NPU operation related

Open Oneul-hyeon opened this issue 1 year ago • 2 comments

Hello

I want to run an on-device sLM on the NPU that is built into my "Intel(R) Core(TM) Ultra 5" processor.

However, while I confirmed that the code below works on the CPU and iGPU, no answer is produced when I select the NPU.

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer
import time

def make_template(context):
    instruction = f"""You are an assistant who translates meeting contents.
Translate the meeting contents given after #Context into English.

#Context:{context}

#Translation:"""

    messages = [{"role": "user", "content": instruction}]

    # Build the prompt tensor using the model's chat template
    input_ids = tokenizer.apply_chat_template(messages,
                                              add_generation_prompt=True,
                                              return_tensors="pt")

    return input_ids

def translate(context):
    input_ids = make_template(context=context)
    outputs = model.generate(input_ids,
                             max_new_tokens=max_new_tokens,
                             do_sample=do_sample,
                             temperature=temperature,
                             top_p=top_p)

    # Decode only the newly generated tokens (skip the prompt)
    answer = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)

    return answer.rstrip()

if __name__ == "__main__":
    model_id = "AIFunOver/gemma-2-2b-it-openvino-8bit"
    model = OVModelForCausalLM.from_pretrained(model_id, device="npu")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(f"Model Device : {model.device}")

    max_new_tokens=1024
    do_sample=False
    temperature=0.1
    top_p=0.9

    context = '''A: Hello.
B: Oh, yes, hello. I'm contacting you because I have a question. They're doing water pipe construction in my neighborhood, and I'm curious as to how long it will take.
A: Where is your area?
B: Daejeon Byeundae-dong.
A: The construction will continue until tomorrow, sir.
B: Oh really? Oh, but won't there be muddy water after the construction is over?
A: It's better to let out enough water before using it after the construction is over, sir.
B: How much water should I drain?
A: Let out for 2~3 minutes.
B: Okay, I understand. Then, can there be another problem?
A: The water pressure may temporarily drop slightly.
B: Temporarily?
A: Yes, it's a temporary phenomenon and will return to normal pressure right away.
B: What should I do if it lasts a long time?
A: In that case, you can report it to the Waterworks Headquarters.
B: Yes, I understand.
B: But they say it's going to rain tomorrow, so can the construction be finished tomorrow? I think they usually don't do construction on rainy days?
A: In case of rain, construction may be slightly delayed. If it doesn't rain too much, construction will proceed as scheduled. Customer, please don't worry too much.
B: Oh, yes, I understand. Thank you.
A: Yes, thank you.'''

    start_time = time.time()
    generated_text = translate(context)
    end_time = time.time()

    print("generated_text:", generated_text)

    num_generated_tokens = len(tokenizer.tokenize(generated_text))
    total_time = end_time - start_time
    avg_token_speed = total_time / num_generated_tokens if num_generated_tokens > 0 else float('inf')

    print(f"Total Inference Time : {total_time} s")
    print(f"Average token generation speed: {avg_token_speed:.4f} seconds/token")

However, NPU is listed among the devices currently available to OpenVINO on this machine.

[Screenshot: list of available OpenVINO devices, including NPU]
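
For reference, the device list in the screenshot can be reproduced with a short check (a minimal sketch, assuming the openvino Python package is installed):

import openvino as ov

# Lists the devices OpenVINO can see on this machine,
# e.g. ['CPU', 'GPU', 'NPU'] on a Core Ultra system with current drivers
print(ov.Core().available_devices)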

If there is a way to use the NPU, could you tell me how?

Thank you.

Oneul-hyeon avatar Dec 19 '24 09:12 Oneul-hyeon

@Oneul-hyeon currently optimum-intel does not support sLM inference on NPU, but there is another solution that allows this and works on the same optimum-intel converted models. Please check this guide: https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide-npu.html
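
The linked guide runs the model through the OpenVINO GenAI LLMPipeline rather than OVModelForCausalLM. A minimal sketch of that approach, assuming the openvino-genai package is installed and the model has already been exported to a local OpenVINO IR directory (the path below is only a placeholder):

import openvino_genai as ov_genai

# Placeholder path: a local directory containing the OpenVINO IR model and tokenizer,
# e.g. one produced with `optimum-cli export openvino ...`
model_dir = "./gemma-2-2b-it-openvino"

# Create the generation pipeline directly on the NPU device
pipe = ov_genai.LLMPipeline(model_dir, "NPU")

# Run a simple generation request
print(pipe.generate("Translate the meeting contents into English:", max_new_tokens=256))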

eaidova avatar Dec 19 '24 12:12 eaidova

@eaidova I had done experiments with this on Windows, and the conclusion was that it does not work (on Windows): the process exits without providing any error output when you try that guide. See e.g. https://github.com/openvinotoolkit/openvino.genai/issues/1358

endomorphosis avatar Jan 05 '25 16:01 endomorphosis