Issue with Running benchmark.py
I am encountering an issue while using the following setup:
- Tool: openvino.genai/tools/llm_bench/benchmark.py
- Device: Intel Meteor Lake
- Environment: Ubuntu 24.04 / OpenVINO 2024.5.0 / GPU Driver / NPU Driver
- Model used:
  - CPU / iGPU: llama-3.1-8b-instruct (INT4)
  - NPU: llama-3-8b-instruct (INT4-NPU)
When running the command, I received the following error message:
```
python3 benchmark.py -m <path>/llama-3.1-8b-instruct/ -n 2 -d CPU -p "What is large language model (LLM)?" -ic 50
```
Message:
Run on CPU:

```
(llama3) adv@adv-Default-string:~/Downloads/openvino.genai/tools/llm_bench$ python3 benchmark.py -m /opt/Advantech/EdgeAISuite/Intel_Standard/GenAI/LLM/model/llama-3.1-8b-instruct/ -n 2 -d CPU -p "What is large language model (LLM)? please reply under 100 words" -ic 50
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
[ INFO ] ==SUCCESS FOUND==: use_case: text_gen, model_type: llama-3.1-8b-instruct
[ INFO ] OV Config={'CACHE_DIR': ''}
[ WARNING ] It is recommended to set the environment variable OMP_WAIT_POLICY to PASSIVE, so that OpenVINO inference can use all CPU resources without waiting.
[ INFO ] The num_beams is 1, update Torch thread num from 16 to 8, avoid to use the CPU cores for OpenVINO inference.
[ INFO ] Model path=/opt/Advantech/EdgeAISuite/Intel_Standard/GenAI/LLM/model/llama-3.1-8b-instruct, openvino runtime version: 2024.5.0-17288-7975fa5da0c-refs/pull/3856/head
[ INFO ] Selected OpenVINO GenAI for benchmarking
[ INFO ] Pipeline initialization time: 1.47s
[ INFO ] Numbeams: 1, benchmarking iter nums(exclude warm-up): 2, prompt nums: 1, prompt idx: [0]
[ INFO ] [warm-up][P0] Input text: What is large language model (LLM)? please reply under 100 words
[ ERROR ] An exception occurred
[ INFO ] Traceback (most recent call last):
  File "/home/adv/Downloads/openvino.genai/tools/llm_bench/benchmark.py", line 229, in main
    iter_data_list, pretrain_time, iter_timestamp = CASE_TO_BENCH[model_args['use_case']](
  File "/home/adv/Downloads/openvino.genai/tools/llm_bench/task/text_generation.py", line 515, in run_text_generation_benchmark
    text_gen_fn(input_text, num, model, tokenizer, args, iter_data_list, md5_list,
  File "/home/adv/Downloads/openvino.genai/tools/llm_bench/task/text_generation.py", line 305, in run_text_generation_genai
    inference_durations = (np.array(perf_metrics.raw_metrics.token_infer_durations) / 1000 / 1000).tolist()
AttributeError: 'openvino_genai.py_openvino_genai.RawPerfMetrics' object has no attribute 'token_infer_durations'. Did you mean: 'tokenization_durations'?
```
Run on NPU:
```
(llama3) adv@adv-Default-string:~/Downloads/openvino.genai/tools/llm_bench$ python3 benchmark.py -m /home/adv/Downloads/openvino_notebooks/notebooks/llm-chatbot/llama/INT4-NPU_compressed_weights/ -n 2 -d NPU -p "What is large language model (LLM)? please reply under 100 words" -ic 50
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
[ INFO ] ==SUCCESS FOUND==: use_case: text_gen, model_type: llama
[ INFO ] OV Config={'CACHE_DIR': ''}
[ INFO ] Model path=/home/adv/Downloads/openvino_notebooks/notebooks/llm-chatbot/llama/INT4-NPU_compressed_weights, openvino runtime version: 2024.5.0-17288-7975fa5da0c-refs/pull/3856/head
[ INFO ] Selected OpenVINO GenAI for benchmarking
[ INFO ] Pipeline initialization time: 21.64s
[ INFO ] Numbeams: 1, benchmarking iter nums(exclude warm-up): 2, prompt nums: 1, prompt idx: [0]
[ INFO ] [warm-up][P0] Input text: What is large language model (LLM)? please reply under 100 words
[ ERROR ] An exception occurred
[ INFO ] Traceback (most recent call last):
  File "/home/adv/Downloads/openvino.genai/tools/llm_bench/benchmark.py", line 229, in main
    iter_data_list, pretrain_time, iter_timestamp = CASE_TO_BENCH[model_args['use_case']](
  File "/home/adv/Downloads/openvino.genai/tools/llm_bench/task/text_generation.py", line 515, in run_text_generation_benchmark
    text_gen_fn(input_text, num, model, tokenizer, args, iter_data_list, md5_list,
  File "/home/adv/Downloads/openvino.genai/tools/llm_bench/task/text_generation.py", line 305, in run_text_generation_genai
    inference_durations = (np.array(perf_metrics.raw_metrics.token_infer_durations) / 1000 / 1000).tolist()
AttributeError: 'openvino_genai.py_openvino_genai.RawPerfMetrics' object has no attribute 'token_infer_durations'. Did you mean: 'tokenization_durations'?
```
Could you please help me resolve this issue?
@tim102187S You need to update your openvino-genai package and install it from the nightly repository:
```
pip install -U --pre --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly openvino_tokenizers openvino openvino-genai
```
After that, these metrics become available. Alternatively, you can switch to the 2024/6 branch if you would like a version of llm_bench compatible with the latest stable release.
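For anyone who cannot update immediately: the crash happens because the older wheel's `RawPerfMetrics` lacks the `token_infer_durations` attribute that llm_bench reads. A defensive sketch like the one below (the `OldRawPerfMetrics` class and helper are hypothetical, for illustration only; they are not part of llm_bench) shows how the conversion at `text_generation.py` line 305 could degrade gracefully instead of raising `AttributeError`:

```python
import numpy as np

# Hypothetical stand-in for the RawPerfMetrics object shipped in the older
# 2024.5 wheel: it exposes tokenization_durations, but not the
# token_infer_durations attribute that llm_bench expects.
class OldRawPerfMetrics:
    tokenization_durations = [120.0, 95.0]  # microseconds

def token_infer_durations_sec(raw_metrics):
    # Mirror llm_bench's microseconds -> seconds conversion, but return an
    # empty list when the attribute is missing instead of raising.
    durations_us = getattr(raw_metrics, "token_infer_durations", None)
    if durations_us is None:
        return []  # metric unavailable in this openvino-genai build
    return (np.array(durations_us) / 1000 / 1000).tolist()

print(token_infer_durations_sec(OldRawPerfMetrics()))  # → []
```

With the nightly wheel installed, the attribute exists and the same helper returns the per-token durations in seconds.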
I was having the same issue, and the suggestion from @eaidova resolved it. Thank you.
I have recently updated my OpenVINO-GenAI package and switched to version 2024/6. While the issue with the CPU has been resolved, I am now encountering a problem when using the NPU.
- Tool: openvino.genai/tools/llm_bench/benchmark.py
- Device: Intel Meteor Lake
- Environment: Ubuntu 24.04 / OpenVINO 2024.5.0 / NPU Driver 1.10.0
- Model used:
  - NPU: llama-3-8b-instruct (INT4-NPU)
When running the following command:
```
python3 benchmark.py -m /home/adv/Downloads/openvino_notebooks/notebooks/llm-chatbot/llama/INT4-NPU_compressed_weights/ -n 2 -d NPU -p "What is large language model (LLM)?" -ic 50
```
Message:
Run on NPU (after approximately 5 minutes of execution):

```
(openvino_one) adv@adv-Default-string:~/Downloads/openvino.genai/tools/llm_bench$ python3 benchmark.py -m /home/adv/Downloads/openvino_notebooks/notebooks/llm-chatbot/llama/INT4-NPU_compressed_weights/ -n 2 -d NPU -p "What is large language model (LLM)" -ic 50
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
[ INFO ] ==SUCCESS FOUND==: use_case: text_gen, model_type: llama
[ INFO ] OV Config={'CACHE_DIR': ''}
[ INFO ] Model path=/home/adv/Downloads/openvino_notebooks/notebooks/llm-chatbot/llama/INT4-NPU_compressed_weights, openvino runtime version: 2025.0.0-17709-688f0428cfc
Segmentation fault (core dumped)
```
@tim102187S I see you listed the NPU driver as version 1.10.0; could you try the latest v1.10.1 version here? This version contains a fix for LLMs. Please give it a go and let us know if the issue persists.
Hello @avitial, I have updated my NPU driver to the latest v1.10.1 version as suggested, but I am still encountering the same issue. Please let me know if there are any additional steps I should take or further details I can provide to help resolve this.
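One way to narrow this down is to take llm_bench out of the picture and load the model on the NPU with a bare `openvino_genai.LLMPipeline` call. The sketch below assumes the nightly openvino-genai wheel is installed; the model directory argument is a placeholder you would point at your INT4-NPU compressed weights. If this minimal script also segfaults, the crash is in the GenAI runtime or NPU plugin/driver stack rather than in benchmark.py:

```python
# Minimal NPU reproducer, independent of llm_bench (sketch only).
def run_npu_repro(model_dir: str) -> str:
    # Imported lazily so the sketch parses even without the wheel installed.
    import openvino_genai as ov_genai

    # LLMPipeline(model_dir, device) compiles the model for the NPU plugin;
    # a segfault at this point would implicate the runtime/driver stack.
    pipe = ov_genai.LLMPipeline(model_dir, "NPU")
    return pipe.generate("What is large language model (LLM)?", max_new_tokens=50)

# Usage (placeholder path):
#   print(run_npu_repro("/path/to/INT4-NPU_compressed_weights"))
```

Attaching a backtrace from the core dump (e.g. via gdb) alongside this repro would also help the maintainers pinpoint where the fault occurs.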