
CUMULATIVE THROUGHPUT performance drop

SearchSavior opened this issue 9 months ago • 3 comments

Hello!

I am testing out OpenVINO for multi-GPU and am getting really terrible performance. I want to understand why disabling stateful inference causes performance to degrade so severely, and I am interested in contributing.

With stateful disabled and CUMULATIVE_THROUGHPUT enabled, I get ~13.77 t/s on 2x Arc A770s.

With stateful enabled and LATENCY, I get ~25 t/s on 1x Arc A770.

Test code:

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer
import time


model_dir = "Echo9Zulu/phi-4-int4_asym-awq-se-ns-ov" # Converted with stateful disabled
# model_dir = "Echo9Zulu/phi-4-int4_asym-awq-se-ov" # Converted with stateful enabled 

device = "AUTO:GPU.0,GPU.1" # Here we are using the AUTO plugin prefix
#device = "GPU.1"

ov_config = {
    "PERFORMANCE_HINT": "CUMULATIVE_THROUGHPUT",  # the high-level performance hint for multi-GPU
    # "PERFORMANCE_HINT": "LATENCY"
}

model = OVModelForCausalLM.from_pretrained(model_dir, device=device, ov_config=ov_config)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

prompt = "This is a test of the multi gpu performance hint.?"

inputs = tokenizer(prompt, return_tensors="pt")


start_time = time.time()
outputs = model.generate(**inputs, max_new_tokens=128)
end_time = time.time()

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

# Calculate performance metrics
input_length = len(inputs.input_ids[0])
output_length = len(outputs[0])
new_tokens = output_length - input_length
total_time = end_time - start_time
tokens_per_second = new_tokens / total_time

print(f"Generated {new_tokens} new tokens in {total_time:.2f} seconds")
print(f"Throughput: {tokens_per_second:.2f} tokens/second")

Am I missing something here? Additionally, it's hard to tell on Linux whether weights are actually distributed across devices; I can only infer from htop that we are not using CPU/system memory. I want to implement support for this runtime feature in my project OpenArc but have not been able to get it working, especially for models whose compressed weights fit into the VRAM budget but whose uncompressed weights exceed it.
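
For reference, below is a minimal sketch of how I have been trying to confirm which devices AUTO actually picks, by querying the compiled model's EXECUTION_DEVICES property and per-device memory statistics. The model path is a placeholder, and GPU_MEMORY_STATISTICS is my assumption from the GPU plugin docs:

import openvino as ov

core = ov.Core()

# Compile through AUTO across both GPUs, then ask the runtime which
# devices it actually selected for this model.
model = core.read_model("openvino_model.xml")  # placeholder path
compiled = core.compile_model(
    model,
    "AUTO:GPU.0,GPU.1",
    {"PERFORMANCE_HINT": "CUMULATIVE_THROUGHPUT"},
)
print(compiled.get_property("EXECUTION_DEVICES"))

# Assumption: the GPU plugin exposes GPU_MEMORY_STATISTICS per device;
# non-zero usage on both cards would suggest the weights are distributed.
for dev in ("GPU.0", "GPU.1"):
    print(dev, core.get_property(dev, "GPU_MEMORY_STATISTICS"))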

SearchSavior avatar Mar 18 '25 02:03 SearchSavior

@SearchSavior stateful is the default and recommended scenario for running LLMs; stateless models are expected to be less performant, since OpenVINO applies additional optimizations in the stateful scenario:

  1. reduced overhead from copying large tensors between host and device memory and back (i.e., the past key values transfer)
  2. the opportunity to store the KV cache in device-friendly layouts, precisions, formats, etc., which gives better utilization and room for fusing some operations for efficiency
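
For example, here is a minimal sketch of exporting both variants with optimum-intel (assuming the stateful keyword is honored at export time in your installed version):

from optimum.intel import OVModelForCausalLM

# Default export: stateful, KV cache held in the model's internal state.
stateful_model = OVModelForCausalLM.from_pretrained("microsoft/phi-4", export=True)
stateful_model.save_pretrained("phi-4-ov")

# Assumption: stateful=False skips the stateful transformation, so past key
# values become explicit model inputs/outputs instead of internal state.
stateless_model = OVModelForCausalLM.from_pretrained(
    "microsoft/phi-4", export=True, stateful=False
)
stateless_model.save_pretrained("phi-4-ns-ov")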

eaidova avatar Mar 18 '25 04:03 eaidova

That makes sense. But where does the C++ code for this live? Is it compiled into the binary? Most examples I have seen on GitHub seem to be at an API level, especially since the Python API for OpenVINO GenAI points to a C++ API. Will add more later!

For now, can you show me what code is missing from this snippet to infer models when stateful is disabled, and which hints to use for multi-GPU, like in Transformers? Some of the docs are confusing or lack clear examples for this use case.

import openvino_genai as ov_genai

model_dir = "dump-your-model-here"

pipe = ov_genai.LLMPipeline(
    model_dir,  # Path to the model directory
    "GPU.0",    # Define the device to use
)

generation_config = ov_genai.GenerationConfig(
    max_new_tokens=128
)

prompt = "We don't even have a chat template so strap in and let it ride!"

result = pipe.generate([prompt], generation_config=generation_config)
perf_metrics = result.perf_metrics

print(f'Generate duration: {perf_metrics.get_generate_duration().mean:.2f}')
print(f'TTFT: {perf_metrics.get_ttft().mean:.2f} ms')
print(f'TPOT: {perf_metrics.get_tpot().mean:.2f} ms/token')
print(f'Throughput: {perf_metrics.get_throughput().mean:.2f} tokens/s')

print(result)

SearchSavior avatar Mar 18 '25 16:03 SearchSavior

An example would be how to properly use HETERO in this snippet, and whether to use AUTO or MULTI.
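
For instance, is something like the sketch below the intended way to target both GPUs? I am assuming here that LLMPipeline accepts OpenVINO properties as keyword arguments, and that AUTO with CUMULATIVE_THROUGHPUT has superseded the older MULTI plugin:

import openvino_genai as ov_genai

model_dir = "dump-your-model-here"

# Assumption: listing both devices under AUTO plus the CUMULATIVE_THROUGHPUT
# hint spreads requests across GPUs (rather than using the MULTI plugin).
pipe = ov_genai.LLMPipeline(
    model_dir,
    "AUTO:GPU.0,GPU.1",
    PERFORMANCE_HINT="CUMULATIVE_THROUGHPUT",
)

generation_config = ov_genai.GenerationConfig(max_new_tokens=128)
result = pipe.generate(["Testing both GPUs."], generation_config=generation_config)
print(result)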

SearchSavior avatar Mar 18 '25 16:03 SearchSavior