CUMULATIVE THROUGHPUT performance drop
Hello!
I am testing OpenVINO for multi-GPU inference and getting very poor performance. I want to understand why disabling stateful inference causes performance to degrade so severely, and I am interested in contributing.
With stateful disabled and CUMULATIVE_THROUGHPUT enabled I get ~13.77 t/s on 2x Arc A770s.
With stateful enabled and LATENCY I get ~25 t/s on 1x Arc A770.
Test code:
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer
import time
model_dir = "Echo9Zulu/phi-4-int4_asym-awq-se-ns-ov" # Converted with stateful disabled
# model_dir = "Echo9Zulu/phi-4-int4_asym-awq-se-ov" # Converted with stateful enabled
device = "AUTO:GPU.0,GPU.1" # Here we are using the AUTO plugin prefix
#device = "GPU.1"
ov_config = {
"PERFORMANCE_HINT": "CUMULATIVE_THROUGHPUT", # Cumulative throughput is the high level performance hint for multi gpu
# "PERFORMANCE_HINT": "LATENCY"
}
model = OVModelForCausalLM.from_pretrained(model_dir, device=device, ov_config=ov_config)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
prompt = "This is a test of the multi gpu performance hint.?"
inputs = tokenizer(prompt, return_tensors="pt")
start_time = time.time()
outputs = model.generate(**inputs, max_new_tokens=128)
end_time = time.time()
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
# Calculate performance metrics
input_length = len(inputs.input_ids[0])
output_length = len(outputs[0])
new_tokens = output_length - input_length
total_time = end_time - start_time
tokens_per_second = new_tokens / total_time
print(f"Generated {new_tokens} new tokens in {total_time:.2f} seconds")
print(f"Throughput: {tokens_per_second:.2f} tokens/second")
Am I missing something here? Additionally, it's hard to tell on Linux whether the weights are actually distributed across devices; I can only infer from htop that we are not using CPU/system memory. I want to implement support for this runtime feature in my project OpenArc but have not been able to get it working, especially for models whose compressed weights fit within the VRAM budget but whose uncompressed weights exceed it.
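The closest thing I have found for checking device placement is reading the EXECUTION_DEVICES property off a compiled model, though that only shows which devices AUTO selected, not how the weights are split between them. A minimal sketch with the core OpenVINO API (the IR path is a placeholder):
import openvino as ov

core = ov.Core()
# Compile the exported IR directly so the compiled model can be inspected.
compiled = core.compile_model(
    "openvino_model.xml",  # placeholder path to the exported IR
    "AUTO:GPU.0,GPU.1",
    {"PERFORMANCE_HINT": "CUMULATIVE_THROUGHPUT"},
)

# Reports which physical devices AUTO actually selected for execution.
print(compiled.get_property("EXECUTION_DEVICES"))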
@SearchSavior stateful is the default and recommended scenario for running LLMs; it is expected that stateless models are less performant, because OpenVINO applies additional optimizations in the stateful scenario:
- reduced overhead from copying large tensors between host and device memory and back (i.e. the past key values transfer)
- the opportunity to store the KV cache in device-friendly layouts, precisions, formats, etc., which gives better utilisation and room to fuse some operations for efficiency.
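A quick way to see the difference in an exported IR is to inspect the model inputs; this is a minimal sketch, and the IR path is a placeholder for your exported model:
import openvino as ov

core = ov.Core()
model = core.read_model("openvino_model.xml")  # placeholder path to the exported IR
input_names = [port.get_any_name() for port in model.inputs]

# Stateless export: past_key_values.* appear as ordinary inputs, so the whole
# KV cache is copied host <-> device on every generated token.
kv_inputs = [name for name in input_names if "past_key_values" in name]

# Stateful export: the cache lives in internal device-side state instead, and a
# beam_idx input is added so that state can be reordered during beam search.
print(f"past_key_values inputs exposed: {len(kv_inputs)}")
print(f"beam_idx input present: {'beam_idx' in input_names}")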
That makes sense. But where does the C++ code for this live? Is it compiled into the binary? Most examples I have seen on GitHub seem to be at the API level, especially since the Python API for OpenVINO GenAI points to a C++ API. Will add more later!
For now, can you show me what code is missing from this snippet to run inference when stateful is disabled, and which hints to use for multi-GPU, like in the Transformers-style snippet above? Some of the docs are confusing or lack clear examples for this use case.
import openvino_genai as ov_genai

model_dir = "dump-your-model-here"

pipe = ov_genai.LLMPipeline(
    model_dir,  # Path to the model directory
    "GPU.0",    # Define the device to use
)

generation_config = ov_genai.GenerationConfig(max_new_tokens=128)

prompt = "We don't even have a chat template so strap in and let it ride!"

result = pipe.generate([prompt], generation_config=generation_config)
perf_metrics = result.perf_metrics

print(f'Generate duration: {perf_metrics.get_generate_duration().mean:.2f}')
print(f'TTFT: {perf_metrics.get_ttft().mean:.2f} ms')
print(f'TPOT: {perf_metrics.get_tpot().mean:.2f} ms/token')
print(f'Throughput: {perf_metrics.get_throughput().mean:.2f} tokens/s')

print(result)
An example of how to properly use HETERO in this snippet would help, as would guidance on whether to use AUTO or MULTI.
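For reference, this is how I have been trying to pass the multi-GPU device string and hint through LLMPipeline; I am assuming the extra keyword arguments are forwarded to compilation as plugin properties, which I have not been able to confirm:
import openvino_genai as ov_genai

model_dir = "dump-your-model-here"  # same placeholder as above

# Assumption: the AUTO:GPU.0,GPU.1 device string works the same way as in the
# optimum-intel snippet, and PERFORMANCE_HINT is forwarded as a compile property.
pipe = ov_genai.LLMPipeline(
    model_dir,
    "AUTO:GPU.0,GPU.1",
    PERFORMANCE_HINT="CUMULATIVE_THROUGHPUT",
)

generation_config = ov_genai.GenerationConfig(max_new_tokens=128)
result = pipe.generate(["This is a test of the multi gpu performance hint."], generation_config=generation_config)
print(f'Throughput: {result.perf_metrics.get_throughput().mean:.2f} tokens/s')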