Interactive: still occasionally a separate workload rather than a fully fledged scenario?

Open psyhtest opened this issue 5 months ago • 2 comments

It's confusing that mlperf.conf still uses separate workloads and the Server scenario for the Interactive case.

grep -Ri interactive inference/loadgen/*
docs/src/doxygen.cfg:# enable generation of interactive SVG images that allow zooming and panning.
docs/src/doxygen.cfg:INTERACTIVE_SVG        = YES
mlperf.conf:llama2-70b-interactive.*.performance_sample_count_override = 24576
mlperf.conf:llama3_1-405b-interactive.*.performance_sample_count_override = 8313
mlperf.conf:llama3_1-8b-interactive.*.performance_sample_count_override = 13368
mlperf.conf:llama2-70b-interactive.*.sample_concatenate_permutation = 1
mlperf.conf:llama3_1-405b-interactive.*.sample_concatenate_permutation = 1
mlperf.conf:llama3_1-8b-interactive.*.sample_concatenate_permutation = 1
mlperf.conf:llama2-70b-interactive.*.use_token_latencies = 1
mlperf.conf:llama3_1-405b-interactive.*.use_token_latencies = 1
mlperf.conf:llama3_1-8b-interactive.*.use_token_latencies = 1
mlperf.conf:# Target Latencies for interactive setting
mlperf.conf:llama2-70b-interactive.Server.target_latency = 0
mlperf.conf:llama2-70b-interactive.Server.ttft_latency = 450
mlperf.conf:llama2-70b-interactive.Server.tpot_latency = 40
mlperf.conf:# Target Latencies for interactive setting
mlperf.conf:llama3_1-405b-interactive.Server.target_latency = 0
mlperf.conf:llama3_1-405b-interactive.Server.ttft_latency = 4500
mlperf.conf:llama3_1-405b-interactive.Server.tpot_latency = 80
mlperf.conf:# Target Latencies for interactive setting
mlperf.conf:llama3_1-8b-interactive.Server.target_latency = 0
mlperf.conf:llama3_1-8b-interactive.Server.ttft_latency = 500
mlperf.conf:llama3_1-8b-interactive.Server.tpot_latency = 30
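
To make the point concrete, here is a minimal Python sketch (not part of loadgen; the `parse_conf` helper and `CONF_LINES` are hypothetical names introduced for illustration). It assumes each setting has the form `model.scenario.key = value`, as in the lines above, and groups the interactive latency targets by model and scenario. Every interactive entry ends up under the Server scenario of a separately named model.

```python
import re

# The interactive latency targets quoted from mlperf.conf above.
CONF_LINES = """\
llama2-70b-interactive.Server.ttft_latency = 450
llama2-70b-interactive.Server.tpot_latency = 40
llama3_1-405b-interactive.Server.ttft_latency = 4500
llama3_1-405b-interactive.Server.tpot_latency = 80
llama3_1-8b-interactive.Server.ttft_latency = 500
llama3_1-8b-interactive.Server.tpot_latency = 30
"""

# Assumed line shape: <model>.<scenario>.<key> = <value>
LINE_RE = re.compile(
    r"^(?P<model>[^.\s]+)\.(?P<scenario>[^.\s]+)\.(?P<key>\w+)\s*=\s*(?P<value>\S+)$"
)


def parse_conf(text):
    """Return {(model, scenario): {key: int_value}} for matching lines."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed shape
        key = (match["model"], match["scenario"])
        settings.setdefault(key, {})[match["key"]] = int(match["value"])
    return settings


interactive = parse_conf(CONF_LINES)
for (model, scenario), keys in sorted(interactive.items()):
    print(f"{model:28s} {scenario:8s} "
          f"ttft={keys['ttft_latency']} tpot={keys['tpot_latency']}")
```

The grouping key is `(model, scenario)`, so the output shows three "-interactive" model names, each paired with `Server`, rather than a single model with an Interactive scenario.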

Still, I assume that user.conf for Interactive is expected to look like:

*.Server.target_qps = <target QPS lower than for Server to meet more stringent latency constraints>
*.Server.min_duration = <minimum duration in milliseconds; at least 600,000>

(The * here is meant to match both llama2-70b-interactive and llama2-70b, whichever is the correct model name.)
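
As a sketch of what that assumed user.conf would contain, the hypothetical helper below (`interactive_user_conf` is not a loadgen or MLPerf tool) writes the two lines from a chosen target QPS. The 600,000 ms lower bound is taken from the text above; the QPS value in the usage line is a placeholder, not a recommendation.

```python
MIN_DURATION_MS = 600_000  # lower bound quoted in the post above


def interactive_user_conf(target_qps, min_duration_ms=MIN_DURATION_MS, model="*"):
    """Build the two assumed user.conf lines for an Interactive run.

    `model` defaults to the wildcard; pass an explicit name such as
    "llama2-70b-interactive" if the wildcard turns out not to apply.
    """
    if target_qps <= 0:
        raise ValueError("target_qps must be positive")
    if min_duration_ms < MIN_DURATION_MS:
        raise ValueError(f"min_duration must be at least {MIN_DURATION_MS} ms")
    lines = [
        f"{model}.Server.target_qps = {target_qps}",
        f"{model}.Server.min_duration = {min_duration_ms}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder QPS, for illustration only.
print(interactive_user_conf(target_qps=10.0), end="")
```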

psyhtest, Jul 24 '25 18:07

I'm not sure whether PR #2281 was required. For us, just fixing the missing comma as in PR #2283 seemed sufficient.

psyhtest, Jul 25 '25 11:07

@pgmpablo157321 Could you take a look?

hanyunfan, Sep 30 '25 16:09