
Merge LLM interactive scenarios into the benchmark (as a separate server scenario, instead of a separate benchmark)

Open nvzhihanj opened this issue 7 months ago • 7 comments

As titled. The required changes might be:

  • LoadGen to support Server scenarios with more than one set of latency thresholds (TTFT/TPOT); a sketch of what this could look like follows this list
  • A user-facing flag (probably in userSettings) to select which latency set a run targets
  • The accuracy checker to validate results against the correct thresholds
  • The result table to distinguish the two Server scenarios
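
For concreteness, here is a minimal sketch of what per-scenario thresholds might look like in mlperf.conf, assuming a hypothetical `Interactive` scenario key is added (the `ttft_latency`/`tpot_latency` key names follow the existing file; the values shown are illustrative, not ratified numbers):

```
# Existing-style Server thresholds (milliseconds)
llama2-70b.Server.ttft_latency = 2000
llama2-70b.Server.tpot_latency = 200

# Hypothetical tighter set, selected when the user requests the
# interactive variant ("Interactive" as a scenario key does not exist yet)
llama2-70b.Interactive.ttft_latency = 450
llama2-70b.Interactive.tpot_latency = 40
```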

@pgmpablo157321

nvzhihanj avatar May 27 '25 16:05 nvzhihanj

@pgmpablo157321 Could you take a look and help figure out whether these changes can be implemented before the v5.1 code freeze next Tuesday?

hanyunfan avatar May 27 '25 18:05 hanyunfan

@nvzhihanj @hanyunfan I opened a PR with the changes needed for this. No LoadGen changes were needed; we are just adding a scenario in the submission checker and making sure the run for this scenario is a Server run with the correct latencies. The submission will look something like this (a rough sketch of the corresponding check follows the tree):

  • results
    • llama2-70b
      • Interactive (optional)
        • accuracy
        • performance
          • run1
            • mlperf_log_accuracy.json
            • mlperf_log_detail.txt (server run with interactive latencies)
            • mlperf_log_summary.txt (server run with interactive latencies)
      • Server
      • Offline
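
For illustration, the kind of validation being added might look roughly like the sketch below. This is not the actual submission_checker.py code; the summary-file field names and the threshold values are assumptions made up for the example:

```python
# Minimal sketch only -- not the real submission checker. It assumes the
# Interactive directory layout above; field names and targets are made up.
import re
from pathlib import Path

# Placeholder interactive targets in nanoseconds (not ratified numbers)
INTERACTIVE_TARGETS = {"ttft": 450_000_000, "tpot": 40_000_000}

def check_interactive_run(run_dir: Path) -> list[str]:
    """Return a list of problems found in one Interactive performance run."""
    problems = []
    summary = (run_dir / "mlperf_log_summary.txt").read_text()

    # An Interactive result must come from a Server-scenario run.
    m = re.search(r"Scenario\s*:\s*(\w+)", summary)
    if not m or m.group(1) != "Server":
        problems.append("not a Server-scenario run")

    # The run must have been configured with the tighter interactive
    # constraints (the "<name>_latency_ns" fields here are assumptions).
    for name, target_ns in INTERACTIVE_TARGETS.items():
        m = re.search(rf"{name}_latency_ns\s*:\s*(\d+)", summary)
        if not m or int(m.group(1)) > target_ns:
            problems.append(f"{name} constraint missing or looser than {target_ns} ns")
    return problems

if __name__ == "__main__":
    base = Path("results/llama2-70b/Interactive/performance")
    for run in sorted(base.glob("run*")):
        for problem in check_interactive_run(run):
            print(f"{run}: {problem}")
```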

pgmpablo157321 avatar Jun 03 '25 01:06 pgmpablo157321

Pablo will run more tests on it.

hanyunfan avatar Jun 10 '25 16:06 hanyunfan

NVIDIA’s proposal on 6/17:

  • Choice 1: a Datacenter submission must include Offline plus at least one of Server and Interactive
  • Choice 2: a Datacenter submission must include Offline and Server, with Interactive as an optional extra

MLCommons prefers Choice 1; everyone gets one week to think it over.

hanyunfan avatar Jun 20 '25 19:06 hanyunfan

@pgmpablo157321 I think all the interactive parameters and latencies for llama-405B and 8B are missing from mlperf.conf: https://github.com/mlcommons/inference/blob/0a3570efb0309b5581f2831d84c05fe5483b5ef7/loadgen/mlperf.conf#L60

Can you help add them?
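
If it helps, the missing entries would presumably mirror the existing llama2-70b ones, along these lines (the model-name spelling is my guess at the file's convention, and the actual thresholds were still undecided, hence the placeholders):

```
# Hypothetical additions; <TBD> values were not agreed at the time of writing
llama3_1-405b-interactive.Server.ttft_latency = <TBD>
llama3_1-405b-interactive.Server.tpot_latency = <TBD>
llama3_1-8b-interactive.Server.ttft_latency = <TBD>
llama3_1-8b-interactive.Server.tpot_latency = <TBD>
```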

nvzhihanj avatar Jun 25 '25 23:06 nvzhihanj

Also in the existing mlperf.conf, llama2-interactive still looks like a separate benchmark rather than a scenario of llama2-70b. Not sure if we can change it to llama2-70b.interactive.xxx this round: https://github.com/mlcommons/inference/blob/0a3570efb0309b5581f2831d84c05fe5483b5ef7/loadgen/mlperf.conf#L94
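
Concretely, the rename would be something like the following (the first line reflects the current naming convention; the second form is hypothetical and would need LoadGen/checker support for a scenario-qualified key):

```
# today: interactive is a separate benchmark name
llama2-70b-interactive.Server.ttft_latency = <value>

# possible future form: one benchmark, scenario-qualified setting
llama2-70b.interactive.ttft_latency = <value>
```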

nvzhihanj avatar Jun 26 '25 00:06 nvzhihanj

Linking this thread to the discussion in PR #2224

hanyunfan avatar Jun 26 '25 14:06 hanyunfan