
Merge LLM interactive scenarios into the benchmark (as a separate server scenario, instead of a separate benchmark)

Open nvzhihanj opened this issue 7 months ago • 7 comments

As titled. The required changes might be:

  • LoadGen to support Server scenarios with more than one set of latency thresholds (TTFT/TPOT); a sketch of what this could look like follows this list
  • A user-facing flag (probably in userSettings) to select which latency set a run targets
  • The accuracy checker to validate results against the correct thresholds
  • The result table to distinguish the two Server scenarios
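
For concreteness, here is a minimal sketch of what per-scenario thresholds might look like in mlperf.conf, assuming a hypothetical `Interactive` scenario key is added (the `ttft_latency`/`tpot_latency` key names follow the existing file; the values shown are illustrative, not ratified numbers):

```
# Existing-style Server thresholds (milliseconds)
llama2-70b.Server.ttft_latency = 2000
llama2-70b.Server.tpot_latency = 200

# Hypothetical tighter set, selected when the user requests the
# interactive variant ("Interactive" as a scenario key does not exist yet)
llama2-70b.Interactive.ttft_latency = 450
llama2-70b.Interactive.tpot_latency = 40
```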

@pgmpablo157321

nvzhihanj avatar May 27 '25 16:05 nvzhihanj

@pgmpablo157321 Could you take a look and help figure out whether these changes can be implemented before the v5.1 code freeze next Tuesday?

hanyunfan avatar May 27 '25 18:05 hanyunfan

@nvzhihanj @hanyunfan I opened a PR with the changes needed for this. No LoadGen changes were needed; we are just adding a scenario in the submission checker and making sure the run for this scenario is a Server run with the correct latencies. The submission will look something like this (a rough sketch of the corresponding check follows the tree):

  • results
    • llama2-70b
      • Interactive (optional)
        • accuracy
        • performance
          • run1
            • mlperf_log_accuracy.json
            • mlperf_log_detail.txt (server run with interactive latencies)
            • mlperf_log_summary.txt (server run with interactive latencies)
      • Server
      • Offline
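
For illustration, the kind of validation being added might look roughly like the sketch below. This is not the actual submission_checker.py code; the summary-file field names and the threshold values are assumptions made up for the example:

```python
# Minimal sketch only -- not the real submission checker. It assumes the
# Interactive directory layout above; field names and targets are made up.
import re
from pathlib import Path

# Placeholder interactive targets in nanoseconds (not ratified numbers)
INTERACTIVE_TARGETS = {"ttft": 450_000_000, "tpot": 40_000_000}

def check_interactive_run(run_dir: Path) -> list[str]:
    """Return a list of problems found in one Interactive performance run."""
    problems = []
    summary = (run_dir / "mlperf_log_summary.txt").read_text()

    # An Interactive result must come from a Server-scenario run.
    m = re.search(r"Scenario\s*:\s*(\w+)", summary)
    if not m or m.group(1) != "Server":
        problems.append("not a Server-scenario run")

    # The run must have been configured with the tighter interactive
    # constraints (the "<name>_latency_ns" fields here are assumptions).
    for name, target_ns in INTERACTIVE_TARGETS.items():
        m = re.search(rf"{name}_latency_ns\s*:\s*(\d+)", summary)
        if not m or int(m.group(1)) > target_ns:
            problems.append(f"{name} constraint missing or looser than {target_ns} ns")
    return problems

if __name__ == "__main__":
    base = Path("results/llama2-70b/Interactive/performance")
    for run in sorted(base.glob("run*")):
        for problem in check_interactive_run(run):
            print(f"{run}: {problem}")
```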

pgmpablo157321 avatar Jun 03 '25 01:06 pgmpablo157321

Pablo will run more tests on it.

hanyunfan avatar Jun 10 '25 16:06 hanyunfan

NVIDIA’s proposal on 6/17:

  • Choice 1: a Datacenter submission must include Offline plus at least one of Server and Interactive
  • Choice 2: a Datacenter submission must include Offline and Server, with Interactive as an optional extra

MLCommons prefers Choice 1; everyone gets one week to think it over.

hanyunfan avatar Jun 20 '25 19:06 hanyunfan

@pgmpablo157321 I think all the interactive parameters and latencies for llama-405B and 8B are missing from mlperf.conf: https://github.com/mlcommons/inference/blob/0a3570efb0309b5581f2831d84c05fe5483b5ef7/loadgen/mlperf.conf#L60

Can you help add them?
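
If it helps, the missing entries would presumably mirror the existing llama2-70b ones, along these lines (the model-name spelling is my guess at the file's convention, and the actual thresholds were still undecided, hence the placeholders):

```
# Hypothetical additions; <TBD> values were not agreed at the time of writing
llama3_1-405b-interactive.Server.ttft_latency = <TBD>
llama3_1-405b-interactive.Server.tpot_latency = <TBD>
llama3_1-8b-interactive.Server.ttft_latency = <TBD>
llama3_1-8b-interactive.Server.tpot_latency = <TBD>
```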

nvzhihanj avatar Jun 25 '25 23:06 nvzhihanj

Also in the existing mlperf.conf, llama2-interactive still looks like a separate benchmark rather than a scenario of llama2-70b. Not sure if we can change it to llama2-70b.interactive.xxx this round: https://github.com/mlcommons/inference/blob/0a3570efb0309b5581f2831d84c05fe5483b5ef7/loadgen/mlperf.conf#L94
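
Concretely, the rename would be something like the following (the first line reflects the current naming convention; the second form is hypothetical and would need LoadGen/checker support for a scenario-qualified key):

```
# today: interactive is a separate benchmark name
llama2-70b-interactive.Server.ttft_latency = <value>

# possible future form: one benchmark, scenario-qualified setting
llama2-70b.interactive.ttft_latency = <value>
```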

nvzhihanj avatar Jun 26 '25 00:06 nvzhihanj

Linking this thread to the discussion in PR #2224

hanyunfan avatar Jun 26 '25 14:06 hanyunfan