Zhihan Jiang
Thanks Arjun, I think we are okay with the change as long as it doesn't break the behavior of the existing 4.0 and 4.1 benchmarks. Have you tested the workloads?
@mrmhodak @pgmpablo157321 can we merge this PR? This is blocking #1884.
@arjunsuresh to help^
I believe consolidate_results.py is not needed if the pickle input file already has all the samples (24576). That script is a by-product of the preprocessing that @nv-alicheng uses, IIRC.
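For reference, a minimal way to confirm the pickle already contains the full sample set (a sketch only; the file name and the assumption that it is a pandas DataFrame pickle mirror the llama2-70b OpenOrca setup and may need adjusting):

```python
# Minimal sanity check that the preprocessed pickle already holds all
# 24576 samples; file name / DataFrame layout are assumptions from the
# llama2-70b OpenOrca preprocessing, adjust to your environment.
import pandas as pd

EXPECTED_SAMPLES = 24576
df = pd.read_pickle("open_orca_gpt4_tokenized_llama.sampled_24576.pkl")

print(f"samples in pickle: {len(df)}")
if len(df) == EXPECTED_SAMPLES:
    print("full sample set present -- consolidate_results.py should not be needed")
else:
    print("sample count mismatch -- the pickle may need consolidation or re-preprocessing")
```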
We didn't use vLLM when creating Llama2-70B - feel free to use any version that works
@nv-alicheng to review. @pgmpablo157321 how hard would it be to make interactive a third scenario, but use the same code path as server? If it's too complicated we can live...
@pgmpablo157321 I think all the interactive parameters and latencies for llama-405B and 8B are missing from mlperf.conf: https://github.com/mlcommons/inference/blob/0a3570efb0309b5581f2831d84c05fe5483b5ef7/loadgen/mlperf.conf#L60 Can you help add them?
Also, in the existing mlperf.conf, llama2-interactive still seems to be defined as a separate benchmark. Not sure if we can change it to llama2-70b.interactive.xxx this round: https://github.com/mlcommons/inference/blob/0a3570efb0309b5581f2831d84c05fe5483b5ef7/loadgen/mlperf.conf#L94
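Something along these lines is roughly what I'd expect the missing entries to look like (a sketch only; the section prefixes follow the existing llama2-70b-interactive style, `ttft_latency`/`tpot_latency` match the keys already used for llama2-70b, and the values are placeholders, not agreed targets):

```
# Hypothetical interactive entries -- placeholder values, to be filled in
llama3_1-405b-interactive.Server.target_latency = 0
llama3_1-405b-interactive.Server.use_token_latencies = 1
llama3_1-405b-interactive.Server.ttft_latency = <TODO>
llama3_1-405b-interactive.Server.tpot_latency = <TODO>

llama3_1-8b-interactive.Server.target_latency = 0
llama3_1-8b-interactive.Server.use_token_latencies = 1
llama3_1-8b-interactive.Server.ttft_latency = <TODO>
llama3_1-8b-interactive.Server.tpot_latency = <TODO>
```

If interactive instead becomes a proper third scenario (per the question above), the prefixes would presumably change to something like `llama3_1-405b.Interactive.*` rather than a separate `-interactive` benchmark name.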
Addressed in https://github.com/mlcommons/inference/pull/1978/files
Seems like result_token_per_second is in the summary.txt. Not sure why it's not in the details.txt.
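If it helps debug, here's a trivial way to see which loadgen output file actually carries the token metrics (assuming the standard mlperf_log_summary.txt / mlperf_log_detail.txt names in the current directory):

```python
# Minimal sketch: list every token-related line in the two loadgen logs,
# to see where result_token_per_second (or similar) actually shows up.
from pathlib import Path

for name in ("mlperf_log_summary.txt", "mlperf_log_detail.txt"):
    path = Path(name)
    if not path.exists():
        print(f"{name}: not found")
        continue
    hits = [line.strip() for line in path.read_text().splitlines()
            if "token" in line.lower()]
    print(f"{name}: {len(hits)} token-related lines")
    for line in hits:
        print("  ", line)
```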