Zhihan Jiang
@attafosu @pgmpablo157321 please review and merge this.
In the v4.0 submission, we found in the **server** log that "result_token_throughput" is not reported properly; most values are at the e-09 scale (@pgmpablo157321 feel free to check...
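For context, values at the e-09 scale are consistent with dividing a token count by a duration measured in nanoseconds instead of seconds. A minimal illustration of that suspicion (purely illustrative numbers, not the actual LoadGen code):

```
# A 10-minute run producing ~3000 output tokens, purely for illustration.
tokens = 3000
duration_ns = 600 * 1_000_000_000        # 600 s expressed in nanoseconds

buggy_throughput = tokens / duration_ns          # ~5e-09 "tokens/sec" -- matches the e-09 scale
fixed_throughput = tokens / (duration_ns / 1e9)  # 5.0 tokens/sec
print(buggy_throughput, fixed_throughput)
```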
As presented in https://docs.google.com/presentation/d/1Y_AKEJ6h1g5k3ntrL7nTazWw3xVDzJ_tjOGkLQ6VDMI/edit?usp=sharing, completed samples per second is a better representation of throughput than the scheduled QPS. @pgmpablo157321 to help implement after the conclusion of v4.0.
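A minimal sketch of the metric in question, assuming we have the completion timestamps of all samples (the function and its input are hypothetical, not LoadGen's API):

```
def completed_samples_per_sec(completion_times_s: list[float]) -> float:
    """Throughput as samples actually completed over the run window."""
    if len(completion_times_s) < 2:
        raise ValueError("need at least two completions to define a window")
    window = max(completion_times_s) - min(completion_times_s)
    return len(completion_times_s) / window

# e.g. five samples completing over ~2 seconds -> 2.5 samples/sec
print(completed_samples_per_sec([0.0, 0.4, 1.1, 1.6, 2.0]))
```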
The current C++ code base follows a C/Fortran coding style, which is a bit stale given that C++14 is used. We would like to clang-format all the C++ code...
Python disallows hyphens ('-') in module names, which makes importing and running such modules very complicated. We should rename the affected folders and modules (e.g. llama2-70b), as the sketch below illustrates...
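A quick illustration of the problem (assuming a llama2-70b module is on sys.path):

```
# A hyphen makes the module name unparsable in an import statement:
#   import llama2-70b          # SyntaxError: the parser sees "llama2 - 70b"
#   from llama2-70b import x   # SyntaxError as well
# The only way in is the awkward importlib route:
import importlib

llama2_70b = importlib.import_module("llama2-70b")
```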
With the increasing number of benchmarks and checks, we have found several issues with the submission checker (https://github.com/mlcommons/inference/blob/master/tools/submission/submission_checker.py):
- The file itself is too long (>3700 lines of code), which makes...
We propose to add a basic unit-test framework (likely pytest) and tests to the inference repo (see the sketch below). Ideally, it should test:
- All configuration (mlperf.conf, user.conf) is valid and working...
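A minimal pytest sketch of the config check, assuming entries follow the "&lt;model&gt;.&lt;scenario&gt;.&lt;key&gt; = &lt;value&gt;" line format (the file locations and the regex here are assumptions, not the proposed final suite):

```
# test_configs.py -- minimal sketch only.
import re
from pathlib import Path

import pytest

CONF_FILES = [Path("mlperf.conf"), Path("user.conf")]  # assumed locations
ENTRY_RE = re.compile(r"^[\w*\-]+(\.[\w*\-]+)*\.\w+\s*=\s*\S+")

@pytest.mark.parametrize("conf", CONF_FILES, ids=str)
def test_conf_entries_are_well_formed(conf):
    for lineno, raw in enumerate(conf.read_text().splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments and blanks
        if not line:
            continue
        assert ENTRY_RE.match(line), f"{conf}:{lineno}: malformed entry {raw!r}"
```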
@pgmpablo157321 One of our submission results for SingleStream showed the wrong result in the final table: it should show the 90th-percentile latency, but it actually shows the 97th-percentile latency.
There are 4 samples in the reference HF output that have no output other than the EOS token.
```
>>> df = pd.read_pickle("06062024_mixtral_15k_v4.pkl")
>>> df[df['tok_ref_output_len'] == 1]
dataset id question input...
```
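If these degenerate samples should be excluded, one way to drop them, reusing the column from the snippet above (a sketch, not a decided fix):

```
import pandas as pd

df = pd.read_pickle("06062024_mixtral_15k_v4.pkl")
# Keep only samples whose reference output contains more than the EOS token.
df_clean = df[df["tok_ref_output_len"] > 1].reset_index(drop=True)
```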
As titled. The required changes might be (a hypothetical sketch of the selection follows this list):
- LoadGen to support server scenarios with more than one set of latency thresholds (TTFT/TPOT)
- User to select the latency scenario based...
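A hypothetical sketch of the user-facing selection; the scenario names and threshold values below are placeholders, not proposed constraints:

```
# Hypothetical named (TTFT, TPOT) threshold sets; values are illustrative.
LATENCY_SCENARIOS = {
    "interactive": {"ttft_ms": 500, "tpot_ms": 50},
    "relaxed":     {"ttft_ms": 2000, "tpot_ms": 200},
}

def select_thresholds(scenario: str) -> dict:
    """Return the (TTFT, TPOT) limits the run should be validated against."""
    if scenario not in LATENCY_SCENARIOS:
        raise ValueError(f"unknown latency scenario: {scenario!r}")
    return LATENCY_SCENARIOS[scenario]

print(select_thresholds("interactive"))
```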