Issues by Zhihan Jiang (13 results)

@attafosu @pgmpablo157321 please review and merge this.

In the v4.0 submission, we found in the **server** log that "result_token_throughput" is not reported properly: most values are on the order of 1e-09 (@pgmpablo157321 feel free to check...
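One plausible mechanism for values at that scale, assuming LoadGen's internal timestamps are in nanoseconds (an assumption here, not something confirmed in the issue), is a missing unit conversion when computing the throughput. A minimal sketch with invented numbers and variable names:

```python
# Hypothetical reconstruction of a missing unit conversion; names and
# numbers are illustrative, not taken from the actual LoadGen source.
tokens_completed = 1_000_000              # total output tokens in the run
duration_ns = 600 * 1_000_000_000         # 600 s run, ns timestamps assumed

buggy = tokens_completed / duration_ns            # orders of magnitude too small
fixed = tokens_completed / (duration_ns / 1e9)    # ~1666.7 tokens/s

print(buggy, fixed)
```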

As presented in https://docs.google.com/presentation/d/1Y_AKEJ6h1g5k3ntrL7nTazWw3xVDzJ_tjOGkLQ6VDMI/edit?usp=sharing, completed samples per second is a better representation of throughput than the scheduled QPS. @pgmpablo157321 to help implement after the conclusion of v4.0.
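A minimal sketch of the proposed metric, assuming per-sample completion timestamps are available from the log (the field format here is hypothetical, not the actual LoadGen log schema):

```python
# Minimal sketch: throughput as samples completed over the span of the run,
# from per-sample completion timestamps in nanoseconds (assumed format).
def completed_samples_per_sec(completion_ts_ns: list[int]) -> float:
    span_s = (max(completion_ts_ns) - min(completion_ts_ns)) / 1e9
    return len(completion_ts_ns) / span_s

# Example: 3 samples completed over 2 seconds -> 1.5 samples/s
print(completed_samples_per_sec([0, 1_000_000_000, 2_000_000_000]))
```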

The current C++ code base follows a C/Fortran coding style, which is a bit stale given that C++14 is used. We would like to clang-format all the C++ code...
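A minimal sketch of how the reformat could be applied across the tree, assuming clang-format is on PATH and a .clang-format file is checked in at the repo root (paths, extensions, and style choice are all assumptions):

```python
# Minimal sketch: run clang-format in-place over all C++ sources.
# Assumes clang-format is installed and a .clang-format file at the root.
import pathlib
import subprocess

for ext in ("*.cc", "*.cpp", "*.h", "*.hpp"):
    for path in pathlib.Path(".").rglob(ext):
        subprocess.run(["clang-format", "-i", str(path)], check=True)
```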

Python disallows hyphens ('-') in module names, which makes importing and running these modules very complicated. We should change the naming of the folders and modules (e.g. llama2-70b)...
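To illustrate the pain point: `import llama2-70b` is a syntax error, so until the directories are renamed (e.g. to llama2_70b), the only way in is a string-based workaround such as importlib. A sketch, assuming the hyphenated directory is importable as a package (the entry module name is hypothetical):

```python
# A hyphenated name cannot appear in a normal import statement:
#   import llama2-70b          # SyntaxError: invalid syntax
#   from llama2-70b import x   # SyntaxError: invalid syntax
# String-based workarounds are the only way in (fragile, assumes the
# repo layout makes "llama2-70b" resolvable as a package):
import importlib

mod = importlib.import_module("llama2-70b.main")  # hypothetical entry module
```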

inference v5.0

With the increasing number of benchmarks and checks, we have found several issues with the submission checker (https://github.com/mlcommons/inference/blob/master/tools/submission/submission_checker.py): - The file itself is too long (>3,700 lines of code), which makes...

We propose adding a basic unit test framework (likely pytest) and tests to the inference repo. Ideally, it should test: - All configurations (mlperf.conf, user.conf) are valid and working...
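A minimal sketch of what such a test could look like, assuming the conf files use the `key = value` line format that mlperf.conf uses today (the parser and paths here are illustrative, not the repo's):

```python
# Minimal pytest sketch: every non-comment line in the conf files must
# parse as "<key> = <value>". Paths are assumptions.
import pathlib
import pytest

CONF_FILES = ["mlperf.conf", "user.conf"]

@pytest.mark.parametrize("conf", CONF_FILES)
def test_conf_lines_parse(conf):
    for line in pathlib.Path(conf).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, sep, value = line.partition("=")
        assert sep == "=", f"malformed line in {conf}: {line!r}"
        assert key.strip() and value.strip()
```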

postmortem 4.1

@pgmpablo157321 One of our SingleStream submission results shows the wrong value in the final table: it should show the 90th-percentile latency, but it actually shows the 97th-percentile latency. ![image](https://github.com/user-attachments/assets/bcdc1d35-7cfb-46d1-a274-86b65fbcb95c) ![image](https://github.com/user-attachments/assets/350add63-3787-4067-b686-32f2c487cb4d)
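As a plain illustration of why the distinction matters (a simple numpy percentile over synthetic data, not LoadGen's actual early-stopping computation):

```python
# Illustration only: the 90th and 97th percentiles of the same latency
# samples can differ substantially, so reporting the wrong one matters.
import numpy as np

latencies_ms = np.random.default_rng(0).lognormal(mean=2.0, sigma=0.5, size=10_000)
p90 = np.percentile(latencies_ms, 90)
p97 = np.percentile(latencies_ms, 97)
print(f"p90={p90:.2f} ms, p97={p97:.2f} ms")
```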

There are 4 samples in the reference HF output that have no output other than the EOS token.
```
>>> import pandas as pd
>>> df = pd.read_pickle("06062024_mixtral_15k_v4.pkl")
>>> df[df['tok_ref_output_len'] == 1]
dataset id question input...
```

inference v5.0

As titled. The required changes might be: - LoadGen to support Server scenarios with more than one set of latency thresholds (TTFT/TPOT) - User to select the latency scenario based...
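A minimal sketch of the shape this could take, with hypothetical threshold names and values and a selection helper (none of this reflects actual LoadGen APIs or any agreed constraints):

```python
# Hypothetical sketch: multiple named (TTFT, TPOT) threshold sets for the
# Server scenario, with the user selecting one. All values are invented.
LATENCY_PROFILES_MS = {
    "interactive": {"ttft": 500, "tpot": 50},
    "batch":       {"ttft": 2000, "tpot": 200},
}

def select_profile(name: str) -> dict:
    """Return the (TTFT, TPOT) thresholds for the chosen latency scenario."""
    return LATENCY_PROFILES_MS[name]

print(select_profile("interactive"))
```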