Zhihan Jiang
@attafosu @pgmpablo157321 please review and merge this.
In the v4.0 submission, we found in the **server** log that "result_token_throughput" is not reported properly; most values are at the e-09 scale (@pgmpablo157321 feel free to check...
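For context, values at the e-09 scale are consistent with dividing a token count by a duration measured in nanoseconds instead of seconds. A minimal illustration of that suspicion (purely illustrative numbers, not the actual LoadGen code):

```
# A 10-minute run producing ~3000 output tokens, purely for illustration.
tokens = 3000
duration_ns = 600 * 1_000_000_000        # 600 s expressed in nanoseconds

buggy_throughput = tokens / duration_ns          # ~5e-09 "tokens/sec" -- matches the e-09 scale
fixed_throughput = tokens / (duration_ns / 1e9)  # 5.0 tokens/sec
print(buggy_throughput, fixed_throughput)
```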
As presented in https://docs.google.com/presentation/d/1Y_AKEJ6h1g5k3ntrL7nTazWw3xVDzJ_tjOGkLQ6VDMI/edit?usp=sharing, completed samples per second is a better representation of throughput than the scheduled QPS. @pgmpablo157321 to help implement after the conclusion of v4.0.
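A minimal sketch of the metric in question, assuming we have the completion timestamps of all samples (the function and its input are hypothetical, not LoadGen's API):

```
def completed_samples_per_sec(completion_times_s: list[float]) -> float:
    """Throughput as samples actually completed over the run window."""
    if len(completion_times_s) < 2:
        raise ValueError("need at least two completions to define a window")
    window = max(completion_times_s) - min(completion_times_s)
    return len(completion_times_s) / window

# e.g. five samples completing over ~2 seconds -> 2.5 samples/sec
print(completed_samples_per_sec([0.0, 0.4, 1.1, 1.6, 2.0]))
```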
The current C++ code base follows a C/Fortran coding style, which is a bit stale given that C++14 is used. We would like to clang-format all the C++ code...
Python disallows hyphens ('-') in module names, which makes importing and running such modules very complicated. We should rename the affected folders and modules (e.g. llama2-70b), as the sketch below illustrates...
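A quick illustration of the problem (assuming a llama2-70b module is on sys.path):

```
# A hyphen makes the module name unparsable in an import statement:
#   import llama2-70b          # SyntaxError: the parser sees "llama2 - 70b"
#   from llama2-70b import x   # SyntaxError as well
# The only way in is the awkward importlib route:
import importlib

llama2_70b = importlib.import_module("llama2-70b")
```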
With the increasing number of benchmarks and checks, we have found several issues with the submission checker (https://github.com/mlcommons/inference/blob/master/tools/submission/submission_checker.py):
- The file itself is too long (>3700 lines of code), which makes...
We propose to add a basic unit-test framework (likely pytest) and tests to the inference repo (see the sketch below). Ideally, it should test:
- All configuration (mlperf.conf, user.conf) is valid and working...
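A minimal pytest sketch of the config check, assuming entries follow the "&lt;model&gt;.&lt;scenario&gt;.&lt;key&gt; = &lt;value&gt;" line format (the file locations and the regex here are assumptions, not the proposed final suite):

```
# test_configs.py -- minimal sketch only.
import re
from pathlib import Path

import pytest

CONF_FILES = [Path("mlperf.conf"), Path("user.conf")]  # assumed locations
ENTRY_RE = re.compile(r"^[\w*\-]+(\.[\w*\-]+)*\.\w+\s*=\s*\S+")

@pytest.mark.parametrize("conf", CONF_FILES, ids=str)
def test_conf_entries_are_well_formed(conf):
    for lineno, raw in enumerate(conf.read_text().splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments and blanks
        if not line:
            continue
        assert ENTRY_RE.match(line), f"{conf}:{lineno}: malformed entry {raw!r}"
```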
@pgmpablo157321 One of our submission results for SingleStream showed the wrong result in the final table: it should show the 90th-percentile latency, but it actually shows the 97th-percentile latency.
There are 4 samples in the reference HF output that have no output other than the EOS token.
```
>>> df = pd.read_pickle("06062024_mixtral_15k_v4.pkl")
>>> df[df['tok_ref_output_len'] == 1]
dataset id question input...
```
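If these degenerate samples should be excluded, one way to drop them, reusing the column from the snippet above (a sketch, not a decided fix):

```
import pandas as pd

df = pd.read_pickle("06062024_mixtral_15k_v4.pkl")
# Keep only samples whose reference output contains more than the EOS token.
df_clean = df[df["tok_ref_output_len"] > 1].reset_index(drop=True)
```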
As titled. The required changes might be (a hypothetical sketch of the selection follows this list):
- LoadGen to support server scenarios with more than one set of latency thresholds (TTFT/TPOT)
- User to select the latency scenario based...
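A hypothetical sketch of the user-facing selection; the scenario names and threshold values below are placeholders, not proposed constraints:

```
# Hypothetical named (TTFT, TPOT) threshold sets; values are illustrative.
LATENCY_SCENARIOS = {
    "interactive": {"ttft_ms": 500, "tpot_ms": 50},
    "relaxed":     {"ttft_ms": 2000, "tpot_ms": 200},
}

def select_thresholds(scenario: str) -> dict:
    """Return the (TTFT, TPOT) limits the run should be validated against."""
    if scenario not in LATENCY_SCENARIOS:
        raise ValueError(f"unknown latency scenario: {scenario!r}")
    return LATENCY_SCENARIOS[scenario]

print(select_thresholds("interactive"))
```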