Anton Lokhmotov


I'm a bit concerned about enabling these tests for Edge systems. Looking at [the v4.0 Edge results](https://mlcommons.org/benchmarks/inference-edge), the SingleStream latency ranged from 2 to 13 seconds per sample. LoadGen seems...

For Datacenter, the Offline throughput ranged from 1.18 QPS to 13.71 QPS. That's up to 75 minutes for a single Performance run.
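As a back-of-the-envelope check (a sketch only: the query count below is a hypothetical placeholder, since the actual number depends on the benchmark's dataset and LoadGen settings; run duration scales as queries / QPS):

```python
# Rough Offline run duration: duration = query_count / throughput.
query_count = 5_000  # hypothetical placeholder, not the benchmark's actual minimum
for qps in (1.18, 13.71):  # the slowest and fastest Offline results quoted above
    print(f"{qps:5.2f} QPS -> {query_count / qps / 60:5.1f} minutes")
```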

Despite us not having decided on this issue, the submission checker already complains about missing TEST04 and TEST05 when the main results and TEST01 are present. I've done a little...

> > > Provided as a log for CUDA devices

Uh? How can an auto-scaler use a log message to "determine the pod autoscaling...

> The `gen_len` here is not the tokens but the characters, I believe. Llama2-70b computes the [`gen_tok_len`](https://github.com/mlcommons/inference/blob/master/language/llama2-70b/evaluate-accuracy.py#L107), which is then used to compute the tokens per sample.

Thanks @attafosu. For...
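To illustrate the character/token distinction, a minimal sketch (the tokenizer checkpoint here is an assumption for illustration; the linked `evaluate-accuracy.py` does the actual accounting):

```python
from transformers import AutoTokenizer

# Hypothetical checkpoint for illustration; the reference script loads its own tokenizer.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-chat-hf")

text = "The quick brown fox jumps over the lazy dog."
gen_len = len(text)                                                  # character count
gen_tok_len = len(tokenizer.encode(text, add_special_tokens=False))  # token count
print(gen_len, gen_tok_len)  # the character count is several times the token count
```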

The main question remains: Why is the maximum number of output tokens fixed at 128? From what we see, the model "wants" to "say" more in practically every case, but...

> From the summary, there's about 4% of ground truth lengths > 128

It looks like close to 100% of generated lengths are > 128, that is, the model is...
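A quick way to check the truncation rate (a sketch: the file name and format are hypothetical, assuming per-sample generated token counts have been dumped somewhere):

```python
import json

# Hypothetical dump of per-sample generated token counts (gen_tok_len values).
with open("gen_tok_lens.json") as f:
    gen_tok_lens = json.load(f)

capped = sum(1 for n in gen_tok_lens if n >= 128)
print(f"{capped / len(gen_tok_lens):.1%} of generated outputs hit the 128-token cap")
```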

Perhaps [this](https://www.linkedin.com/posts/tigranbayburtsyan_if-you-havent-tried-threatening-llms-in-activity-7328882173820792833-5Sie/) might be helpful?

> If you haven't tried "threatening" LLMs in system prompts, then you should!

@sahelib25 Can we please try the reference with max tokens set to e.g. 256?
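Something along these lines, as a sketch only (the model id is a hypothetical stand-in, and where the cap lives in the reference implementation may differ):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model id; the reference implementation wires this up itself.
model_id = "meta-llama/Llama-2-70b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Summarise the following article: ...", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)  # raised from the fixed 128
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```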