
Improving benchmarking scripts with real prompts in heterogeneous GPU story

nwangfw opened this issue 1 week ago · 0 comments

🐛 Describe the bug

In our current GPU benchmarking scripts, we always use the prompt "Hi Hi Hi ..." to test model performance. The deepseek-coder-7b model always keeps generating until it hits the maximum length. Therefore, we can produce the desired input length simply by changing the number of "Hi" tokens in the request, and use the max-length parameter in the query to get the desired output length.
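For reference, the current strategy looks roughly like the sketch below. This is a minimal illustration, not the actual script: the endpoint URL, model name, and `benchmark_request` helper are placeholders, assuming an OpenAI-compatible completions API such as the one vLLM serves.

```python
import requests

def benchmark_request(input_len: int, output_len: int,
                      url: str = "http://localhost:8000/v1/completions",
                      model: str = "deepseek-coder-7b") -> str:
    # Input length is controlled by repeating "Hi"; output length is
    # capped via max_tokens, which only yields the desired length if
    # the model keeps generating until the cap is hit.
    payload = {
        "model": model,
        "prompt": " ".join(["Hi"] * input_len),
        "max_tokens": output_len,
        "temperature": 0.0,
    }
    resp = requests.post(url, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]
```

This only controls the output length when the model generates up to the cap; it breaks as soon as a model emits an early stop, which is exactly what we observe below.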


However, we found that this method does not work with the 33b model, which returns a very short response to such a prompt, so our current benchmarking strategy no longer works.

We need to improve our existing benchmarking script to make it general enough to work with any model. The current idea is:

  1. Create a dataset, send every prompt in it to the model, and record the corresponding response lengths.
  2. Write a program that filters prompts with different input/output-length patterns and use the filtered prompts for benchmarking tests.
  3. Automate the above process and run it before our current benchmarking script (see the sketch after this list).
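A rough sketch of steps 1 and 2 is below. Everything specific here is an assumption for illustration only: the dataset file, endpoint, `usage` token-count fields (present in OpenAI-compatible responses), and the bucket boundaries are not project decisions.

```python
import json
import requests

URL = "http://localhost:8000/v1/completions"  # assumed endpoint
MODEL = "deepseek-coder-33b"

def profile_prompts(prompts):
    """Step 1: send each prompt and record its natural response length."""
    records = []
    for prompt in prompts:
        resp = requests.post(URL, json={
            "model": MODEL,
            "prompt": prompt,
            "max_tokens": 4096,   # high cap so the model stops on its own
            "temperature": 0.0,
        }, timeout=300).json()
        # Token counts from the OpenAI-compatible "usage" block.
        records.append({
            "prompt": prompt,
            "input_len": resp["usage"]["prompt_tokens"],
            "output_len": resp["usage"]["completion_tokens"],
        })
    return records

def filter_by_pattern(records, in_range, out_range):
    """Step 2: keep prompts that fall in a target input/output-length bucket."""
    lo_in, hi_in = in_range
    lo_out, hi_out = out_range
    return [r for r in records
            if lo_in <= r["input_len"] <= hi_in
            and lo_out <= r["output_len"] <= hi_out]

if __name__ == "__main__":
    with open("dataset.jsonl") as f:  # assumed one {"prompt": ...} per line
        prompts = [json.loads(line)["prompt"] for line in f]
    records = profile_prompts(prompts)
    # Example: a short-input / long-output bucket for benchmarking.
    bucket = filter_by_pattern(records, (1, 128), (512, 2048))
    with open("filtered_prompts.json", "w") as f:
        json.dump(bucket, f, indent=2)
```

Step 3 would wrap this in the existing benchmark entrypoint so the filtered prompts are regenerated per model before each run.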

Steps to Reproduce


Expected behavior

We expect to use real prompts with different input/output-length patterns for benchmarking tests.

Environment

- LLM used: deepseek-coder-33b

nwangfw · Feb 20 '25 22:02