Reproducibility Issues
Hello! I have recently been attempting to reproduce the results shown in Table 2 of the LongBench v2 paper.
After cloning your repo and performing the setup, I evaluated a few OpenAI models and command-r-plus-08-2024 via their respective APIs. While testing, I noticed noticeable deltas between the scores reported in the paper and those I obtained by running the evaluation pipeline myself. The table below summarizes the deltas.
Consolidated Model Performance Comparison
| Model | Score Type | Overall | Easy | Hard | Short | Medium | Long |
|---|---|---|---|---|---|---|---|
| GPT-4o-mini-2024-07-18 | Paper | 29.3 | 31.1 | 28.2 | 31.8 | 28.6 | 26.2 |
| GPT-4o-mini-2024-07-18 | Local | 28.7 | 31.2 | 27.1 | 30.6 | 29.0 | 25.0 |
| GPT-4o-mini-2024-07-18 | Delta | -0.6 | +0.1 | -1.1 | -1.2 | +0.4 | -1.2 |
| c4ai-command-r-plus-08-2024 | Paper | 27.8 | 30.2 | 26.4 | 36.7 | 23.7 | 21.3 |
| c4ai-command-r-plus-08-2024 | Local | 27.7 | 27.1 | 28.1 | 33.9 | 23.4 | 25.9 |
| c4ai-command-r-plus-08-2024 | Delta | -0.1 | -3.1 | +1.7 | -2.8 | -0.3 | +4.6 |
| GPT-4o-2024-11-20 | Paper | 46.0 | 50.8 | 43.0 | 47.5 | 47.9 | 39.8 |
| GPT-4o-2024-11-20 | Local | 47.2 | 51.0 | 44.8 | 47.8 | 49.5 | 41.7 |
| GPT-4o-2024-11-20 | Delta | +1.2 | +0.2 | +1.8 | +0.3 | +1.6 | +1.9 |
| GPT-4o-2024-08-06 | Paper | 50.1 | 57.4 | 45.6 | 53.3 | 52.4 | 40.2 |
| GPT-4o-2024-08-06 | Local | 49.0 | 56.8 | 44.2 | 54.4 | 49.1 | 39.8 |
| GPT-4o-2024-08-06 | Delta | -1.1 | -0.6 | -1.4 | +1.1 | -3.3 | -0.4 |
Note: Delta values are calculated as (Local - Paper), so positive values indicate higher local scores, and negative values indicate higher paper scores.
Also worth noting: the c4ai-command-r-plus-08-2024 run was performed by calling command-r-plus-08-2024 through the Cohere API rather than by running the model locally.
Are these deltas already known to the LongBench v2 team? I was unable to find any mention of this kind of reproducibility issue in other issues or in the paper, so it may be isolated to my setup. I would appreciate any help in tracking this down, thank you!
The randomness here is not caused by the random seed alone.
If not explicitly set, the default random seed is 0, so it is expected that the results differ from those obtained with the seed set to 42, as you mentioned.
However, even with the same seed, two runs can still produce inconsistent results. Under the default settings, the LongBench v2 code uses multithreading and caching. With multithreaded requests, the call order can differ each time the experiment runs, and this difference in call order leads to non-deterministic behavior even with a fixed seed, especially when sampling with temperature=0.1. Caching also affects the results: if a previous run saved some result files, they are read again in the next run, which changes the order of the remaining calls.
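To make the multithreading point concrete, here is a minimal, self-contained Python sketch (not the LongBench v2 code; `fake_model_call` is just a stand-in for an API request) showing how thread scheduling alone can change which random draw a given prompt receives, even with a fixed seed:

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

random.seed(42)  # fixed seed, but not sufficient on its own

def fake_model_call(prompt: str) -> float:
    # Stand-in for a temperature=0.1 API call: variable latency means the
    # worker threads reach the shared RNG in a different order on each run.
    time.sleep(random.uniform(0.0, 0.01))
    return random.random()

prompts = [f"question {i}" for i in range(8)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves output order, but which random draw each prompt
    # receives depends on thread scheduling, so per-prompt results can
    # change between runs even though the seed is identical.
    results = list(pool.map(fake_model_call, prompts))

print(results)
```

Running this a few times can yield different per-prompt values, which mirrors how the request order affects sampled outputs in a multithreaded evaluation.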
My suggestion is to fix the random seed for each experiment. Before starting each experiment, shut down the previously running vLLM server and restart it. In the experiment setup, use --n_proc=1 and avoid reusing cached files. With this configuration, even when using temperature=0.1 (non-greedy decoding), the results are consistent across multiple runs in my experimental setting; see the sketch below.
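For reference, a minimal sketch of that deterministic setup, assuming a hypothetical `call_model` helper and results directory (neither is the repo's actual API):

```python
import os
import random
import shutil

SEED = 42
RESULTS_DIR = "results/run_seed42"  # hypothetical cache/output directory

def call_model(prompt: str, temperature: float = 0.1) -> str:
    # Placeholder for one request to the freshly restarted vLLM server or API.
    return f"answer to: {prompt}"

def evaluate(prompts):
    random.seed(SEED)  # fix the seed before any sampling happens

    # Clear cached result files so a previous run cannot change the call order.
    shutil.rmtree(RESULTS_DIR, ignore_errors=True)
    os.makedirs(RESULTS_DIR, exist_ok=True)

    # Sequential requests (the equivalent of --n_proc=1): every run issues the
    # same calls in the same order.
    outputs = []
    for i, prompt in enumerate(prompts):
        out = call_model(prompt)
        with open(os.path.join(RESULTS_DIR, f"{i}.txt"), "w") as f:
            f.write(out)
        outputs.append(out)
    return outputs

print(evaluate(["question 1", "question 2"]))
```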