`evalscope eval` on gpqa_diamond never finishes
### Self-Check List
Before submitting an issue, please ensure you have completed the following steps:
- [x] I have carefully read the relevant user documentation
- [x] I have reviewed the Frequently Asked Questions
- [x] I have searched and reviewed existing issues to confirm this is not a duplicate problem
### Problem Description
When evaluating grok-2 on the gpqa_diamond dataset using EvalScope, a few requests never finish. They keep retrying indefinitely and the evaluation gets stuck for a very long time.
Only a small subset of queries is affected, but once it happens, the process does not recover or exit.
### EvalScope Version (Required)
```text
$ evalscope --version
1.2.0
```
### Tools Used
- [x] Native / Native framework
- [ ] Opencompass backend
- [ ] VLMEvalKit backend
- [ ] RAGEval backend
- [ ] Perf / Model inference stress testing tool
- [ ] Arena / Arena mode
### Executed Code or Instructions
```bash
evalscope eval \
  --model /models/xai-grok-2 \
  --api-url http://127.0.0.1:30000/v1/chat/completions \
  --api-key EMPTY \
  --eval-type openai_api \
  --datasets gpqa_diamond \
  --eval-batch-size 16 \
  --generation-config '{"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.5}'
```
### Error Log
The run hangs with the last 3 requests pending:

```text
Predicting[gpqa_diamond@default]: 98%|█████▉| 195/198 [24:00<01:03, 21.26s/it]
2025-11-20 08:14:12,655 - openai._base_client - INFO: Retrying request to /chat/completions in 0.959059 seconds
2025-11-20 08:15:02 - evalscope - INFO: Predicting[gpqa_diamond@default]: still processing... pending=2
2025-11-20 08:16:02 - evalscope - INFO: Predicting[gpqa_diamond@default]: still processing... pending=2
2025-11-20 08:17:02 - evalscope - INFO: Predicting[gpqa_diamond@default]: still processing... pending=2
2025-11-20 08:18:02 - evalscope - INFO: Predicting[gpqa_diamond@default]: still processing... pending=2
2025-11-20 08:19:02 - evalscope - INFO: Predicting[gpqa_diamond@default]: still processing... pending=2
2025-11-20 08:20:02 - evalscope - INFO: Predicting[gpqa_diamond@default]: still processing... pending=2
2025-11-20 08:21:02 - evalscope - INFO: Predicting[gpqa_diamond@default]: still processing... pending=2
2025-11-20 08:22:02 - evalscope - INFO: Predicting[gpqa_diamond@default]: still processing... pending=2
2025-11-20 08:23:02 - evalscope - INFO: Predicting[gpqa_diamond@default]: still processing... pending=2
```

It eventually fails with a read timeout:

```text
Predicting[gpqa_diamond@default]: 99%|█████▉| 197/198 [34:00<00:10, 10.36s/it]
raise exc from None
File "/home/gcpuser/.local/share/uv/tools/evalscope/lib/python3.10/site-packages/httpcore/_sync/connection_pool.py", line 236, in handle_request
response = connection.handle_request(
File "/home/gcpuser/.local/share/uv/tools/evalscope/lib/python3.10/site-packages/httpcore/_sync/connection.py", line 103, in handle_request
return self._connection.handle_request(request)
File "/home/gcpuser/.local/share/uv/tools/evalscope/lib/python3.10/site-packages/httpcore/_sync/http11.py", line 136, in handle_request
raise exc
File "/home/gcpuser/.local/share/uv/tools/evalscope/lib/python3.10/site-packages/httpcore/_sync/http11.py", line 106, in handle_request
) = self._receive_response_headers(**kwargs)
File "/home/gcpuser/.local/share/uv/tools/evalscope/lib/python3.10/site-packages/httpcore/_sync/http11.py", line 177, in _receive_response_headers
event = self._receive_event(timeout=timeout)
File "/home/gcpuser/.local/share/uv/tools/evalscope/lib/python3.10/site-packages/httpcore/_sync/http11.py", line 217, in _receive_event
data = self._network_stream.read(
File "/home/gcpuser/.local/share/uv/tools/evalscope/lib/python3.10/site-packages/httpcore/_backends/sync.py", line 126, in read
with map_exceptions(exc_map):
File "/home/gcpuser/miniconda3/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/home/gcpuser/.local/share/uv/tools/evalscope/lib/python3.10/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
raise to_exc(exc) from exc
httpcore.ReadTimeout: timed out
```
### Running Environment
- Operating System:
- Python Version:
### Additional Information
The problem is likely caused by the complexity of these specific evaluation questions, which require longer inference time from the model. The default `timeout` of the OpenAI Python SDK is 600 seconds.

Here are two solutions you can try:

1. **Increase the timeout value:** add `--timeout <seconds>` to allow more time for complex queries:

   ```bash
   evalscope eval \
     --model /models/xai-grok-2 \
     --api-url http://127.0.0.1:30000/v1/chat/completions \
     --api-key EMPTY \
     --eval-type openai_api \
     --datasets gpqa_diamond \
     --eval-batch-size 16 \
     --timeout 1200 \
     --generation-config '{"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.5}'
   ```

2. **Skip errors automatically:** add the `--ignore-errors` flag to skip problematic samples and ensure the evaluation completes:

   ```bash
   evalscope eval \
     --model /models/xai-grok-2 \
     --api-url http://127.0.0.1:30000/v1/chat/completions \
     --api-key EMPTY \
     --eval-type openai_api \
     --datasets gpqa_diamond \
     --eval-batch-size 16 \
     --ignore-errors \
     --generation-config '{"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.5}'
   ```

You can also combine both parameters for better stability. Let me know if this resolves your issue!
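For context on where the 600-second limit comes from, here is a minimal sketch (not EvalScope internals, just the OpenAI Python client on its own) of the default timeout and how a longer one is set; the base URL and key mirror the command above:

```python
import httpx
from openai import OpenAI

# The SDK's default per-request timeout is httpx.Timeout(600.0, connect=5.0).
# A completion that takes longer surfaces as httpcore.ReadTimeout (as in the
# log above), and the client retries, restarting the same long wait each time.
client = OpenAI(
    base_url="http://127.0.0.1:30000/v1",
    api_key="EMPTY",
    timeout=httpx.Timeout(1200.0, connect=5.0),  # allow up to 20 min per request
    max_retries=2,  # the SDK default; each retry waits out the timeout again
)
```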
Thanks, I used `--timeout` together with a short `max_tokens` so the requests finish early.
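For anyone hitting the same hang, a sketch of that combined setup, assuming `max_tokens` is honored inside `--generation-config` (the `2048` cap is illustrative; too small a value can truncate answers on hard GPQA questions):

```bash
evalscope eval \
  --model /models/xai-grok-2 \
  --api-url http://127.0.0.1:30000/v1/chat/completions \
  --api-key EMPTY \
  --eval-type openai_api \
  --datasets gpqa_diamond \
  --eval-batch-size 16 \
  --timeout 1200 \
  --generation-config '{"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.5, "max_tokens": 2048}'
```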