`evalscope eval` on gpqa_diamond never finishes
### Self-Check List
Before submitting an issue, please ensure you have completed the following steps:
- [x] I have carefully read the relevant user documentation
- [x] I have reviewed the Frequently Asked Questions
- [x] I have searched and reviewed existing issues to confirm this is not a duplicate problem
### Problem Description
When evaluating grok-2 on the gpqa_diamond dataset using EvalScope, a few requests never finish. They keep retrying indefinitely and the evaluation gets stuck for a very long time.
Only a small subset of queries is affected, but once it happens, the process does not recover or exit.
### EvalScope Version (Required)
```text
$ evalscope --version
1.2.0
```
### Tools Used
- [x] Native / Native framework
- [ ] Opencompass backend
- [ ] VLMEvalKit backend
- [ ] RAGEval backend
- [ ] Perf / Model inference stress testing tool
- [ ] Arena / Arena mode
### Executed Code or Instructions
```bash
evalscope eval \
  --model /models/xai-grok-2 \
  --api-url http://127.0.0.1:30000/v1/chat/completions \
  --api-key EMPTY \
  --eval-type openai_api \
  --datasets gpqa_diamond \
  --eval-batch-size 16 \
  --generation-config '{"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.5}'
```
### Error Log
The run hangs with the last 3 requests pending:

```text
Predicting[gpqa_diamond@default]: 98%|█████▉| 195/198 [24:00<01:03, 21.26s/it]
2025-11-20 08:14:12,655 - openai._base_client - INFO: Retrying request to /chat/completions in 0.959059 seconds
2025-11-20 08:15:02 - evalscope - INFO: Predicting[gpqa_diamond@default]: still processing... pending=2
2025-11-20 08:16:02 - evalscope - INFO: Predicting[gpqa_diamond@default]: still processing... pending=2
2025-11-20 08:17:02 - evalscope - INFO: Predicting[gpqa_diamond@default]: still processing... pending=2
2025-11-20 08:18:02 - evalscope - INFO: Predicting[gpqa_diamond@default]: still processing... pending=2
2025-11-20 08:19:02 - evalscope - INFO: Predicting[gpqa_diamond@default]: still processing... pending=2
2025-11-20 08:20:02 - evalscope - INFO: Predicting[gpqa_diamond@default]: still processing... pending=2
2025-11-20 08:21:02 - evalscope - INFO: Predicting[gpqa_diamond@default]: still processing... pending=2
2025-11-20 08:22:02 - evalscope - INFO: Predicting[gpqa_diamond@default]: still processing... pending=2
2025-11-20 08:23:02 - evalscope - INFO: Predicting[gpqa_diamond@default]: still processing... pending=2
```

It eventually fails with a read timeout:

```text
Predicting[gpqa_diamond@default]: 99%|█████▉| 197/198 [34:00<00:10, 10.36s/it]
raise exc from None
File "/home/gcpuser/.local/share/uv/tools/evalscope/lib/python3.10/site-packages/httpcore/_sync/connection_pool.py", line 236, in handle_request
response = connection.handle_request(
File "/home/gcpuser/.local/share/uv/tools/evalscope/lib/python3.10/site-packages/httpcore/_sync/connection.py", line 103, in handle_request
return self._connection.handle_request(request)
File "/home/gcpuser/.local/share/uv/tools/evalscope/lib/python3.10/site-packages/httpcore/_sync/http11.py", line 136, in handle_request
raise exc
File "/home/gcpuser/.local/share/uv/tools/evalscope/lib/python3.10/site-packages/httpcore/_sync/http11.py", line 106, in handle_request
) = self._receive_response_headers(**kwargs)
File "/home/gcpuser/.local/share/uv/tools/evalscope/lib/python3.10/site-packages/httpcore/_sync/http11.py", line 177, in _receive_response_headers
event = self._receive_event(timeout=timeout)
File "/home/gcpuser/.local/share/uv/tools/evalscope/lib/python3.10/site-packages/httpcore/_sync/http11.py", line 217, in _receive_event
data = self._network_stream.read(
File "/home/gcpuser/.local/share/uv/tools/evalscope/lib/python3.10/site-packages/httpcore/_backends/sync.py", line 126, in read
with map_exceptions(exc_map):
File "/home/gcpuser/miniconda3/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/home/gcpuser/.local/share/uv/tools/evalscope/lib/python3.10/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
raise to_exc(exc) from exc
httpcore.ReadTimeout: timed out
```
### Running Environment
- Operating System:
- Python Version:
### Additional Information
The problem is likely caused by the complexity of these specific evaluation questions, which require longer inference time from the model. The default `timeout` of the OpenAI Python SDK is 600 seconds.

Here are two solutions you can try:

1. **Increase the timeout value:** add `--timeout <seconds>` to allow more time for complex queries:

   ```bash
   evalscope eval \
     --model /models/xai-grok-2 \
     --api-url http://127.0.0.1:30000/v1/chat/completions \
     --api-key EMPTY \
     --eval-type openai_api \
     --datasets gpqa_diamond \
     --eval-batch-size 16 \
     --timeout 1200 \
     --generation-config '{"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.5}'
   ```

2. **Skip errors automatically:** add the `--ignore-errors` flag to skip problematic samples and ensure the evaluation completes:

   ```bash
   evalscope eval \
     --model /models/xai-grok-2 \
     --api-url http://127.0.0.1:30000/v1/chat/completions \
     --api-key EMPTY \
     --eval-type openai_api \
     --datasets gpqa_diamond \
     --eval-batch-size 16 \
     --ignore-errors \
     --generation-config '{"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.5}'
   ```

You can also combine both parameters for better stability. Let me know if this resolves your issue!
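For context on where the 600-second limit comes from, here is a minimal sketch (not EvalScope internals, just the OpenAI Python client on its own) of the default timeout and how a longer one is set; the base URL and key mirror the command above:

```python
import httpx
from openai import OpenAI

# The SDK's default per-request timeout is httpx.Timeout(600.0, connect=5.0).
# A completion that takes longer surfaces as httpcore.ReadTimeout (as in the
# log above), and the client retries, restarting the same long wait each time.
client = OpenAI(
    base_url="http://127.0.0.1:30000/v1",
    api_key="EMPTY",
    timeout=httpx.Timeout(1200.0, connect=5.0),  # allow up to 20 min per request
    max_retries=2,  # the SDK default; each retry waits out the timeout again
)
```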
Thanks, I used `--timeout` together with a short `max_tokens` so the requests finish early.
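For anyone hitting the same hang, a sketch of that combined setup, assuming `max_tokens` is honored inside `--generation-config` (the `2048` cap is illustrative; too small a value can truncate answers on hard GPQA questions):

```bash
evalscope eval \
  --model /models/xai-grok-2 \
  --api-url http://127.0.0.1:30000/v1/chat/completions \
  --api-key EMPTY \
  --eval-type openai_api \
  --datasets gpqa_diamond \
  --eval-batch-size 16 \
  --timeout 1200 \
  --generation-config '{"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.5, "max_tokens": 2048}'
```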