GenAIExamples
[CodeGen] Aligned the output format and fixed acc benchmark issues.
Description
- Fixed the output format issue.
- Validated the acc benchmark pipeline.
- Refined README
Issues
- Fixes https://github.com/opea-project/GenAIEval/issues/292
Type of change
List the type of change below. Please delete options that are not relevant.
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds new functionality)
- [ ] Breaking change (fix or feature that would break existing design and interface)
- [ ] Others (enhancement, documentation, validation, etc.)
Dependencies
N/A
Tests
Test screenshots:
Dependency Review
✅ No vulnerabilities or license issues found.
Scanned Files
None
Ready for merge. @lvliang-intel @ZePan110
How can I try this fix?
You can apply the change to CodeGen/codegen.py from this PR to your code, rebuild the megaservice docker image (opea/codegen:latest), restart the services with docker compose, and try again.
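The steps above can be sketched as shell commands. This is a hedged sketch: the build context, Dockerfile path, and compose directory are assumptions based on a typical GenAIExamples CodeGen layout, so adjust them to your checkout and platform.

```shell
# Sketch only -- paths and the compose directory below are assumptions,
# not taken from this PR. Adjust to your checkout and platform.
cd GenAIExamples/CodeGen

# Rebuild the megaservice image so it contains the patched codegen.py
docker build -t opea/codegen:latest -f Dockerfile .

# Restart the deployment so the new image is picked up
# (example compose directory; use the one matching your hardware)
cd docker_compose/intel/hpu/gaudi
docker compose down
docker compose up -d
```

After the services come back up, re-run the accuracy benchmark against the CodeGen endpoint to confirm the output format fix.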
I've verified that this PR fixes the original output format issue, and the accuracy benchmark now runs smoothly to the end for Qwen/Qwen2.5-Coder-7B-Instruct and Qwen/CodeQwen1.5-7B-Chat. But there are still some issues requiring fixes:
- The `pass@1` score of `Qwen/Qwen2.5-Coder-7B-Instruct` (our default model for docker compose deployment) on the HumanEval task via the CodeGen service endpoint is only 18%-20%. But according to the official Qwen2.5-Coder family blog, the HumanEval score for `Qwen/Qwen2.5-Coder-7B-Instruct` is 88.4%, while the one for `Qwen/Qwen2.5-Coder-32B-Instruct` is 92.7%.
- I cannot reproduce the result for `Qwen/Qwen2.5-Coder-32B-Instruct` here: it does not finish in 10 minutes and exits with a timeout error during the acc benchmark at only 48% progress. After increasing the response timeout threshold from 600s to 3600s in L97, all 164 programming requests complete, but result parsing still errors out with `json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)`. Log in `llm-codegen-vllm-server`:
```
[2025-05-28 09:51:15,005] [ ERROR] - llm - Error during LLM invocation: Request timed out.
INFO: 172.18.0.1:42952 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 72, in map_httpcore_exceptions
    yield
  File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 377, in handle_async_request
    resp = await self._pool.handle_async_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 256, in handle_async_request
    raise exc from None
  File "/usr/local/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 236, in handle_async_request
    response = await connection.handle_async_request(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_async/connection.py", line 103, in handle_async_request
    return await self._connection.handle_async_request(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_async/http11.py", line 136, in handle_async_request
    raise exc
  File "/usr/local/lib/python3.11/site-packages/httpcore/_async/http11.py", line 106, in handle_async_request
    ) = await self._receive_response_headers(**kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_async/http11.py", line 177, in _receive_response_headers
    event = await self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_async/http11.py", line 217, in _receive_event
    data = await self._network_stream.read(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 32, in read
    with map_exceptions(exc_map):
  File "/usr/local/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ReadTimeout
...
  File "/usr/local/lib/python3.11/site-packages/openai/_base_client.py", line 1595, in _request
    raise APITimeoutError(request=request) from err
openai.APITimeoutError: Request timed out.
```
- For `Qwen/CodeQwen1.5-7B-Chat`, which is the reference model in the README of the codegen benchmark, the output score (76.2%) is higher than the reference.
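The `JSONDecodeError: Expecting value: line 1 column 1 (char 0)` reported above is consistent with the benchmark calling `json.loads` on an empty or non-JSON body, such as the 500 error response left behind after the upstream request times out. A minimal defensive-parsing sketch of that idea follows; it is not the benchmark's actual parser, and all names here are hypothetical:

```python
import json

def parse_completion_body(body: str):
    """Decode a service response body, tolerating empty or non-JSON payloads.

    Returns the decoded object, or None when the body is empty or not JSON
    (e.g. an error page returned after an upstream timeout), instead of
    raising json.decoder.JSONDecodeError deep inside result parsing.
    """
    if not body or not body.strip():
        return None
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return None

# An empty body (what a timed-out request may leave behind) yields None
print(parse_completion_body(""))  # → None
# A well-formed completion payload decodes normally
print(parse_completion_body('{"choices": [{"text": "ok"}]}'))
```

Skipping or logging such responses lets the parsing stage finish for the remaining requests instead of aborting the whole benchmark run.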