
[CodeGen] Aligned the output format and fixed acc benchmark issues.

Open Zhenzhong1 opened this issue 6 months ago • 2 comments

Description

  • Fixed the output format issue.
  • Validated the acc benchmark pipeline.
  • Refined README

Issues

  • Fixed the issue: https://github.com/opea-project/GenAIEval/issues/292

Type of change

List the type of change like below. Please delete options that are not relevant.

  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds new functionality)
  • [ ] Breaking change (fix or feature that would break existing design and interface)
  • [ ] Others (enhancement, documentation, validation, etc.)

Dependencies

N/A

Tests

test screenshots:

image

image

Zhenzhong1 avatar May 21 '25 04:05 Zhenzhong1

Dependency Review

✅ No vulnerabilities or license issues found.

Scanned Files

None

github-actions[bot] avatar May 21 '25 04:05 github-actions[bot]

ready for merge. @lvliang-intel @ZePan110

Zhenzhong1 avatar May 21 '25 09:05 Zhenzhong1

How can I try this fix?

hsyrjaos avatar May 27 '25 11:05 hsyrjaos

How can I try this fix?

You can apply the changes to CodeGen/codegen.py from this PR to your code, rebuild the megaservice Docker image (opea/codegen:latest), then restart the services with docker compose and try again.
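Once the services are back up, one way to "try again" is to send a single prompt to the megaservice and look at the raw output. The sketch below is only an illustration: the port (7778), the path (/v1/codegen) and the `messages` field are assumptions based on the CodeGen README and may differ in your deployment, and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumed defaults; adjust host, port, and path to your own deployment.
DEFAULT_BASE_URL = "http://localhost:7778"
CODEGEN_PATH = "/v1/codegen"


def build_codegen_request(
    prompt: str, base_url: str = DEFAULT_BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the prompt as JSON."""
    payload = json.dumps({"messages": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + CODEGEN_PATH,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 600.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `send(build_codegen_request("Write a function that reverses a string."))` against a running deployment returns the body as text, so you can check the output format by eye before running the full benchmark.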

wangkl2 avatar May 27 '25 13:05 wangkl2

I've verified that this PR fixes the original output format issue: the accuracy benchmark now runs to completion for Qwen/Qwen2.5-Coder-7B-Instruct and Qwen/CodeQwen1.5-7B-Chat. However, some issues still need to be fixed:

  • The pass@1 score of Qwen/Qwen2.5-Coder-7B-Instruct (which is our default model for docker compose deployment) on the HumanEval task via the CodeGen service endpoint is only 18%-20%. According to the official Qwen2.5-coder-family blog, however, the HumanEval score for Qwen/Qwen2.5-Coder-7B-Instruct is 88.4%, and the one for Qwen/Qwen2.5-Coder-32B-Instruct is 92.7%.
  • I cannot reproduce the result for Qwen/Qwen2.5-Coder-32B-Instruct here: the acc benchmark does not finish within 10 minutes and exits with a timeout error at only 48% progress. After increasing the response timeout threshold from 600s to 3600s in L97, all 164 programming requests complete, but result parsing still fails with json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0). Log from llm-codegen-vllm-server:
[2025-05-28 09:51:15,005] [   ERROR] - llm - Error during LLM invocation: Request timed out.
INFO:     172.18.0.1:42952 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 72, in map_httpcore_exceptions
    yield
  File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 377, in handle_async_request
    resp = await self._pool.handle_async_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 256, in handle_async_request
    raise exc from None
  File "/usr/local/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 236, in handle_async_request
    response = await connection.handle_async_request(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_async/connection.py", line 103, in handle_async_request
    return await self._connection.handle_async_request(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_async/http11.py", line 136, in handle_async_request
    raise exc
  File "/usr/local/lib/python3.11/site-packages/httpcore/_async/http11.py", line 106, in handle_async_request
    ) = await self._receive_response_headers(**kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_async/http11.py", line 177, in _receive_response_headers
    event = await self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_async/http11.py", line 217, in _receive_event
    data = await self._network_stream.read(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 32, in read
    with map_exceptions(exc_map):
  File "/usr/local/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ReadTimeout
...
  File "/usr/local/lib/python3.11/site-packages/openai/_base_client.py", line 1595, in _request
    raise APITimeoutError(request=request) from err
openai.APITimeoutError: Request timed out.
  • For Qwen/CodeQwen1.5-7B-Chat, which is the reference model in the README of the CodeGen benchmark, the resulting score (76.2%) is higher than the reference.
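
On the JSONDecodeError in the second item: "Expecting value: line 1 column 1 (char 0)" is the message Python's json module gives for an empty body or a plain-text body such as "Internal Server Error". One possible cause is therefore that a failed request left a non-JSON entry among the results. A minimal sketch of tolerant parsing, assuming the parser receives raw response bodies as strings (the helper names are hypothetical and not taken from the benchmark code):

```python
import json
from typing import Any, List, Tuple


def try_parse_json(body: str) -> Tuple[bool, Any]:
    """Return (True, value) if body is valid JSON, else (False, None)."""
    try:
        return True, json.loads(body)
    except json.JSONDecodeError:
        return False, None


def partition_bodies(bodies: List[str]) -> Tuple[List[Any], List[int]]:
    """Split raw bodies into parsed values and the indices that failed to parse."""
    parsed: List[Any] = []
    failed: List[int] = []
    for index, body in enumerate(bodies):
        ok, value = try_parse_json(body)
        if ok:
            parsed.append(value)
        else:
            failed.append(index)
    return parsed, failed


def decode_error_message(body: str) -> str:
    """Return the JSONDecodeError message for body, or '' if it parses."""
    try:
        json.loads(body)
    except json.JSONDecodeError as exc:
        return str(exc)
    return ""
```

Logging the indices returned in `failed` would show which of the 164 requests produced an unparsable body, instead of aborting the whole run on the first one.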

wangkl2 avatar May 27 '25 13:05 wangkl2