GenAIExamples
[Bug] - CodeGen: Gets stuck in a loop, generating the same output until the maximum token limit is reached (Model: CodeLlama)
Priority
P1-Stopper
OS type
Ubuntu
Hardware type
Xeon-SPR
Installation method
- [x] Pull docker images from hub.docker.com
- [ ] Build docker images from source
- [ ] Other
- [ ] N/A
Deploy method
- [ ] Docker
- [x] Docker Compose
- [ ] Kubernetes Helm Charts
- [ ] Kubernetes GMC
- [ ] Other
- [ ] N/A
Running nodes
Single Node
What's the version?
| NAME | IMAGE | COMMAND | SERVICE | CREATED | STATUS | PORTS |
|------|-------|---------|---------|---------|--------|-------|
| codegen-xeon-backend-server | opea/codegen:latest | "python codegen.py" | codegen-xeon-backend-server | 9 minutes ago | Up 8 minutes | 0.0.0.0:7778->7778/tcp, :::7778->7778/tcp |
| codegen-xeon-ui-server | opea/codegen-gradio-ui:latest | "python codegen_ui_g…" | codegen-xeon-ui-server | 9 minutes ago | Up 8 minutes | 0.0.0.0:5173->5173/tcp, :::5173->5173/tcp |
| dataprep-redis-server | opea/dataprep:latest | "sh -c 'python $( [ …" | dataprep-redis-server | 9 minutes ago | Up 9 minutes (healthy) | 0.0.0.0:6007->5000/tcp, [::]:6007->5000/tcp |
| llm-codegen-vllm-server | opea/llm-textgen:latest | "bash entrypoint.sh" | llm-vllm-service | 9 minutes ago | Up 8 minutes | 0.0.0.0:9000->9000/tcp, :::9000->9000/tcp |
| llm-textgen-server | opea/llm-textgen:latest | "bash entrypoint.sh" | llm-base | 9 minutes ago | Up 9 minutes | |
| redis-vector-db | redis/redis-stack:7.2.0-v9 | "/entrypoint.sh" | redis-vector-db | 9 minutes ago | Up 9 minutes | 0.0.0.0:6379->6379/tcp, :::6379->6379/tcp, 0.0.0.0:8001->8001/tcp, :::8001->8001/tcp |
| retriever-redis | opea/retriever:latest | "python opea_retriev…" | retriever-redis | 9 minutes ago | Up 9 minutes | 0.0.0.0:7000->7000/tcp, :::7000->7000/tcp |
| tei-embedding-server | opea/embedding:latest | "sh -c 'python $( [ …" | tei-embedding-server | 9 minutes ago | Up 9 minutes | 0.0.0.0:6000->6000/tcp, :::6000->6000/tcp |
| tei-embedding-serving | ghcr.io/huggingface/text-embeddings-inference:cpu-1.5 | "/bin/sh -c 'apt-get…" | tei-embedding-serving | 9 minutes ago | Up 9 minutes (healthy) | 0.0.0.0:8090->80/tcp, [::]:8090->80/tcp |
| vllm-server | opea/vllm:latest | "python3 -m vllm.ent…" | vllm-service | 9 minutes ago | Up 9 minutes (healthy) | 0.0.0.0:8028->80/tcp, [::]:8028->80/tcp |
Description
model = "codellama/CodeLlama-7b-Python-hf"
command: "curl http://
Run also with Gradio UI same behavior.
Gets stuck in a loop, generating the same output until the maximum token limit is reached.
this happens very often with codellama model but not every time.
See output of run 1:

Prompt: "Write a Python function that generates fibonnaci sequence"

Output:

```
 up to n numbers.

def fibonacci(n):
    a = 0
    b = 1
    for i in range(n):
        print(a, end=" ")
        c = a + b
        a = b
        b = c

fibonacci(10)
```

See output of run 2:

Prompt: "Write a Python function that generates fibonnaci sequence"

Output:

```
 up to n numbers.

def fibonacci(n):
    a = 0
    b = 1
    for i in range(n):
        print(a)
        a, b = b, a + b

fibonacci(10)

Write a Python function that generates fibonnaci sequence up to n numbers.

def fibonacci(n):
    a = 0
    b = 1
    for i in range(n):
        print(a)
        a, b = b, a + b

fibonacci(10)

Write a Python function that generates fibonnaci sequence up to n numbers.

def fibonacci(n):
    a = 0
    b = 1
    for i in range(n):
        print(a)
        a, b = b, a + b

fibonacci(10)

Write a Python function that generates fibonnaci sequence up to n numbers.

def fibonacci(n):
    a = 0
    b = 1
    for i in range(n):
        print(a)
        a, b = b, a + b

fibonacci(10)

Write a Python function that generates fibonnaci sequence up to n numbers.

def fibonacci(n):
    a = 0
    b = 1
    for i in range(n):
        print(a)
        a, b = b, a + b

fibonacci(10)

Write a Python function that generates fibonnaci sequence up to n numbers.

def fibonacci(n):
    a = 0
    b = 1
    for i in range(n):
        print(a)
        a, b = b, a + b

fibonacci(10)

Write a Python function that generates fibonnaci sequence up to n numbers.

def fibonacci(n):
    a = 0
    b = 1
    for i in range(n):
        print(a)
        a, b = b, a + b

fibonacci(10)
...
```
Reproduce steps
"curl http://:7778/v1/codegen -H "Content-Type: application/json" -d '{"messages": "Write a Python function that generates fibonnaci sequence."}'"
Raw log
Attachments
No response
This is an output quality issue of the AI model.
From the workflow side, we cannot change the results. My suggestion is to use a better model; I recommend trying the Qwen model "Qwen/Qwen2.5-Coder-7B-Instruct".
While Qwen is a good model, we need to make sure other models also work with CodeGen in OPEA.
Could the CodeLlama bug be an old one? See https://stackoverflow.com/questions/76772509/llama-2-7b-hf-repeats-context-of-question-directly-from-input-prompt-cuts-off-w#:~:text=To%20address%20this%20issue%2C%20you%20need%20a%20way,find%20the%20length%20of%20the%20prompt%20token%20ids and https://github.com/meta-llama/codellama/issues/89
We might need some model-specific code if there is no consistent API. I wonder if the Llama Stack SDK has something to tackle this.
set return_full_text=False to only get generated text (and not the input + generated text).
Would that break things if used also with e.g. Qwen?
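To make the suggestion concrete, here is a minimal local sketch using the Hugging Face text-generation pipeline (the model choice is arbitrary, and this is the HuggingFacePipeline-style path, not the OPEA serving path, which goes through TGI/vLLM as noted in the next reply):

```python
# Minimal sketch: return_full_text=False keeps only the newly generated tokens,
# so the prompt is not echoed back in the returned string.
from transformers import pipeline

# Model choice is illustrative; any causal LM works the same way here.
pipe = pipeline("text-generation", model="Qwen/Qwen2.5-Coder-7B-Instruct")

prompt = "Write a Python function that generates fibonnaci sequence."
result = pipe(prompt, max_new_tokens=256, return_full_text=False)
print(result[0]["generated_text"])  # generated text only, without the input prompt
```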
> set return_full_text=False to only get generated text (and not the input + generated text).
> Would that break things if used also with e.g. Qwen?
return_full_text=False is an input parameter of HuggingFacePipeline. OPEA does not use this API; the microservice runs as a service, so TGI/vLLM is used.
> We might need some model-specific code if there is no consistent API. I wonder if the Llama Stack SDK has something to tackle this. https://github.com/meta-llama/codellama/issues
Models such as codellama/CodeLlama-7b-Python-hf are fine-tuned from the base model to move from natural-language generation to (computer) code generation. It is a code-completion model, so your input prompt should be source code, and the model then generates the next source-code tokens. If your prompt is an instruction, e.g. "Write a Python function that generates fibonnaci sequence", the context is not source code. The model is not an instruction-following model, so its behavior may not be what you expect. Looking through the codellama issues, you can see reports of random output, generation that cannot stop until it reaches max tokens, etc.
You can try an instruction-tuned model, which is trained with instructions as input and code as output; I expect such a model to behave better. BTW, Qwen 3 models are instruction-following even without "Instruct" in the model name.
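As a rough illustration of this difference, here is a sketch only, assuming local transformers inference rather than the OPEA/vLLM deployment: a completion model wants a code prefix, while an instruction-tuned model wants the chat template applied.

```python
# Sketch: completion-style prompting vs. instruction-style prompting.
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate(model_id: str, prompt: str, max_new_tokens: int = 128) -> str:
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens)
    # Return only the newly generated portion.
    return tok.decode(out[0][ids.shape[-1]:], skip_special_tokens=True)

# Code-completion model: the prompt itself should already be source code.
code_prefix = 'def fibonacci(n):\n    """Return the first n Fibonacci numbers."""\n'
print(generate("codellama/CodeLlama-7b-Python-hf", code_prefix))

# Instruction-tuned model: wrap the natural-language request in its chat template.
instruct_id = "Qwen/Qwen2.5-Coder-7B-Instruct"
tok = AutoTokenizer.from_pretrained(instruct_id)
chat_prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Write a Python function that generates fibonnaci sequence."}],
    tokenize=False,
    add_generation_prompt=True,
)
print(generate(instruct_id, chat_prompt))
```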
Thanks for the reply. I tried the Instruct 7B model; it seems to give slightly better answers, but it still tends to get stuck in a loop, repeating parts of the response until the max token limit is reached. According to the Hugging Face model page, the Instruct model is intended to also support chat mode, whereas the others are designed for code completion. I'll try out some of the larger instruct models as well.
I have root-caused the issue.
It is the CodeLlama model's input prompt template that causes this issue.
Here are my answers to the questions:
1. Why is CodeLlama not stable?
The CodeLlama model does code completion/infilling; it is NOT instruction fine-tuned. So it does not produce the output you expect for instructions like "Write a Python function that generates fibonnaci sequence"; it works when the (input) context is source code.
2. Why does the CodeLlama-Instruct model keep generating the same output until the maximum token limit is reached?
I investigated it. The conclusion is that the CodeLlama model services are sensitive to the prompt (template format).
I reproduced the issue with the Hugging Face transformers API using both the CodeLlama and CodeLlama-Instruct models. However, it cannot be reproduced with the Hugging Face pipeline interface.
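For reference, a sketch of the two reproduction paths described above (the exact script, model revision, and generation settings used in the investigation are not shown in this thread, so these are assumptions):

```python
# Sketch: (1) raw transformers generate() on the untemplated instruction,
# (2) the text-generation pipeline over the same model and prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "meta-llama/CodeLlama-7b-Instruct-hf"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
prompt = "Write a Python function that generates fibonnaci sequence."

# (1) transformers API: tokenize the raw instruction and call generate() directly.
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=1024)
print(tok.decode(out[0], skip_special_tokens=True))

# (2) pipeline interface: hand the same string to the text-generation pipeline.
pipe = pipeline("text-generation", model=model, tokenizer=tok)
print(pipe(prompt, max_new_tokens=1024)[0]["generated_text"])
```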
From the OPEA interface, I did some experiments:
a. Reproduces the issue; output KO:

```
curl http://${HOST_IP}:7778/v1/codegen \
  -H "Content-Type: application/json" \
  -d '{"messages": "Write a Python function that generates fibonnaci sequence."}'
```
Ablation analysis: after checking all the sampling parameters between the transformers API and the OPEA interface, they are not the cause of the issue. (presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.1, temperature=0.01, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None)
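For completeness, one way to pin these parameters explicitly while testing is to call the vLLM OpenAI-compatible endpoint directly (exposed on port 8028 in the compose setup above). This is a hedged sketch: fields such as top_k and repetition_penalty are vLLM pass-through extensions, not standard OpenAI parameters, and support may differ by vLLM version.

```python
# Sketch: send a chat request straight to the vLLM endpoint with sampling
# parameters pinned, to rule them out as the cause of the looping output.
import os
import requests

payload = {
    "model": "meta-llama/CodeLlama-7b-Instruct-hf",
    "messages": [
        {"role": "user", "content": "Write a Python function that generates fibonnaci sequence."}
    ],
    "max_tokens": 1024,
    "temperature": 0.01,
    "top_p": 0.95,
    "top_k": 20,                # vLLM extension (assumption: supported by this build)
    "repetition_penalty": 1.1,  # vLLM extension (assumption: supported by this build)
}

resp = requests.post(
    f"http://{os.environ.get('HOST_IP', 'localhost')}:8028/v1/chat/completions",
    json=payload,
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```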
Tried the LLM microservice interface:
b. Reproduces the issue; output KO:

```
curl http://${HOST_IP}:9000/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"model": "meta-llama/CodeLlama-7b-Instruct-hf", "messages": "Write a Python function that generates fibonnaci sequence.", "max_tokens":1024}'
```
c. Cannot reproduce the issue; output OK:

```
curl http://${HOST_IP}:9000/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"model": "meta-llama/CodeLlama-7b-Instruct-hf", "messages": [{"role": "user", "content": "Write a Python function that generates fibonnaci sequence."}], "max_tokens":1024}'
```
Ablation analysis: I then used the message format from item c above with the transformers API interface; the output seems better but still not as expected.
prompt = '{"role": "user", "content": "Write a Python function that generates fibonnaci sequence."}'
So I confirmed that CodeLlama is sensitive to the prompt. I cannot find such information in the [Hugging Face model card](https://huggingface.co/meta-llama/CodeLlama-7b-Instruct-hf) or in https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code at all.
Checking the vLLM log, the input prompts of the OK and KO cases are different:
[{'role': 'user', 'content': 'Write a Python function that generates fibonnaci sequence.'}]
vLLM service log: OK
```
INFO 05-23 10:51:30 [logger.py:39] Received request chatcmpl-acbeabc6c24d4aee899b3d94707905b4: prompt: '<s>[INST] Write a Python function that generates fibonnaci sequence. [/INST]', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.01, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
```
vLLM service log: KO
```
INFO 05-23 10:47:58 [logger.py:39] Received request cmpl-7f916f8e093f4869bff6be32f7447c32-0: prompt: 'Write a Python function that generates fibonnaci sequence.', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.75, top_p=0.95, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: [1, 14350, 263, 5132, 740, 393, 16785, 18755, 11586, 455, 5665, 29889], lora_request: None, prompt_adapter_request: None.
```
Trying the vLLM prompt in the transformers API interface, i.e. setting the prompt to the format "<s>[INST] Messages [/INST]", the output is OK now.
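The same "<s>[INST] ... [/INST]" prompt can be produced from the model's own chat template instead of being hard-coded; a minimal sketch (not the OPEA code path):

```python
# Sketch: build the CodeLlama-Instruct prompt via the tokenizer's chat template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/CodeLlama-7b-Instruct-hf")
messages = [
    {"role": "user", "content": "Write a Python function that generates fibonnaci sequence."}
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # expected to resemble: <s>[INST] Write a Python function ... [/INST]
```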
Regarding the prompt-template finding, @wangkl2 found a blog post that confirms our conclusion: https://statics.teams.cdn.office.net/evergreen-assets/safelinks/1/atp-safelinks.html
Conclusion:
- The CodeLlama model is a code completion/infilling model; it is NOT instruction fine-tuned.
- The CodeLlama-Instruct model does not work as expected because the CodeLlama model services are sensitive to the prompt (template format).
- This issue will be fixed soon.
About the CodeLlama model
It is not recommended to use the CodeLlama model; the quality (score) of CodeLlama is not good.
Please refer to the Hugging Face bigcode-models-leaderboard. Be careful with the 'instruct' filter: some instruct models do not have 'instruct' in their names, but when you select 'instruct', the leaderboard lists the scores of all instruction-following models.
We recommend the Qwen model.
FYR, another set of benchmark results: https://hub.athina.ai/blogs/top-open-source-models-for-code-generation-in-2025