GenAIExamples
[Bug] - CodeGen: Gets stuck in a loop, generating the same output until the maximum token limit is reached (Model: CodeLlama)
Priority
P1-Stopper
OS type
Ubuntu
Hardware type
Xeon-SPR
Installation method
- [x] Pull docker images from hub.docker.com
- [ ] Build docker images from source
- [ ] Other
- [ ] N/A
Deploy method
- [ ] Docker
- [x] Docker Compose
- [ ] Kubernetes Helm Charts
- [ ] Kubernetes GMC
- [ ] Other
- [ ] N/A
Running nodes
Single Node
What's the version?
| NAME | IMAGE | COMMAND | SERVICE | CREATED | STATUS | PORTS |
|------|-------|---------|---------|---------|--------|-------|
| codegen-xeon-backend-server | opea/codegen:latest | "python codegen.py" | codegen-xeon-backend-server | 9 minutes ago | Up 8 minutes | 0.0.0.0:7778->7778/tcp, :::7778->7778/tcp |
| codegen-xeon-ui-server | opea/codegen-gradio-ui:latest | "python codegen_ui_g…" | codegen-xeon-ui-server | 9 minutes ago | Up 8 minutes | 0.0.0.0:5173->5173/tcp, :::5173->5173/tcp |
| dataprep-redis-server | opea/dataprep:latest | "sh -c 'python $( [ …" | dataprep-redis-server | 9 minutes ago | Up 9 minutes (healthy) | 0.0.0.0:6007->5000/tcp, [::]:6007->5000/tcp |
| llm-codegen-vllm-server | opea/llm-textgen:latest | "bash entrypoint.sh" | llm-vllm-service | 9 minutes ago | Up 8 minutes | 0.0.0.0:9000->9000/tcp, :::9000->9000/tcp |
| llm-textgen-server | opea/llm-textgen:latest | "bash entrypoint.sh" | llm-base | 9 minutes ago | Up 9 minutes | |
| redis-vector-db | redis/redis-stack:7.2.0-v9 | "/entrypoint.sh" | redis-vector-db | 9 minutes ago | Up 9 minutes | 0.0.0.0:6379->6379/tcp, :::6379->6379/tcp, 0.0.0.0:8001->8001/tcp, :::8001->8001/tcp |
| retriever-redis | opea/retriever:latest | "python opea_retriev…" | retriever-redis | 9 minutes ago | Up 9 minutes | 0.0.0.0:7000->7000/tcp, :::7000->7000/tcp |
| tei-embedding-server | opea/embedding:latest | "sh -c 'python $( [ …" | tei-embedding-server | 9 minutes ago | Up 9 minutes | 0.0.0.0:6000->6000/tcp, :::6000->6000/tcp |
| tei-embedding-serving | ghcr.io/huggingface/text-embeddings-inference:cpu-1.5 | "/bin/sh -c 'apt-get…" | tei-embedding-serving | 9 minutes ago | Up 9 minutes (healthy) | 0.0.0.0:8090->80/tcp, [::]:8090->80/tcp |
| vllm-server | opea/vllm:latest | "python3 -m vllm.ent…" | vllm-service | 9 minutes ago | Up 9 minutes (healthy) | 0.0.0.0:8028->80/tcp, [::]:8028->80/tcp |
Description
model = "codellama/CodeLlama-7b-Python-hf"
command: "curl http://
Run also with Gradio UI same behavior.
Gets stuck in a loop, generating the same output until the maximum token limit is reached.
this happens very often with codellama model but not every time.
See output of run 1:

Prompt: "Write a Python function that generates fibonnaci sequence"

Output:

```
 up to n numbers.

def fibonacci(n):
    a = 0
    b = 1
    for i in range(n):
        print(a, end=" ")
        c = a + b
        a = b
        b = c

fibonacci(10)
```

See output of run 2:

Prompt: "Write a Python function that generates fibonnaci sequence"

Output:

```
 up to n numbers.

def fibonacci(n):
    a = 0
    b = 1
    for i in range(n):
        print(a)
        a, b = b, a + b

fibonacci(10)

Write a Python function that generates fibonnaci sequence up to n numbers.

def fibonacci(n):
    a = 0
    b = 1
    for i in range(n):
        print(a)
        a, b = b, a + b

fibonacci(10)

Write a Python function that generates fibonnaci sequence up to n numbers.

def fibonacci(n):
    a = 0
    b = 1
    for i in range(n):
        print(a)
        a, b = b, a + b

fibonacci(10)

Write a Python function that generates fibonnaci sequence up to n numbers.

def fibonacci(n):
    a = 0
    b = 1
    for i in range(n):
        print(a)
        a, b = b, a + b

fibonacci(10)

Write a Python function that generates fibonnaci sequence up to n numbers.

def fibonacci(n):
    a = 0
    b = 1
    for i in range(n):
        print(a)
        a, b = b, a + b

fibonacci(10)

Write a Python function that generates fibonnaci sequence up to n numbers.

def fibonacci(n):
    a = 0
    b = 1
    for i in range(n):
        print(a)
        a, b = b, a + b

fibonacci(10)

Write a Python function that generates fibonnaci sequence up to n numbers.

def fibonacci(n):
    a = 0
    b = 1
    for i in range(n):
        print(a)
        a, b = b, a + b

fibonacci(10)
...
```
Reproduce steps
"curl http://:7778/v1/codegen -H "Content-Type: application/json" -d '{"messages": "Write a Python function that generates fibonnaci sequence."}'"
Raw log
Attachments
No response
This is an output quality issue of the AI model.
From the workflow side, we cannot change the results. My suggestion is to use a better model; I recommend trying the Qwen model "Qwen/Qwen2.5-Coder-7B-Instruct".
While Qwen is a good model, we need to make sure other models also work with CodeGen in OPEA.
Could the CodeLlama bug be an old one? See https://stackoverflow.com/questions/76772509/llama-2-7b-hf-repeats-context-of-question-directly-from-input-prompt-cuts-off-w#:~:text=To%20address%20this%20issue%2C%20you%20need%20a%20way,find%20the%20length%20of%20the%20prompt%20token%20ids and https://github.com/meta-llama/codellama/issues/89
We might need some model-specific code if there is no consistent API. I wonder if the Llama Stack SDK has something to tackle this.
set return_full_text=False to only get generated text (and not the input + generated text).
Would that break things if used also with e.g. Qwen?
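To make the suggestion concrete, here is a minimal local sketch using the Hugging Face text-generation pipeline (the model choice is arbitrary, and this is the HuggingFacePipeline-style path, not the OPEA serving path, which goes through TGI/vLLM as noted in the next reply):

```python
# Minimal sketch: return_full_text=False keeps only the newly generated tokens,
# so the prompt is not echoed back in the returned string.
from transformers import pipeline

# Model choice is illustrative; any causal LM works the same way here.
pipe = pipeline("text-generation", model="Qwen/Qwen2.5-Coder-7B-Instruct")

prompt = "Write a Python function that generates fibonnaci sequence."
result = pipe(prompt, max_new_tokens=256, return_full_text=False)
print(result[0]["generated_text"])  # generated text only, without the input prompt
```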
> set return_full_text=False to only get generated text (and not the input + generated text).
> Would that break things if used also with e.g. Qwen?
return_full_text=False is an input parameter of HuggingFacePipeline. OPEA does not use this API; the microservice runs as a service, so TGI/vLLM is used.
> We might need some model-specific code if there is no consistent API. I wonder if the Llama Stack SDK has something to tackle this. https://github.com/meta-llama/codellama/issues
Models such as codellama/CodeLlama-7b-Python-hf are fine-tuned from the base model to move from natural-language generation to (computer) code generation. It is a code-completion model, so your input prompt should be source code, and the model then generates the next source-code tokens. If your prompt is an instruction, e.g. "Write a Python function that generates fibonnaci sequence", the context is not source code. The model is not an instruction-following model, so its behavior may not be what you expect. Looking through the codellama issues, you can see reports of random output, generation that cannot stop until it reaches max tokens, etc.
You can try an instruction-tuned model, which is trained with instructions as input and code as output; I expect such a model to behave better. BTW, Qwen 3 models are instruction-following even without "Instruct" in the model name.
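As a rough illustration of this difference, here is a sketch only, assuming local transformers inference rather than the OPEA/vLLM deployment: a completion model wants a code prefix, while an instruction-tuned model wants the chat template applied.

```python
# Sketch: completion-style prompting vs. instruction-style prompting.
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate(model_id: str, prompt: str, max_new_tokens: int = 128) -> str:
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens)
    # Return only the newly generated portion.
    return tok.decode(out[0][ids.shape[-1]:], skip_special_tokens=True)

# Code-completion model: the prompt itself should already be source code.
code_prefix = 'def fibonacci(n):\n    """Return the first n Fibonacci numbers."""\n'
print(generate("codellama/CodeLlama-7b-Python-hf", code_prefix))

# Instruction-tuned model: wrap the natural-language request in its chat template.
instruct_id = "Qwen/Qwen2.5-Coder-7B-Instruct"
tok = AutoTokenizer.from_pretrained(instruct_id)
chat_prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Write a Python function that generates fibonnaci sequence."}],
    tokenize=False,
    add_generation_prompt=True,
)
print(generate(instruct_id, chat_prompt))
```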
Thanks for the reply. I tried the Instruct 7B model; it seems to give slightly better answers, but it still tends to get stuck in a loop, repeating parts of the response until the max token limit is reached. According to the Hugging Face model page, the Instruct model is intended to also support chat mode, whereas the others are designed for code completion. I'll try out some of the larger instruct models as well.
I have root-caused the issue.
It is the CodeLlama model's input prompt template that causes this issue.
Here are my answers to the questions:
1. Why is CodeLlama not stable?
The CodeLlama model does code completion/infilling; it is NOT instruction fine-tuned. So it does not produce the output you expect for instructions like "Write a Python function that generates fibonnaci sequence"; it works when the (input) context is source code.
2. Why does the CodeLlama-Instruct model keep generating the same output until the maximum token limit is reached?
I investigated it. The conclusion is that the CodeLlama model services are sensitive to the prompt (template format).
I reproduced the issue with the Hugging Face transformers API using both the CodeLlama and CodeLlama-Instruct models. However, it cannot be reproduced with the Hugging Face pipeline interface.
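For reference, a sketch of the two reproduction paths described above (the exact script, model revision, and generation settings used in the investigation are not shown in this thread, so these are assumptions):

```python
# Sketch: (1) raw transformers generate() on the untemplated instruction,
# (2) the text-generation pipeline over the same model and prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "meta-llama/CodeLlama-7b-Instruct-hf"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
prompt = "Write a Python function that generates fibonnaci sequence."

# (1) transformers API: tokenize the raw instruction and call generate() directly.
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=1024)
print(tok.decode(out[0], skip_special_tokens=True))

# (2) pipeline interface: hand the same string to the text-generation pipeline.
pipe = pipeline("text-generation", model=model, tokenizer=tok)
print(pipe(prompt, max_new_tokens=1024)[0]["generated_text"])
```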
From the OPEA interface, I did some experiments:
a. Reproduces the issue; output KO:

```
curl http://${HOST_IP}:7778/v1/codegen \
  -H "Content-Type: application/json" \
  -d '{"messages": "Write a Python function that generates fibonnaci sequence."}'
```
Ablation analysis: after checking all the sampling parameters between the transformers API and the OPEA interface, they are not the cause of the issue. (presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.1, temperature=0.01, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None)
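For completeness, one way to pin these parameters explicitly while testing is to call the vLLM OpenAI-compatible endpoint directly (exposed on port 8028 in the compose setup above). This is a hedged sketch: fields such as top_k and repetition_penalty are vLLM pass-through extensions, not standard OpenAI parameters, and support may differ by vLLM version.

```python
# Sketch: send a chat request straight to the vLLM endpoint with sampling
# parameters pinned, to rule them out as the cause of the looping output.
import os
import requests

payload = {
    "model": "meta-llama/CodeLlama-7b-Instruct-hf",
    "messages": [
        {"role": "user", "content": "Write a Python function that generates fibonnaci sequence."}
    ],
    "max_tokens": 1024,
    "temperature": 0.01,
    "top_p": 0.95,
    "top_k": 20,                # vLLM extension (assumption: supported by this build)
    "repetition_penalty": 1.1,  # vLLM extension (assumption: supported by this build)
}

resp = requests.post(
    f"http://{os.environ.get('HOST_IP', 'localhost')}:8028/v1/chat/completions",
    json=payload,
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```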
Tried the LLM microservice interface:
b. Reproduces the issue; output KO:

```
curl http://${HOST_IP}:9000/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"model": "meta-llama/CodeLlama-7b-Instruct-hf", "messages": "Write a Python function that generates fibonnaci sequence.", "max_tokens":1024}'
```
c. Cannot reproduce the issue; output OK:

```
curl http://${HOST_IP}:9000/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"model": "meta-llama/CodeLlama-7b-Instruct-hf", "messages": [{"role": "user", "content": "Write a Python function that generates fibonnaci sequence."}], "max_tokens":1024}'
```
Ablation analysis: I then used the message format from item c above with the transformers API interface; the output seems better but still not as expected.
prompt = '{"role": "user", "content": "Write a Python function that generates fibonnaci sequence."}'
So I confirmed that CodeLlama is sensitive to the prompt. I cannot find such information in the [Hugging Face model card](https://huggingface.co/meta-llama/CodeLlama-7b-Instruct-hf) or in https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code at all.
Checking the vLLM log, the input prompts of the OK and KO cases are different:
[{'role': 'user', 'content': 'Write a Python function that generates fibonnaci sequence.'}]
vLLM service log: OK
```
INFO 05-23 10:51:30 [logger.py:39] Received request chatcmpl-acbeabc6c24d4aee899b3d94707905b4: prompt: '<s>[INST] Write a Python function that generates fibonnaci sequence. [/INST]', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.01, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
```
vLLM service log: KO
```
INFO 05-23 10:47:58 [logger.py:39] Received request cmpl-7f916f8e093f4869bff6be32f7447c32-0: prompt: 'Write a Python function that generates fibonnaci sequence.', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.75, top_p=0.95, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: [1, 14350, 263, 5132, 740, 393, 16785, 18755, 11586, 455, 5665, 29889], lora_request: None, prompt_adapter_request: None.
```
Trying the vLLM prompt in the transformers API interface, i.e. setting the prompt to the format "<s>[INST] Messages [/INST]", the output is OK now.
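The same "<s>[INST] ... [/INST]" prompt can be produced from the model's own chat template instead of being hard-coded; a minimal sketch (not the OPEA code path):

```python
# Sketch: build the CodeLlama-Instruct prompt via the tokenizer's chat template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/CodeLlama-7b-Instruct-hf")
messages = [
    {"role": "user", "content": "Write a Python function that generates fibonnaci sequence."}
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # expected to resemble: <s>[INST] Write a Python function ... [/INST]
```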
Regarding the prompt-template finding, @wangkl2 found a blog post that confirms our conclusion: https://statics.teams.cdn.office.net/evergreen-assets/safelinks/1/atp-safelinks.html
Conclusion:
- The CodeLlama model is a code completion/infilling model; it is NOT instruction fine-tuned.
- The CodeLlama-Instruct model does not work as expected because the CodeLlama model services are sensitive to the prompt (template format).
- This issue will be fixed soon.
About the CodeLlama model
It is not recommended to use the CodeLlama model; the quality (score) of CodeLlama is not good.
Please refer to the Hugging Face bigcode-models-leaderboard. Be careful with the 'instruct' filter: some instruct models do not have 'instruct' in their names, but when you select 'instruct', the leaderboard lists the scores of all instruction-following models.
We recommend the Qwen model.
FYR, another set of benchmark results: https://hub.athina.ai/blogs/top-open-source-models-for-code-generation-in-2025