create_chat_completion is stuck in versions 0.2.84 and 0.2.85 on Apple Silicon
Prerequisites
Version 0.2.84 or 0.2.85, using the create_chat_completion method. Tried several different GGUF models.
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
create_chat_completion should return a result as described in the documentation.
Current Behavior
Inference hangs (I let it run for 5 minutes). After downgrading to version 0.2.83, everything runs without a single change to the code.
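For reference, the downgrade was just a version pin (a sketch, assuming a standard pip install from PyPI; add your usual Metal build flags if you build from source):

```bash
# Temporary workaround: pin the last known-good release.
# --force-reinstall / --no-cache-dir just make sure the pinned wheel
# actually replaces the affected install.
pip install --force-reinstall --no-cache-dir llama-cpp-python==0.2.83
```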
Environment and Context
Mac M1 Max, 32 GB RAM, macOS 14.5, Python 3.12, llama-cpp-python 0.2.84/0.2.85.
Can't recreate this issue with: M1, Python 3.12, 0.2.85, Phi-3 and Llama 3.1.
Hi,
thanks for checking.
When I recreate your test, it works.
The problem seems to occur when using a JSON schema with 0.2.84/0.2.85; it works with 0.2.83.
Please find the Jupyter notebook attached. It locks up on Mac.
Kind regards
Lukáš

(Quoting Shamit Verma's comment above; screenshots: https://github.com/user-attachments/assets/5ae9ca01-5bd9-4881-912b-5c75fe1aa6c8, https://github.com/user-attachments/assets/9fd7792c-bf66-4a5d-80f4-64e77fab2c57)
Don't see the attachment somehow.
No problem, here it is:
```python
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

model_path = "/Users/macmacmac/Documents/CODING/models/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf"

model = Llama(
    model_path=str(model_path),
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=13, max_ngram_size=9),
    n_ctx=8192,
    n_batch=128,
    last_n_tokens_size=128,
    n_gpu_layers=-1,
    f16_kv=True,
    offload_kqv=True,
    flash_attn=True,
    n_threads=2,
    n_threads_batch=2,
    chat_format="chatml",
)

schema = """
{
    "type": "object",
    "properties": {
        "response": {
            "type": "string"
        }
    },
    "required": ["response"]
}"""

completion = model.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant."},
        {
            "role": "user",
            "content": "What is most popular street food in Paris? Answer in JSON. Put your answer in 'response' property. Use schema: {'response': '...'}",
        },
    ],
    max_tokens=-1,
    temperature=0.25,
    top_k=25,
    top_p=0.8,
    min_p=0.025,
    typical_p=0.8,
    tfs_z=0.6,
    mirostat_mode=2,
    mirostat_tau=2.2,
    mirostat_eta=0.025,
    response_format={"type": "json_object", "schema": schema},
)
print(completion)
```
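If it helps to narrow this down, a stripped-down variant of the same call (a sketch only: it reuses the `model` and `schema` from above and drops the extra sampling parameters) should show whether the hang is tied specifically to the JSON-schema `response_format`:

```python
# Sketch: same model and schema as above, minimal sampling settings.
# If this still hangs on 0.2.84/0.2.85 but completes on 0.2.83, the
# JSON-schema response_format path is the likely culprit.
completion = model.create_chat_completion(
    messages=[{"role": "user", "content": "Name one popular street food in Paris."}],
    max_tokens=256,
    response_format={"type": "json_object", "schema": schema},
)
print(completion)
```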
Yup - willing to bet this is fixed in https://github.com/abetlen/llama-cpp-python/pull/1649 - there's a whole cluster of issues that will get cleared with this change.
Version 0.2.87 does not have this issue
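For anyone landing here later, upgrading past the affected releases and confirming the installed version should be enough (a sketch, assuming a pip install):

```bash
# Move off the affected 0.2.84/0.2.85 releases.
pip install --upgrade "llama-cpp-python>=0.2.87"

# Confirm which version is actually being imported.
python -c "import llama_cpp; print(llama_cpp.__version__)"
```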