create_chat_completion is stuck in versions 0.2.84 and 0.2.85 on Apple Silicon
Prerequisites
Version 0.2.84 or 0.2.85, using the create_chat_completion method. Tried several different GGUF models.
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
create_chat_completion should return a result as described in the documentation.
Current Behavior
Inference hangs (I let it run for 5 minutes). After downgrading to version 0.2.83, everything runs without a single change to the code.
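For reference, the downgrade was just a version pin (a sketch, assuming a standard pip install from PyPI; add your usual Metal build flags if you build from source):

```bash
# Temporary workaround: pin the last known-good release.
# --force-reinstall / --no-cache-dir just make sure the pinned wheel
# actually replaces the affected install.
pip install --force-reinstall --no-cache-dir llama-cpp-python==0.2.83
```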
Environment and Context
Mac M1 Max, 32 GB RAM, macOS 14.5, Python 3.12, llama-cpp-python 0.2.84/0.2.85.
Can't recreate this issue with: M1, Python 3.12, 0.2.85, Phi-3 and Llama 3.1.
Hi,
thanks for checking.
When I recreate your test, it works.
The problem seems to occur when using a JSON schema with 0.2.84/0.2.85; it works with 0.2.83.
Please find the Jupyter notebook attached. It locks up on Mac.
Kind regards
Lukáš

(Quoting Shamit Verma's comment above; screenshots: https://github.com/user-attachments/assets/5ae9ca01-5bd9-4881-912b-5c75fe1aa6c8, https://github.com/user-attachments/assets/9fd7792c-bf66-4a5d-80f4-64e77fab2c57)
Don't see the attachment somehow.
No problem, here it is:
```python
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

model_path = "/Users/macmacmac/Documents/CODING/models/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf"

model = Llama(
    model_path=str(model_path),
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=13, max_ngram_size=9),
    n_ctx=8192,
    n_batch=128,
    last_n_tokens_size=128,
    n_gpu_layers=-1,
    f16_kv=True,
    offload_kqv=True,
    flash_attn=True,
    n_threads=2,
    n_threads_batch=2,
    chat_format="chatml",
)

schema = """
{
    "type": "object",
    "properties": {
        "response": {
            "type": "string"
        }
    },
    "required": ["response"]
}"""

completion = model.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant."},
        {
            "role": "user",
            "content": "What is most popular street food in Paris? Answer in JSON. Put your answer in 'response' property. Use schema: {'response': '...'}",
        },
    ],
    max_tokens=-1,
    temperature=0.25,
    top_k=25,
    top_p=0.8,
    min_p=0.025,
    typical_p=0.8,
    tfs_z=0.6,
    mirostat_mode=2,
    mirostat_tau=2.2,
    mirostat_eta=0.025,
    response_format={"type": "json_object", "schema": schema},
)
print(completion)
```
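If it helps to narrow this down, a stripped-down variant of the same call (a sketch only: it reuses the `model` and `schema` from above and drops the extra sampling parameters) should show whether the hang is tied specifically to the JSON-schema `response_format`:

```python
# Sketch: same model and schema as above, minimal sampling settings.
# If this still hangs on 0.2.84/0.2.85 but completes on 0.2.83, the
# JSON-schema response_format path is the likely culprit.
completion = model.create_chat_completion(
    messages=[{"role": "user", "content": "Name one popular street food in Paris."}],
    max_tokens=256,
    response_format={"type": "json_object", "schema": schema},
)
print(completion)
```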
Yup - willing to bet this is fixed in https://github.com/abetlen/llama-cpp-python/pull/1649 - there's a whole cluster of issues that will get cleared with this change.
Version 0.2.87 does not have this issue
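For anyone landing here later, upgrading past the affected releases and confirming the installed version should be enough (a sketch, assuming a pip install):

```bash
# Move off the affected 0.2.84/0.2.85 releases.
pip install --upgrade "llama-cpp-python>=0.2.87"

# Confirm which version is actually being imported.
python -c "import llama_cpp; print(llama_cpp.__version__)"
```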