ik_llama.cpp

Bug: [Server] Verbose flag doesn't log text completion contents.

Open Ph0rk0z opened this issue 1 month ago • 18 comments

What happened?

I run the server with -v, and text completions do not show the prompts I have sent. Chat completions do show up in the terminal, though I'm not sure they're fully logged.

I switched the name behavior in SillyTavern but don't see the name appended to the start of the prompt. Maybe ST is broken in that regard, but I should definitely be able to see the other requests.

Is there anything else I can do besides compiling llama-server with debug? When I do that, it prints individual token stats and it's a bit much.

Name and Version

HEAD

What operating system are you seeing the problem on?

No response

Relevant log output

add_text: <|im_start|>system

vs

INFO [            update_slots] kv cache rm [p0, end) | tid="139703764496384" timestamp=1762478090 id_slot=0 id_task=0 p0=0
INFO [           print_timings] prompt eval time     =    8739.27 ms /  1703 tokens (    5.13 ms per token,   194.87 tokens per second) | tid="139703764496384" timestamp=1762478102 id_slot=0 id_task=0 t_prompt_processing=8739.271 n_prompt_tokens_processed=1703 t_token=5.131691720493247 n_tokens_second=194.86751240463877

Ph0rk0z avatar Nov 07 '25 01:11 Ph0rk0z

Do you want to log the prompt or the final generated message in verbose?

firecoperana avatar Nov 09 '25 00:11 firecoperana

At least log the prompt so I can see what my client is sending and if the template is right, etc.

Ph0rk0z avatar Nov 09 '25 01:11 Ph0rk0z

Is this still not solved?

ikawrakow avatar Nov 10 '25 13:11 ikawrakow

I can see the prompt logged when using text completions:

update_slots] prompt tokenized | tid="39232" timestamp="07:55:02" id_slot=0 id_task=0 n_ctx=32768 n_keep=0 n_prompt_tokens=26 prompt_tokens=.....
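If you only have the IDs from a line like that, they can be turned back into text; a minimal sketch, assuming the server also exposes the usual /detokenize endpoint with a "tokens" field (the IDs below are made up):

    # paste the prompt_tokens array from the log line into "tokens"
    curl -s http://localhost:8080/detokenize \
      -d '{"tokens": [151644, 8948, 198]}' | jq -r .content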

firecoperana avatar Nov 10 '25 13:11 firecoperana

I will pull and check. I was still running DeepSeek with the merged PRs.

Ph0rk0z avatar Nov 10 '25 14:11 Ph0rk0z

 --verbose-prompt --verbosity 2

produces two log messages per output token (the same token in both):

VERB [           process_token] next token | tid="139734367039488" timestamp=1762784078 id_slot=0 id_task=99321 token=5012 token_text=" others" has_next_token=true n_remain=-1 n_decoded=1028 stopped_eos=false stopped_word=false stopped_limit=false stopping_word=""
VERB [       server_sent_event] data stream, to_send: %s | ="data: {\"choices\":[{\"finish_reason\":null,\"index\":0,\"delta\":{\"reasoning_content\":\" others\"}}],\"created\":1762784078,\"id\":\"chatcmpl-7mhaNrV0dIva5XFK7MyfWHdF8adNC11g\",\"model\":\"\",\"object\":\"chat.completion.chunk\",\"usage\":{\"completion_tokens\":1028,\"prompt_tokens\":103282,\"total_tokens\":104310}}\n\n"

I can't see any problem here.

magikRUKKOLA avatar Nov 10 '25 14:11 magikRUKKOLA

I had to fight with screaming fans and go to work, but I just fired up ye old DeepSeek.

XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
INFO [                    init] initializing slots | tid="140059275628544" timestamp=1762822512 n_slots=1
INFO [                    init] new slot | tid="140059275628544" timestamp=1762822512 id_slot=0 n_ctx_slot=32768
INFO [                    main] model loaded | tid="140059275628544" timestamp=1762822512
INFO [                    main] chat template | tid="140059275628544" timestamp=1762822512 chat_template="{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='', is_first_sp=true, is_last_user=false) %}{%- for message in messages %}{%- if message['role'] == 'system' %}{%- if ns.is_first_sp %}{% set ns.system_prompt = ns.system_prompt + message['content'] %}{% set ns.is_first_sp = false %}{%- else %}{% set ns.system_prompt = ns.system_prompt + '\n\n' + message['content'] %}{%- endif %}{%- endif %}{%- endfor %}{{ bos_token }}{{ ns.system_prompt }}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{%- set ns.is_first = false -%}{%- set ns.is_last_user = true -%}{{'<|User|>' + message['content'] + '<|Assistant|>'}}{%- endif %}{%- if message['role'] == 'assistant' and message['tool_calls'] is defined and message['tool_calls'] is not none %}{%- set ns.is_last_user = false -%}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{%- endif %}{%- set ns.is_first = false %}{%- set ns.is_tool = false -%}{%- set ns.is_output_first = true %}{%- for tool in message['tool_calls'] %}{%- if not ns.is_first %}{%- if message['content'] is none %}{{'<|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- else %}{{message['content'] + '<|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- endif %}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- endif %}{%- endfor %}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- if message['role'] == 'assistant' and (message['tool_calls'] is not defined or message['tool_calls'] is none)%}{%- set ns.is_last_user = false -%}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{{content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_last_user = false -%}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_last_user and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}"
INFO [                    main] chat template | tid="140059275628544" timestamp=1762822512 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
INFO [                    main] HTTP server listening | tid="140059275628544" timestamp=1762822512 n_threads_http="95" port="8080" hostname="192.168.1.211"
INFO [            update_slots] all slots are idle | tid="140059275628544" timestamp=1762822512
INFO [      log_server_request] request | tid="139837113716736" timestamp=1762822530 remote_addr="192.168.1.199" remote_port=35736 status=200 method="GET" path="/v1/models" params={}
INFO [      log_server_request] request | tid="139837113716736" timestamp=1762822530 remote_addr="192.168.1.199" remote_port=35736 status=200 method="GET" path="/props" params={}
INFO [   launch_slot_with_task] slot is processing task | tid="140059275628544" timestamp=1762822533 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="140059275628544" timestamp=1762822533 id_slot=0 id_task=0 p0=0
INFO [           print_timings] prompt eval time     =   29687.35 ms /   873 tokens (   34.01 ms per token,    29.41 tokens per second) | tid="140059275628544" timestamp=1762822568 id_slot=0 id_task=0 t_prompt_processing=29687.349 n_prompt_tokens_processed=873 t_token=34.00612714776632 n_tokens_second=29.406465360042763
INFO [           print_timings] generation eval time =    5280.12 ms /    52 runs   (  101.54 ms per token,     9.85 tokens per second) | tid="140059275628544" timestamp=1762822568 id_slot=0 id_task=0 t_token_generation=5280.125 n_decoded=52 t_token=101.54086538461539 n_tokens_second=9.848251698586681
INFO [           print_timings]           total time =   34967.47 ms | tid="140059275628544" timestamp=1762822568 id_slot=0 id_task=0 t_prompt_processing=29687.349 t_token_generation=5280.125 t_total=34967.474
INFO [            update_slots] slot released | tid="140059275628544" timestamp=1762822568 id_slot=0 id_task=0 n_ctx=32768 n_past=924 n_system_tokens=0 n_cache_tokens=924 truncated=false
INFO [            update_slots] all slots are idle | tid="140059275628544" timestamp=1762822568
INFO [      log_server_request] request | tid="139837113716736" timestamp=1762822568 remote_addr="192.168.1.199" remote_port=35736 status=200 method="POST" path="/completion" params={}
INFO [            update_slots] all slots are idle | tid="140059275628544" timestamp=1762822568
INFO [      log_server_request] request | tid="139837113716736" timestamp=1762822568 remote_addr="192.168.1.199" remote_port=35736 status=200 method="POST" path="/tokenize" params={}
^CINFO [            update_slots] all slots are idle | tid="140059275628544" timestamp=1762822578

    -b 4096 \
    -amb 1024 \
    -mqkv \
    --verbose \
    -cuda fusion=1,offload-batch-size=4 \
    -ot "blk\.(6|7|8)\.ffn_.*(exps).=CUDA0" \

I'm shooting ChatML at it and the sent prompt is still invisible.

commit ef0b7caa8e5d451ee428c0d63c9f6c21d99831c8 (HEAD -> main)
Merge: ece76b27 489554bc
Author: Your Name <[email protected]>
Date:   Mon Nov 10 18:05:46 2025 -0600

    Merge branch 'main' of https://github.com/ikawrakow/ik_llama.cpp

commit 489554bc1126bcacd93fbf0b395d6c3062fe0d64 (origin/main, origin/HEAD)
Author: Iwan Kawrakow <[email protected]>
Date:   Mon Nov 10 17:38:46 2025 +0200

    Minor: remove unnecesssary calls to build_inp_out_ids (#935)
    
    Co-authored-by: Iwan Kawrakow <[email protected]>

commit ece76b27caf94a5cf21d7e97ab76f3d9f29c55c7

I also tried to run llama-bench on this model to find a good offload batch size and it segfaults. My prefill is actually over 100 t/s on big batches, but as you can see, not on a regular batch. Most prompts sent in chat are an initial ~2k tokens and then small back-and-forths.

Ph0rk0z avatar Nov 11 '25 00:11 Ph0rk0z

@Ph0rk0z

I'm shooting ChatML at it and the sent prompt is still invisible.

Wait. What prompt?

Are you saying that you cannot find the prompt you sent to the server in the app through which you sent it, and hence you want the server to log the prompt?

magikRUKKOLA avatar Nov 11 '25 00:11 magikRUKKOLA

Sure, I can find it in the front end, but how do I know it matches what is sent to the server? Maybe my template is wrong, or I put in a double BOS, or a million other things. Maybe the front end is broken and tells you one thing but sends another. It's also nice to see whether your requests are sending the correct sampling parameters, or whether the front end flubbed it and named one topp.
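As an aside, one of those checks can be done without any server-side logging; a minimal sketch for inspecting how an assembled prompt tokenizes, assuming the usual /tokenize payload with a "content" field and a made-up ChatML prefix:

    # tokenize the prompt text exactly as the front end assembles it and inspect the IDs;
    # a doubled BOS or a stray special token shows up as unexpected IDs at the start/end
    curl -s http://localhost:8080/tokenize \
      -d '{"content": "<|im_start|>system\nYou are a helpful assistant<|im_end|>\n"}' | jq .tokens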

Let me also spin you another example from the here and now. I'm using V3 with ChatML and it's returning

{message: here I am emotion: happy }

Yes, it's OOD, but I'd like to know whether that's because something is screwed up or because the model can't handle OOD. R1-0528 works just fine, or did last week when prompted this way. So I'm loading it from HDD to double-check that something else isn't broken, instead of, you know, just seeing what went to the server.

Ph0rk0z avatar Nov 11 '25 00:11 Ph0rk0z

@Ph0rk0z

I can find it in the front end, but how do I know it matches what is sent to the server?

Via the RPC, in JSON format?

magikRUKKOLA avatar Nov 11 '25 00:11 magikRUKKOLA

@Ph0rk0z

Moreover, if you export the conversation via the RPC, it would contain the seed parameter required for reproducible results. So next time something odd is going on, you will be able to reproduce the result and potentially debug it.
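For instance, a minimal replay sketch, assuming the standard /completion fields prompt, seed, and n_predict, with placeholder values:

    # re-send the exported prompt with the same seed to reproduce a generation
    curl -s http://localhost:8080/completion \
      -d '{"prompt": "<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n", "seed": 42, "n_predict": 64}' \
      | jq -r .content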

magikRUKKOLA avatar Nov 11 '25 00:11 magikRUKKOLA

Yeah, that's useful too, whichever way it happens: either the bare text prompt or the whole JSON request, like if you've ever used tabbyAPI with prompt logging enabled. For chat completions you at least get the prompt with special tokens applied.

Ph0rk0z avatar Nov 11 '25 00:11 Ph0rk0z

@Ph0rk0z

Either the bare text prompt or the whole JSON request.

sudo apt install curl jq glow
curl http://localhost:8080/slots | jq -r '.[].prompt' | glow

Huh?

magikRUKKOLA avatar Nov 11 '25 00:11 magikRUKKOLA

@Ph0rk0z

I just kinda don't see the point of additional flags when the whole prompt can be pulled from the server directly. Moreover, it's also possible to save and restore the dumps to cut down the prefill time when continuing a conversation, etc.

magikRUKKOLA avatar Nov 11 '25 00:11 magikRUKKOLA

Why would I use an external tool for this? Next thing you'll tell me to use a MITM proxy. Either way, it wasn't me; text completions are fully borked.

Image

Ph0rk0z avatar Nov 11 '25 00:11 Ph0rk0z

@Ph0rk0z

Why would I use an external tool for this?

What tool? The curl? It's totally optional. You could use netcat etc.
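For example, a throwaway listener to dump the raw request; a sketch, assuming an OpenBSD-style nc and a spare port such as 8081:

    # listen on a spare port and print whatever the client sends (headers plus JSON body)
    nc -l 8081          # traditional netcat wants: nc -l -p 8081
    # then temporarily point SillyTavern at http://<host>:8081 and send one message;
    # the request never reaches the model, but the raw POST is dumped to the terminal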

magikRUKKOLA avatar Nov 11 '25 01:11 magikRUKKOLA

The functionality is already in the server. With the verbose flag set, how come I don't get:

Image

without all this logit spam:

Image

That's all this was about. In 5 seconds I can see that DeepSeek is replying with a request for some reason, and I can troubleshoot from there what changed between the last few days and now.

Ph0rk0z avatar Nov 11 '25 01:11 Ph0rk0z

In case anyone else suddenly thinks they're crazy and is seeing JSON in their replies: https://github.com/SillyTavern/SillyTavern/commit/5837f2859977fcdf544f6ed69f67e6c13dc02b99. Just delete the {}.

Reading the logs, the server_verbose compile-time flag causes each streamed token to print to the console, the prompt to print twice, etc. The choice is between that or nothing, unless there is a verbosity level that can be set to show only the request. Hope the "why" makes more sense now.
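Until then, a crude workaround is to filter the verbose stream; a sketch, assuming the log tags shown earlier in this thread ("prompt tokenized", "launch_slot_with_task", "log_server_request") and with <other flags> as a placeholder:

    # keep the request and prompt lines, drop the per-token spam
    llama-server --verbose <other flags> 2>&1 \
      | grep --line-buffered -E 'prompt tokenized|launch_slot_with_task|log_server_request'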

Ph0rk0z avatar Nov 11 '25 02:11 Ph0rk0z