bug: nitro CPU usage tail after visual end of token stream
Describe the bug
The nitro process keeps using CPU after inference is done.
Steps to reproduce
Steps to reproduce the behavior:
- Install Jan and download the Mistral Instruct 7B Q4 model.
- Do not use GPU acceleration; run in CPU mode.
- Create a new thread and run inference with the default parameters; even a simple prompt will do.
- Watch a system CPU monitor and notice that the CPU is still busy after the token stream has already ended (a monitoring sketch follows this list).
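To watch it precisely, I use a small script like the sketch below. It is just one way to observe the tail (any system monitor works too) and it assumes the process binary is named `nitro`, matching the process path listed under "Additional context":

```python
# watch_nitro.py -- minimal sketch: print the nitro process's CPU usage once
# per second. Assumes the binary is named "nitro" (see the process path under
# "Additional context"); adjust NAME if yours differs.
import time
import psutil

NAME = "nitro"

proc = None
for p in psutil.process_iter(["name"]):
    if p.info["name"] == NAME:
        proc = p
        break
if proc is None:
    raise SystemExit(f"no '{NAME}' process found -- is a model loaded?")

proc.cpu_percent(interval=0.1)  # prime: cpu_percent measures between calls
while True:
    pct = proc.cpu_percent(interval=1.0)
    print(f"{time.strftime('%H:%M:%S')}  nitro CPU: {pct:6.1f}%")
```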
Expected behavior
CPU usage should drop immediately after the response is fully received in the UI.
Environment details
- Operating System: Ubuntu 22.04
- Jan Version: 0.4.5-208 nightly (also reproduced on stable 0.4.5)
- Processor: AMD Ryzen 7 5700G
- RAM: 64GB
- Any additional relevant hardware specifics: no GPU used for inference, NVMe SSD used for storage
Logs
Typed a prompt in the input box and clicked "Send":
2024-01-29T15:22:24.925Z [NITRO]::Debug: 20240129 15:22:24.924886 UTC 41641 ERROR Grammar file not found - llamaCPP.cc:183
2024-01-29T15:22:24.925Z [NITRO]::Debug: 20240129 15:22:24.924905 UTC 41641 INFO Resolved request for task_id:0 - llamaCPP.cc:296
20240129 15:22:24.924923 UTC 41641 DEBUG [makeHeaderString] send stream with transfer-encoding chunked - HttpResponseImpl.cc:533
[1706541744] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 879][ launch_slot_with_data] slot 0 is processing [task id: 6]
[1706541744] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1733][ update_slots] slot 0 : kv cache rm - [0, end)
The assistant response is received completely and shown in the thread; the UI looks as if inference is done, but CPU usage is still high:
2024-01-29T15:22:35.328Z [NITRO]::Debug: [1706541755] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 472][ print_timings]
[1706541755] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 473][ print_timings] print_timings: prompt eval time = 1315.45 ms / 23 tokens ( 57.19 ms per token, 17.48 tokens per second)
[1706541755] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 478][ print_timings] print_timings: eval time = 9087.76 ms / 90 runs ( 100.98 ms per token, 9.90 tokens per second)
[1706541755] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 483][ print_timings] print_timings: total time = 10403.21 ms
[1706541755] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1596][ update_slots] slot 0 released (114 tokens in cache)
2024-01-29T15:22:35.328Z [NITRO]::Debug: 20240129 15:22:35.328345 UTC 41641 INFO reached result stop - llamaCPP.cc:345
[1706541755] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1596][ update_slots] slot 0 released (114 tokens in cache)
[1706541755] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 879][ launch_slot_with_data] slot 0 is processing [task id: 8]
20240129 15:22:35.328405 UTC 41641 INFO Connection closed or buffer is null. Reset context - llamaCPP.cc:309
[1706541755] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1733][ update_slots] slot 0 : kv cache rm - [0, end)
2024-01-29T15:22:36.161Z [NITRO]::Debug: [1706541756] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1462][ process_tasks] slot unavailable
CPU usage finally drops to normal:
2024-01-29T15:22:52.082Z [NITRO]::Debug: [1706541772] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 472][ print_timings]
[1706541772] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 473][ print_timings] print_timings: prompt eval time = 935.57 ms / 23 tokens ( 40.68 ms per token, 24.58 tokens per second)
[1706541772] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 478][ print_timings] print_timings: eval time = 15818.64 ms / 160 runs ( 98.87 ms per token, 10.11 tokens per second)
[1706541772] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 483][ print_timings] print_timings: total time = 16754.21 ms
[1706541772] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1596][ update_slots] slot 0 released (184 tokens in cache)
Additional context
- For this particular bug report I used the following prompt, which takes about 10 seconds to process:
List 3 item names which can be found in fridge at home.
- If I use a prompt that requires a longer assistant response and more processing time, the mysterious CPU usage tail also gets longer (proportionally, it seems); see the measurement sketch after this list.
- The process with high CPU usage is:
/home/zezic/jan/extensions/@janhq/inference-nitro-extension/dist/bin/linux-cpu/nitro 1 127.0.0.1 3928
- Right after inference is done (visually), while CPU usage is still high, it is not possible to receive new responses from the assistant in the same thread; I have to wait for CPU usage to drop back to normal before sending the next request.
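In case it helps triage, here is a rough way to quantify that tail. This is a minimal sketch, not a definitive harness: the endpoint path (`/inferences/llamacpp/chat_completion`), the request body, and the 10% idle threshold are my assumptions about nitro's local API on port 3928 and may need adjusting:

```python
# tail_timer.py -- hedged sketch: measure how long nitro keeps burning CPU
# after the token stream ends. Endpoint path and payload below are
# assumptions about nitro's local API; adjust if they differ.
import time
import psutil
import requests

URL = "http://127.0.0.1:3928/inferences/llamacpp/chat_completion"  # assumed path
IDLE_THRESHOLD = 10.0  # percent; below this nitro is considered idle

procs = [p for p in psutil.process_iter(["name"]) if p.info["name"] == "nitro"]
if not procs:
    raise SystemExit("nitro process not found -- is a model loaded?")
proc = procs[0]
proc.cpu_percent(interval=0.1)  # prime the measurement window

# Stream a completion so we know exactly when the last token arrives.
payload = {
    "messages": [{"role": "user",
                  "content": "List 3 item names which can be found in fridge at home."}],
    "stream": True,
}
with requests.post(URL, json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for _ in resp.iter_lines():
        pass  # drain the stream; we only care about when it ends
stream_end = time.monotonic()

# Poll until CPU usage falls back under the idle threshold.
while proc.cpu_percent(interval=1.0) > IDLE_THRESHOLD:
    pass
print(f"CPU tail after stream end: {time.monotonic() - stream_end:.1f} s")
```

Draining the stream gives an exact timestamp for the visual end of the token stream, so the measured tail matches what I see in the UI.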
Similar experience here: even after closing Jan entirely, the tail just goes on and on until I kill the nitro process.
This could be because nitro is still handling prompt processing.
The same issue appears on macOS with GPU acceleration. After the response to the prompt is finished, Jan/Nitro continues to use the full GPU, as if it were emptying a cache or something. No similar behavior occurs with llama.cpp or Ollama, for example.
Hi @lineality, this issue is most likely resolved in our latest nightly build, 0.4.6-283. Can you double-check? Many thanks.
I'm closing this for the time being because there has been no update from users. Feel free to reopen if the issue persists.
Please reopen. I'm facing high CPU usage in the nitro process after changing models on v0.4.12.