bug: nitro CPU usage tail after visual end of token stream
Describe the bug
The nitro process keeps using CPU after inference is done.
Steps to reproduce
Steps to reproduce the behavior:
- Install Jan and download the Mistral Instruct 7B Q4 model.
- Do not use GPU acceleration; run in CPU mode.
- Create a new thread and run inference with the default parameters; even a simple prompt will do.
- Watch a system CPU monitor and notice that the CPU is still busy after the token stream has already ended (a monitoring sketch follows this list).
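To watch it precisely, I use a small script like the sketch below. It is just one way to observe the tail (any system monitor works too) and it assumes the process binary is named `nitro`, matching the process path listed under "Additional context":

```python
# watch_nitro.py -- minimal sketch: print the nitro process's CPU usage once
# per second. Assumes the binary is named "nitro" (see the process path under
# "Additional context"); adjust NAME if yours differs.
import time
import psutil

NAME = "nitro"

proc = None
for p in psutil.process_iter(["name"]):
    if p.info["name"] == NAME:
        proc = p
        break
if proc is None:
    raise SystemExit(f"no '{NAME}' process found -- is a model loaded?")

proc.cpu_percent(interval=0.1)  # prime: cpu_percent measures between calls
while True:
    pct = proc.cpu_percent(interval=1.0)
    print(f"{time.strftime('%H:%M:%S')}  nitro CPU: {pct:6.1f}%")
```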
Expected behavior
CPU usage should drop immediately after the response is fully received in the UI.
Environment details
- Operating System: Ubuntu 22.04
- Jan Version: 0.4.5-208 nightly (also reproduced on stable 0.4.5)
- Processor: AMD Ryzen 7 5700G
- RAM: 64GB
- Any additional relevant hardware specifics: no GPU used for inference, NVMe SSD used for storage
Logs
Typed a prompt in the input box and clicked "Send":
2024-01-29T15:22:24.925Z [NITRO]::Debug: 20240129 15:22:24.924886 UTC 41641 ERROR Grammar file not found - llamaCPP.cc:183
2024-01-29T15:22:24.925Z [NITRO]::Debug: 20240129 15:22:24.924905 UTC 41641 INFO Resolved request for task_id:0 - llamaCPP.cc:296
20240129 15:22:24.924923 UTC 41641 DEBUG [makeHeaderString] send stream with transfer-encoding chunked - HttpResponseImpl.cc:533
[1706541744] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 879][ launch_slot_with_data] slot 0 is processing [task id: 6]
[1706541744] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1733][ update_slots] slot 0 : kv cache rm - [0, end)
The assistant response is received completely and shown in the thread; the UI looks as if inference is done, but CPU usage is still high:
2024-01-29T15:22:35.328Z [NITRO]::Debug: [1706541755] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 472][ print_timings]
[1706541755] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 473][ print_timings] print_timings: prompt eval time = 1315.45 ms / 23 tokens ( 57.19 ms per token, 17.48 tokens per second)
[1706541755] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 478][ print_timings] print_timings: eval time = 9087.76 ms / 90 runs ( 100.98 ms per token, 9.90 tokens per second)
[1706541755] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 483][ print_timings] print_timings: total time = 10403.21 ms
[1706541755] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1596][ update_slots] slot 0 released (114 tokens in cache)
2024-01-29T15:22:35.328Z [NITRO]::Debug: 20240129 15:22:35.328345 UTC 41641 INFO reached result stop - llamaCPP.cc:345
[1706541755] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1596][ update_slots] slot 0 released (114 tokens in cache)
[1706541755] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 879][ launch_slot_with_data] slot 0 is processing [task id: 8]
20240129 15:22:35.328405 UTC 41641 INFO Connection closed or buffer is null. Reset context - llamaCPP.cc:309
[1706541755] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1733][ update_slots] slot 0 : kv cache rm - [0, end)
2024-01-29T15:22:36.161Z [NITRO]::Debug: [1706541756] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1462][ process_tasks] slot unavailable
CPU usage finally drops to normal:
2024-01-29T15:22:52.082Z [NITRO]::Debug: [1706541772] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 472][ print_timings]
[1706541772] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 473][ print_timings] print_timings: prompt eval time = 935.57 ms / 23 tokens ( 40.68 ms per token, 24.58 tokens per second)
[1706541772] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 478][ print_timings] print_timings: eval time = 15818.64 ms / 160 runs ( 98.87 ms per token, 10.11 tokens per second)
[1706541772] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 483][ print_timings] print_timings: total time = 16754.21 ms
[1706541772] [/home/runner/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1596][ update_slots] slot 0 released (184 tokens in cache)
Additional context
- For this particular bug report I used the following prompt, which takes about 10 seconds to process:
List 3 item names which can be found in fridge at home.
- If I use a prompt that requires a longer assistant response and more processing time, the mysterious CPU usage tail also gets longer (proportionally, it seems); see the measurement sketch after this list.
- The process with high CPU usage is:
/home/zezic/jan/extensions/@janhq/inference-nitro-extension/dist/bin/linux-cpu/nitro 1 127.0.0.1 3928
- Right after inference is done (visually), while CPU usage is still high, it is not possible to receive new responses from the assistant in the same thread; I have to wait for CPU usage to drop back to normal before sending the next request.
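In case it helps triage, here is a rough way to quantify that tail. This is a minimal sketch, not a definitive harness: the endpoint path (`/inferences/llamacpp/chat_completion`), the request body, and the 10% idle threshold are my assumptions about nitro's local API on port 3928 and may need adjusting:

```python
# tail_timer.py -- hedged sketch: measure how long nitro keeps burning CPU
# after the token stream ends. Endpoint path and payload below are
# assumptions about nitro's local API; adjust if they differ.
import time
import psutil
import requests

URL = "http://127.0.0.1:3928/inferences/llamacpp/chat_completion"  # assumed path
IDLE_THRESHOLD = 10.0  # percent; below this nitro is considered idle

procs = [p for p in psutil.process_iter(["name"]) if p.info["name"] == "nitro"]
if not procs:
    raise SystemExit("nitro process not found -- is a model loaded?")
proc = procs[0]
proc.cpu_percent(interval=0.1)  # prime the measurement window

# Stream a completion so we know exactly when the last token arrives.
payload = {
    "messages": [{"role": "user",
                  "content": "List 3 item names which can be found in fridge at home."}],
    "stream": True,
}
with requests.post(URL, json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for _ in resp.iter_lines():
        pass  # drain the stream; we only care about when it ends
stream_end = time.monotonic()

# Poll until CPU usage falls back under the idle threshold.
while proc.cpu_percent(interval=1.0) > IDLE_THRESHOLD:
    pass
print(f"CPU tail after stream end: {time.monotonic() - stream_end:.1f} s")
```

Draining the stream gives an exact timestamp for the visual end of the token stream, so the measured tail matches what I see in the UI.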
Similar experience here: even after closing Jan entirely, the tail just goes on and on until I kill the nitro process.
This could be because nitro is still handling prompt processing.
The same issue appears on macOS with GPU acceleration. After the response to the prompt is finished, Jan/Nitro continues to use the full GPU, as if it were emptying a cache or something. No similar behavior occurs with llama.cpp or Ollama, for example.
Hi @lineality, this issue is most likely resolved in our latest nightly build, 0.4.6-283. Can you double-check? Many thanks.
I'm closing this for the time being because there has been no update from users. Feel free to reopen if the issue persists.
Please reopen. I'm facing high CPU usage in the nitro process after changing models on v0.4.12.