Daniel Hiltgen
@Wanhack others have reported that running `nvidia-modprobe -u` on your host may resolve the issue (might require a reboot)
I've updated the description above to better describe how this works. There are two layers of concurrency introduced by this change. One layer leverages the parallelism support in llama.cpp...
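The two layers can be sketched roughly like this. This is a minimal illustration under my own assumptions, not the actual implementation — every name here (`runner`, `scheduler`, `handle`, etc.) is made up. The inner layer limits parallel request slots within one loaded model; the outer layer limits how many models are resident at once.

```go
package main

import "fmt"

// runner represents one loaded model. Its buffered channel acts as a
// counting semaphore limiting parallel requests to that model
// (the inner concurrency layer, backed by llama.cpp parallelism).
type runner struct {
	model string
	slots chan struct{}
}

func newRunner(model string, parallel int) *runner {
	return &runner{model: model, slots: make(chan struct{}, parallel)}
}

func (r *runner) handle(prompt string) string {
	r.slots <- struct{}{}        // take a slot (blocks when all are busy)
	defer func() { <-r.slots }() // free it when the request completes
	return fmt.Sprintf("completed %q on %s", prompt, r.model)
}

// scheduler caps how many models are loaded at once
// (the outer concurrency layer).
type scheduler struct {
	maxLoaded int
	loaded    map[string]*runner
}

func (s *scheduler) get(model string, parallel int) (*runner, bool) {
	if r, ok := s.loaded[model]; ok {
		return r, true
	}
	if len(s.loaded) >= s.maxLoaded {
		return nil, false // the real server would queue or evict here
	}
	r := newRunner(model, parallel)
	s.loaded[model] = r
	return r, true
}

func main() {
	s := &scheduler{maxLoaded: 1, loaded: map[string]*runner{}}
	r, _ := s.get("llama3", 2)
	fmt.Println(r.handle("hello"))
	_, ok := s.get("mistral", 2)
	fmt.Println("second model loaded:", ok) // false: at the loaded-model limit
}
```

In the real server the outer layer also has to account for available VRAM when deciding whether another model fits, which this toy ignores.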
@alexander-potemkin we don't currently have any limit on the number of client connections. I don't believe we've wired up any sort of expiration/timeout setting on the server side, although...
A slight correction to the above: I just updated the implementation for concurrent requests to a single model to use a semaphore package that services blocked requests in FIFO order, so that...
Note for people following along: I've adjusted the defaults so this PR now mimics the current behavior of a single request at a time, and a single model at a time,...
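For anyone testing this branch, opting back into concurrency might look something like the following. The variable names are my reading of the settings discussed in this PR — check the PR description for the authoritative names and defaults:

```shell
# Hypothetical opt-in: allow 4 parallel requests per loaded model and
# up to 2 models resident at once. The defaults preserve today's
# behavior: one request at a time, one model at a time.
export OLLAMA_NUM_PARALLEL=4
export OLLAMA_MAX_LOADED_MODELS=2
ollama serve
```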
@artem-zinnatullin thanks for giving it a try! We've been making minor fixes to the memory prediction on main, which I've been rebasing into this PR. I've got a Windows test...
I wasn't able to reproduce.

```
% system_profiler SPSoftwareDataType SPHardwareDataType
Software:

    System Software Overview:

      System Version: macOS 10.15.7 (19H2026)
      Kernel Version: Darwin 19.6.0
      Boot Volume: ssd
      Boot Mode: Normal

Computer...
```
If you're still having trouble, please let us know.
Unfortunately it looks like your server is crashing. Can you share your server.log so we can see why?
@liquorLiu that log doesn't seem to contain a crash or any error messages. Let's take a different approach to understand what's going wrong. Please Quit the tray app,...