Diego Devesa
I have added the bug tag that will prevent the bot from closing the issue. Pointing at the specific PRs that introduced a regression would improve the chances of this...
You may be able to get it to run by increasing `LLAMA_MAX_NODES` in `llama.cpp`.
If you do, you would also need to increase `GGML_SCHED_MAX_SPLITS`. Alternatively, using a build without GPU acceleration would also work.
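For reference, both of these are compile-time constants, so raising them means editing the source and rebuilding. A sketch of the change, assuming `LLAMA_MAX_NODES` is defined in `llama.cpp` and `GGML_SCHED_MAX_SPLITS` in the ggml backend sources (the exact default values and file locations vary by version, so check your checkout first):

```cpp
// In llama.cpp — value is illustrative, not the upstream default:
#define LLAMA_MAX_NODES       16384   // raised so larger graphs fit

// In the ggml backend source — must grow along with the node limit:
#define GGML_SCHED_MAX_SPLITS 2048
```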
The issue with 2 expert models should be fixed in #6735.
> If you don't know how many layers there are, you can use -1 to move all to GPU.

That's not the case in the llama.cpp C API.
> I'd like to know what standards should be met before merge this PR? Can this PR be merged first and then continue to fix the above problems? I can...
You can also make an operation run on the CPU by returning `false` from `supports_op`.
There should still be some limit to avoid getting into an infinite loop in the server.
When this happens, the response of `/completion` has these fields:

```json
"truncated": true,
"stopped_eos": false,
"stopped_word": false,
"stopped_limit": false,
```

I am not familiar with the meaning of each of...
Maybe it would be simpler to set `n_predict` to `n_ctx_train` by default if not set in the request.
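A sketch of that defaulting logic, assuming the server's convention that a negative `n_predict` means the request did not set a limit (the helper name is hypothetical, not the actual server code):

```cpp
#include <cstdint>

// Hypothetical helper: choose the effective prediction limit for a
// /completion request. n_predict < 0 means "not set in the request",
// in which case we cap generation at the model's training context size
// so the server cannot loop forever.
static int32_t effective_n_predict(int32_t n_predict, int32_t n_ctx_train) {
    return n_predict < 0 ? n_ctx_train : n_predict;
}
```

With this, an unset request falls back to the training context length, while an explicit value is respected as-is.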