Expose llama.cpp's progress_callback to bindings
We could expose llama.cpp's progress_callback through the bindings, providing a way to both report progress and cancel model loading.
ref #1934
This may already be possible; see https://discord.com/channels/1076964370942267462/1100510109106450493/1214449811374342144
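For reference, a minimal sketch of how the backend could wire this up, assuming the llama.h API as of early 2024 (`llama_model_params.progress_callback`, where returning `false` aborts the load). The `g_cancel_load` flag and `load_model` wrapper are hypothetical names for illustration, not actual gpt4all code:

```cpp
#include <atomic>
#include <cstdio>

#include "llama.h"

// Hypothetical flag the bindings would flip to request cancellation.
static std::atomic<bool> g_cancel_load{false};

// Called periodically by llama.cpp while tensors are copied to RAM/VRAM.
// Returning false tells llama.cpp to abort the load immediately.
static bool on_load_progress(float progress, void * /*user_data*/) {
    std::fprintf(stderr, "loading: %3.0f%%\r", progress * 100.0f);
    return !g_cancel_load.load();
}

// Hypothetical wrapper; returns nullptr if loading failed or was cancelled.
static llama_model * load_model(const char * path) {
    llama_model_params params          = llama_model_default_params();
    params.progress_callback           = on_load_progress;
    params.progress_callback_user_data = nullptr;
    return llama_load_model_from_file(path, params);
}
```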
There is an important difference between three kinds of cancellation:

- **Canceling model loading** (copying tensors from disk to RAM/VRAM): this needs the progress callback.
- **Canceling prompt processing**: because we don't split our input to `llama_decode` into batches, the simplest way forward is the ggml graph abort callback.
- **Canceling token generation**: this is simple and already implemented in the backend, because we generate one token at a time.
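For the prompt-processing case, a similar sketch, assuming `llama_set_abort_callback` from llama.h: returning `true` from the callback aborts the running compute graph so `llama_decode` returns early (at the time of this issue, llama.h noted this only works with CPU execution). `g_cancel_decode` is again a hypothetical flag:

```cpp
#include <atomic>

#include "llama.h"

// Hypothetical flag the bindings would flip to cancel prompt processing.
static std::atomic<bool> g_cancel_decode{false};

// ggml polls this while evaluating the graph; returning true aborts the
// current computation, so llama_decode() returns early.
static bool should_abort(void * /*user_data*/) {
    return g_cancel_decode.load();
}

static void install_abort_callback(llama_context * ctx) {
    llama_set_abort_callback(ctx, should_abort, nullptr);
}
```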