LLamaSharp
LoadState() not restoring context when using CUDA backend?
I'm implementing a Save State and Load State system to pick up past conversations. I did all my development on CPU and only recently got a GPU to try it with. On CPU everything works as expected: I save the context to disk, later reload it, and send the new messages. It respects the past context and provides a response.
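Roughly, the flow looks like the sketch below. It's a minimal, simplified version: the path and prompts are placeholders, and the executor's own bookkeeping state is omitted for brevity.

```csharp
using System;
using System.Collections.Generic;
using LLama;
using LLama.Common;

// Load the model and create a context. GpuLayerCount > 0 selects the CUDA
// backend; with GpuLayerCount = 0 (CPU) the save/load cycle works correctly.
var parameters = new ModelParams("path/to/model.gguf") // placeholder path
{
    ContextSize = 2048,
    GpuLayerCount = 32
};
using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
var executor = new InteractiveExecutor(context);
var inferenceParams = new InferenceParams
{
    MaxTokens = 128,
    AntiPrompts = new List<string> { "User:" }
};

// First message, then persist the context to disk.
await foreach (var token in executor.InferAsync("User: Hi, my name is Alice.\nAssistant:", inferenceParams))
    Console.Write(token);
context.SaveState("conversation.state");

// Later (in my app the model is fully unloaded and reloaded in between):
// restore the saved context and continue the conversation.
context.LoadState("conversation.state");
await foreach (var token in executor.InferAsync("User: What is my name?\nAssistant:", inferenceParams))
    Console.Write(token); // CPU: coherent, context-aware reply; CUDA: nonsense.
```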
With the GPU, though, I came across what looks like a bug in the context restoration: restoring the state does not appear to respect the previous context.
Some examples, using the latest version 0.9.1: one with the CUDA backend, and one running the same code with saving and reloading state between messages (the model is completely unloaded and reloaded) but with the CPU backend instead of CUDA.
I tried this more than a few times, and with CUDA it always produces nonsense after the second message, which is the point at which the state is reloaded.
I discovered this bug also exists in llama.cpp, and the workaround is to disable GPU KV cache offloading.
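For anyone hitting the same thing: in llama.cpp the relevant context parameter is `offload_kqv` (exposed on the CLI as `--no-kv-offload`). I'm not sure how the current LLamaSharp release surfaces this setting, so the property name in the sketch below is an assumption rather than a confirmed API; it shows the shape of the workaround, not a definitive implementation.

```csharp
using LLama;
using LLama.Common;

var parameters = new ModelParams("path/to/model.gguf") // placeholder path
{
    ContextSize = 2048,
    GpuLayerCount = 32, // model layers can still run on the GPU...
    // ...but keep the KV cache in host memory. NOTE: `NoKqvOffload` is an
    // ASSUMED property name mirroring llama.cpp's `offload_kqv = false`
    // (CLI: --no-kv-offload); check your LLamaSharp version for the real name.
    NoKqvOffload = true
};
using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
// With KV offloading disabled, SaveState()/LoadState() should round-trip correctly.
```

The trade-off is presumably some generation speed, since KV cache reads and writes stay in host memory, in exchange for state that can be saved and restored faithfully.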
@elgatopanzon thanks for reporting that, saves us a lot of work investigating it! Do you happen to have a link to an upstream issue tracking this bug?
Here is the upstream issue I created: https://github.com/ggerganov/llama.cpp/issues/4865