LLamaSharp
LoadState() not restoring context when using CUDA backend?
I'm implementing a Save State and Load State system to pick up past conversations. I did all my development on CPU and only recently got a GPU to try it with. On CPU everything works as expected: I save the context to disk, later reload it, and send the new messages. It respects the past context and provides a response.
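Roughly, the flow looks like the sketch below. It's a minimal, simplified version: the path and prompts are placeholders, and the executor's own bookkeeping state is omitted for brevity.

```csharp
using System;
using System.Collections.Generic;
using LLama;
using LLama.Common;

// Load the model and create a context. GpuLayerCount > 0 selects the CUDA
// backend; with GpuLayerCount = 0 (CPU) the save/load cycle works correctly.
var parameters = new ModelParams("path/to/model.gguf") // placeholder path
{
    ContextSize = 2048,
    GpuLayerCount = 32
};
using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
var executor = new InteractiveExecutor(context);
var inferenceParams = new InferenceParams
{
    MaxTokens = 128,
    AntiPrompts = new List<string> { "User:" }
};

// First message, then persist the context to disk.
await foreach (var token in executor.InferAsync("User: Hi, my name is Alice.\nAssistant:", inferenceParams))
    Console.Write(token);
context.SaveState("conversation.state");

// Later (in my app the model is fully unloaded and reloaded in between):
// restore the saved context and continue the conversation.
context.LoadState("conversation.state");
await foreach (var token in executor.InferAsync("User: What is my name?\nAssistant:", inferenceParams))
    Console.Write(token); // CPU: coherent, context-aware reply; CUDA: nonsense.
```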
With the GPU, though, I came across what looks like a bug in the context restoration: restoring the state does not appear to respect the previous context.
Some examples, using the latest version 0.9.1: one with the CUDA backend, and one running the same code with saving and reloading state between messages (the model is completely unloaded and reloaded) but with the CPU backend instead of CUDA.
I tried this more than a few times, and with CUDA it always produces nonsense after the second message, which is the point at which the state is reloaded.
I discovered this bug also exists in llama.cpp, and the workaround is to disable GPU KV cache offloading.
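For anyone hitting the same thing: in llama.cpp the relevant context parameter is `offload_kqv` (exposed on the CLI as `--no-kv-offload`). I'm not sure how the current LLamaSharp release surfaces this setting, so the property name in the sketch below is an assumption rather than a confirmed API; it shows the shape of the workaround, not a definitive implementation.

```csharp
using LLama;
using LLama.Common;

var parameters = new ModelParams("path/to/model.gguf") // placeholder path
{
    ContextSize = 2048,
    GpuLayerCount = 32, // model layers can still run on the GPU...
    // ...but keep the KV cache in host memory. NOTE: `NoKqvOffload` is an
    // ASSUMED property name mirroring llama.cpp's `offload_kqv = false`
    // (CLI: --no-kv-offload); check your LLamaSharp version for the real name.
    NoKqvOffload = true
};
using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
// With KV offloading disabled, SaveState()/LoadState() should round-trip correctly.
```

The trade-off is presumably some generation speed, since KV cache reads and writes stay in host memory, in exchange for state that can be saved and restored faithfully.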
@elgatopanzon thanks for reporting that, saves us a lot of work investigating it! Do you happen to have a link to an upstream issue tracking this bug?
Here is the upstream issue I created: https://github.com/ggerganov/llama.cpp/issues/4865