
Support ollama's new `keep_alive` request parameter to prevent model unloading

Open noahhaon opened this issue 1 year ago • 3 comments

Validations

  • [X] I believe this is a way to improve. I'll try to join the Continue Discord for questions
  • [X] I'm not able to find an open issue that requests the same enhancement

Problem

As a user of Continue with the ollama LLM serving backend, I frequently experience long delays in responses in my VSCode/Continue workflow because ollama unloads the model and weights after 5 minutes by default. Ollama recently added support for a keep_alive request parameter, which can prevent unloading or make the model's in-memory persistence configurable.

Solution

Please add support for configuring the keep_alive parameter and including it in inference requests sent to the ollama backend.

The parameter was added in ollama 0.1.23 via this merged pull request:

https://github.com/ollama/ollama/pull/2146
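For reference, a request using the new parameter might look like the sketch below. It targets ollama's `POST /api/generate` endpoint; the helper function and model name are illustrative, not part of Continue or ollama.

```python
import json

def build_generate_payload(model: str, prompt: str, keep_alive="5m") -> dict:
    """Build a request body for ollama's POST /api/generate endpoint.

    keep_alive accepts a duration string such as "30m", a number of
    seconds, 0 to unload immediately after the response, or -1 to keep
    the model in memory indefinitely.
    """
    return {"model": model, "prompt": prompt, "keep_alive": keep_alive}

# Keep the model loaded indefinitely (the use case described above).
payload = build_generate_payload("codellama:7b", "def fib(n):", keep_alive=-1)
print(json.dumps(payload))
```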

noahhaon avatar Jan 29 '24 08:01 noahhaon

Love the idea! Working on tab-autocomplete release probably for the next couple of days, but will add this later this week

sestinj avatar Jan 30 '24 19:01 sestinj

@noahhaon out of curiosity, what would you plan on setting this to? I am going to set a higher default even if the user doesn't configure it, and if that would be enough for your case, I'd prefer to avoid extra configuration where it isn't necessary. I was thinking 30min as the default.

Either way will add this default today and if config is necessary, then will do that later

changes: https://github.com/continuedev/continue/commit/2fab0e2cdaf6e1314c7da270ce69a0f4a473e8d7

sestinj avatar Jan 30 '24 19:01 sestinj

30 minutes is definitely better than 5, but honestly my use case is probably not the typical one. I have an RTX 3090, so plenty of VRAM to keep a 6-7b model loaded all the time. The ollama devs justified their default of 5 minutes as a way to avoid hogging VRAM on smaller GPUs, which is probably the more common case.

For me, I would prefer to disable unloading entirely (keep_alive: -1), since I can just stop the ollama docker container when I'm done with my development session. I imagine it would be useful to configure this per-model, since the size of the model is probably the biggest factor affecting this setting. For example, just exposing this as a configuration option on ModelDescription in config.json would be super useful for me.

Thanks for your work on this extension, it's been great watching its development and has been fun to use!

noahhaon avatar Jan 31 '24 07:01 noahhaon

Joining @noahhaon here to say that, I too have plenty of vram sitting around idle sipping piña coladas, and would love to make better use of it! 🍹

Having keep_alive: -1 or a similar config parameter that allows keeping the model loaded for at least 6-8 hours to cover a working day would be amazing. Otherwise, going to a 30min meeting and coming back to the desk means a much longer wait for the model to load. It would improve Continue's responsiveness by a very large margin.

— btw, thanks a lot for making continue @sestinj! This is a total game changer, especially for coders who have businesses or clients etc that are more privacy-sensitive. 🙏🏻

johnozbay avatar Mar 04 '24 13:03 johnozbay

@noahhaon @johnozbay I can't believe it took me this long to make such a simple commit, but here it is!: https://github.com/continuedev/continue/commit/2614465db72eee89954a28c16546c46a1014fe57. Will be in next pre-release

I was a bit concerned about adding tons of parameters, but to be honest this one feels pretty fundamental. In the future if there are any others you want that we don't support, we also recently added requestOptions.body to config.json, so you can add arbitrary params to the HTTP POST request body
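For anyone landing here later, the arbitrary-params escape hatch mentioned above might look roughly like this in config.json. The exact schema is Continue's; treat the field placement here as an approximation rather than a verified snippet:

```json
{
  "models": [
    {
      "title": "Ollama (persistent)",
      "provider": "ollama",
      "model": "codellama:7b",
      "requestOptions": {
        "body": { "keep_alive": -1 }
      }
    }
  ]
}
```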

Now we can all sip piña coladas by the pool while the GPUs/CPUs are writing our code : )

sestinj avatar Mar 07 '24 00:03 sestinj

@sestinj You're the legend! Thank you so much! Also, love the requestOptions addition idea, will definitely help a lot of users!

johnozbay avatar Mar 07 '24 11:03 johnozbay

😍

Thank you @sestinj ! I thought it would be about as simple as your commit when I first looked through the code, but I didn't want to assume 😅 This will be very valuable, and give some nice forward compatibility as ollama develops too. Thanks!!

noahhaon avatar Mar 07 '24 12:03 noahhaon

Note that this parameter is currently not supported by ollama's OpenAI-compatible API. I found this issue while adding the keep_alive parameter to the ChatGPTBox project. I am considering deprecating the OpenAI-compatible interface and calling ollama's native API directly.
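To illustrate the gap described above: ollama's native /api/chat endpoint honors a top-level keep_alive field, while the OpenAI-compatible /v1/chat/completions schema has no such field, so the parameter is dropped. A minimal sketch of the two request bodies (endpoint behavior as reported in this comment, not re-verified here):

```python
import json

messages = [{"role": "user", "content": "Write a haiku about VRAM."}]

# Native ollama /api/chat: keep_alive is a top-level request field.
native_body = {
    "model": "llama3",
    "messages": messages,
    "keep_alive": -1,  # keep the model loaded indefinitely
}

# OpenAI-compatible /v1/chat/completions: no keep_alive in the schema,
# so including it has no effect on model unloading.
openai_compat_body = {
    "model": "llama3",
    "messages": messages,
}

print(json.dumps(sorted(native_body)))
```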

lzskyline avatar Jul 10 '24 07:07 lzskyline