eval-dev-quality

Preload/Unload Ollama models before prompting

Munsio opened this issue 1 year ago · 5 comments

For better measurements, we need to preload the Ollama model before prompting it. We also need to clean up (unload) afterwards.

Tasks:

  • [x] Check how to preload models - https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-pre-load-a-model-to-get-faster-response-times
  • [x] Check how to unload models - https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-keep-a-model-loaded-in-memory-or-make-it-unload-immediately
  • [x] Check how to query if a model is loaded - https://www.reddit.com/r/ollama/comments/1cex92f/possible_to_show_currently_loaded_models_via_api/
    • the newest version has `ollama ps` to list all loaded models
    • however, if we use an empty prompt request to trigger model preloading, we can be sure that once the API answers this request, the model is indeed loaded (see the discussion below)
  • [x] Implement it into the evaluation run to preload ollama models when they should be used
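
Per the Ollama FAQ links above, both preloading and unloading can go through a prompt-less `/api/generate` request: sending only the model name loads it, and `"keep_alive": 0` evicts it immediately. A minimal Go sketch of that idea (the host, model names, and helper names are illustrative, not part of the codebase):

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// requestGenerate posts a prompt-less /api/generate request. With no prompt,
// Ollama only (un)loads the model and answers once that is done.
func requestGenerate(host, payload string) error {
	resp, err := http.Post(host+"/api/generate", "application/json", strings.NewReader(payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}

	return nil
}

// preloadModel loads a model into memory before the actual evaluation prompts.
func preloadModel(host, model string) error {
	return requestGenerate(host, fmt.Sprintf(`{"model":%q}`, model))
}

// unloadModel evicts a model immediately by setting "keep_alive" to zero.
func unloadModel(host, model string) error {
	return requestGenerate(host, fmt.Sprintf(`{"model":%q,"keep_alive":0}`, model))
}
```

Against a local server, `preloadModel("http://localhost:11434", "qwen:0.5b")` would block until the model is in memory, and `unloadModel` with the same arguments would free it again.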

Munsio · May 15 '24 09:05

So the problem is to know when the preloading process has finished, right? An empty request starts the preload... Could it be that this request is completed once the model is finished loading?

bauersimon · May 15 '24 10:05

> So the problem is to know when the preloading process has finished, right? An empty request starts the preload... Could it be that this request is completed once the model is finished loading?

Would say yes. If the model answers, it is loaded.

zimmski · May 15 '24 10:05

@bauersimon @zimmski updated the description on "check if model is loaded"

Munsio · May 15 '24 10:05

I meant that I believe the API only completes the request once the model is loaded (I think it's happening here). So there is no need for the artificial "respond with y" query. This is easy to verify with a server, curl, and two differently sized models: if the response to an empty request is consistently slower for the bigger model, the response must only happen once loading is done.
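
That timing check can also be scripted. A hedged sketch, assuming a locally running Ollama server and the standard `/api/generate` endpoint (the function name is made up for illustration):

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"time"
)

// timeEmptyRequest measures how long a prompt-less /api/generate request
// takes to complete. If the API only responds after the model is loaded,
// a bigger model should consistently take longer here.
func timeEmptyRequest(host, model string) (time.Duration, error) {
	payload := fmt.Sprintf(`{"model":%q}`, model)

	start := time.Now()
	resp, err := http.Post(host+"/api/generate", "application/json", strings.NewReader(payload))
	if err != nil {
		return 0, err
	}
	resp.Body.Close()

	return time.Since(start), nil
}
```

Calling it for `qwen:0.5b` and `qwen:4b` against `http://localhost:11434` and comparing the two durations reproduces the curl experiment described in this thread.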

bauersimon · May 15 '24 12:05

Checked again and now I am 100% certain we don't need to send a dummy prompt as the API also responds to an empty request only after the model is loaded. The Ollama API has a response property load_duration and while that property is not returned on an empty request, we can see here how it is computed. And indeed the checkpointLoaded := time.Now() happens directly after the special case where the empty prompt request is handled. Hence, when the empty prompt request is answered, the model is already loaded.

Also tried this with some curl requests: for qwen:0.5b, the empty request consistently took ~2000ms (plus/minus a few ms) to complete, and for qwen:4b it consistently took ~2300ms.

In comparison, asking qwen:0.5b to "respond with y" took 4000ms... so double... probably even worse for larger models.

bauersimon · May 15 '24 13:05

@Munsio / @ruiAzevedo19 now that #121 is merged, is this issue done? Or did you encounter anything else we need to take a look at?

bauersimon · May 27 '24 07:05