eval-dev-quality
Follow up: Ollama Support
- [x] Introduce an "ID" method to the tool interface (like we have for model and provider) so the tools can be addressed deterministically instead of using the BinaryName method, which depends on the OS
- [ ] Allow filtering by tool IDs in the `install-tools` command and add the following test for all OSes (currently we only test Linux there):

      validate(t, &testCase{
          Name: "Filtered",
          Arguments: []string{"symflower"},
          ExpectedInstalledToolNames: []string{
              "symflower" + osutil.BinaryExtension(),
          },
      })
- [x] Make Ollama version dependent. We want to require a minimum version like we do with Symflower. There is surely a lot of code that we can share. (The latest version is usually also faster!)
- [ ] How can we integration-test Ollama in the CI? Is there a "dummy" model that always does the same thing?
  - https://github.com/ollama/ollama/blob/1b0e6c9c0e5d53aa6110530da0befab7c95d1755/integration/llm_test.go
  - https://github.com/ollama/ollama/issues/4196
- [x] use random ports for testing to avoid the synchronization of a single Ollama instance
- [x] run models that are not pulled yet
- [x] query available models https://github.com/ollama/ollama/issues/3922
- [x] download selected models before the evaluation starts
- [ ] better integration testing
  - [ ] we currently only test that a small model runs without erroring, but it would be nicer to have something deterministic https://github.com/ollama/ollama/issues/4196
- [x] allow customizing the Ollama server port (and host?) and remove the workaround that restricts us to running only one Ollama-dependent test at a time
- [x] comment why we have a wait delay in the exec util
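
For reference, the "random ports for testing" item can be sketched roughly as follows. This is a minimal illustration, not the code used in this repository: `freePort` and `ollamaHostEnv` are hypothetical helper names, and the sketch assumes the test server is bound through Ollama's `OLLAMA_HOST` environment variable.

```go
package main

import (
	"fmt"
	"net"
)

// freePort asks the OS for an unused TCP port on the loopback interface by
// listening on port 0 and reading back the port that was assigned.
// (Hypothetical helper, not taken from the repository.)
func freePort() (int, error) {
	listener, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	// Release the port again so that the server under test can bind to it.
	defer listener.Close()

	return listener.Addr().(*net.TCPAddr).Port, nil
}

// ollamaHostEnv formats the environment entry that would point an Ollama
// server at the given loopback port.
// (Hypothetical helper; assumes the server is configured via OLLAMA_HOST.)
func ollamaHostEnv(port int) string {
	return fmt.Sprintf("OLLAMA_HOST=127.0.0.1:%d", port)
}

func main() {
	port, err := freePort()
	if err != nil {
		panic(err)
	}

	// Each test would pass this entry to its own server process.
	fmt.Println(ollamaHostEnv(port))
}
```

Note that the port is released before the server binds to it, so another process could in principle grab it in between. For tests this is usually acceptable, but a retry on bind failure would make it more robust.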