
Update Ollama version and support newest features

Open xpomul opened this issue 5 months ago • 3 comments

What it does

  • Update the ollama-js dependency to 0.5.16 to support ollama version 0.9.0
  • Add support for streaming tool calling
  • Add support for real thinking messages
  • Add support for images
  • Add support for Token Usage Counting
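
In terms of the ollama-js API, the new features look roughly like this (illustrative sketch only, not the actual Theia integration code; the think flag, the message.thinking field, and the token count fields are taken from ollama-js 0.5.16 / ollama 0.9.0 as I understand them, and the model names are just examples):

```ts
// Illustrative sketch only, not the actual Theia integration code.
// Assumes the ollama-js 0.5.16 / ollama 0.9.0 API: the `think` flag, the
// `message.thinking` field and the token counts on the final chunk.
import { Ollama } from 'ollama';

const ollama = new Ollama({ host: 'http://localhost:11434' });

async function demo(): Promise<void> {
    const stream = await ollama.chat({
        model: 'qwen3:14b',                 // example model
        think: true,                        // separate reasoning from content in the response
        stream: true,                       // tool calls can now also arrive in streamed chunks
        tools: [],                          // tool definitions go here as before
        messages: [{
            role: 'user',
            content: 'What can you see in this image?',
            images: []                      // base64-encoded images for multimodal models such as llava:7b
        }]
    });

    for await (const chunk of stream) {
        if (chunk.message.thinking) { /* forward as a "thinking" part */ }
        if (chunk.message.content) { /* forward as regular content */ }
        if (chunk.message.tool_calls) { /* forward streamed tool calls */ }
        if (chunk.done) {
            // token usage counting: prompt and completion token counts for the request
            console.log(chunk.prompt_eval_count, chunk.eval_count);
        }
    }
}
```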

How to test

Make sure to update your local ollama installation to version 0.9.0. Also, if you have used thinking models before, make sure to run ollama pull for those models once again, because they were most likely updated for ollama 0.9.0 thinking support.

Test the thinking and streaming tool calling support, e.g., with qwen3:14b: ask the Coder agent to "Find a class in the ai-core package that is not yet tested and write a unit test for it".

Test the image processing functionality, e.g., with llava:7b: configure the Universal agent with this model, add an image to the prompt, and ask the Universal agent what it can see in the image.

Test the Token Usage Counting by having a look in the corresponding AI Configuration tab after working with one of the ollama models.

Follow-ups

Breaking changes

  • [ ] This PR introduces breaking changes and requires careful review. If yes, the breaking changes section in the changelog has been updated.

Attribution

Review checklist

Reminder for reviewers

xpomul · Jun 09 '25 22:06

@xpomul I've just noticed that non-stream requests are not properly handled... In the case of 'Code Completion', the LanguageModelRequest contains the setting stream=false, which needs to be handled. Also, in this case it would perhaps be good not to use the thinking feature.

Thanks. I was not aware of that use case. I have added support for non-streaming requests back in.

A few remarks, though:

  • "thinking mode" is generally not supported in the non-streaming case. (But, on the other hand, "thinking mode" (i.e., think: true) does not actually toggle a thinking behavior in the model itself, it just changes the output from <think>...</think> to an explicit property in the resulting JSON to separate the content from the internal thoughts explicitly on the low level. Therefore, in case we use, e.g., the deepcoder model for code completions, you will still see <think>...</think> messages in the AI History View.
  • As noted in the code comment, non-streaming requests unfortunately cannot be aborted; we can only wait until the ollama.chat() promise is resolved. The original implementation before my change looked as if that was possible, but it would not work in practice.
  • I have tried multiple models, but I cannot achieve code completion results with ollama that work in any useful way. On the one hand, my M1 MacBook Pro is too slow to provide a code completion result within an acceptable time. On the other hand, the result usually contains more than the actual completion, independent of the prompt I use. I have checked the ollama docs: there seems to be a separate REST endpoint, generate, instead of chat, in which a completion prefix and suffix can be given as parameters. Maybe that would work better, but we'd need to experiment with it (rough sketch below). We'd also need a way to communicate the prefix and suffix separately from the prompt...
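
Just to illustrate the idea (untested sketch, not part of this PR; it assumes that the generate endpoint's suffix parameter is exposed by ollama-js and that a fill-in-the-middle-capable model is used):

```ts
// Untested sketch of the generate-based idea, not part of this PR.
// Assumes ollama-js exposes the generate endpoint's `suffix` parameter and
// that the model supports fill-in-the-middle completion.
import { Ollama } from 'ollama';

const ollama = new Ollama();

async function completeAtCursor(prefix: string, suffix: string): Promise<string> {
    const response = await ollama.generate({
        model: 'qwen2.5-coder:7b',     // hypothetical example model
        prompt: prefix,                // code before the cursor
        suffix,                        // code after the cursor
        options: { num_predict: 64 }   // keep completions short
    });
    // Ideally, the response then contains only the infilled completion,
    // not a full chat-style answer wrapping the code.
    return response.response;
}
```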

xpomul · Jun 14 '25 14:06

Yes, I was already talking to @JonasHelming about the fact that we need to know whether chat or code completion is requested. For now there is a marker for the agent that is hidden in the API, but not in the request itself. @xpomul we need to make that explicit in the request, because those are two different tasks.

dhuebner · Jun 14 '25 17:06

@xpomul In contrast to normal prompting, Theia puts the content assist requests into the user message; that works well with paid services, but not with tiny local models. For me, qwen3 on Ollama works reasonably well for code completion tasks, try it :)

dhuebner · Jun 14 '25 17:06

@dhuebner I have given Code Completion a bit more thought and experimentation over the weekend. I have come up with an implementation of non-streaming requests that still uses streaming under the hood. This has two advantages:

  1. If the code completion takes too long, we can abort the request immediately and thus do not block the ollama process for the next completion.
  2. I can specify think: true as part of a streaming request. As mentioned before, this tells the ollama API to separate thoughts from content, so we can filter out the thoughts (there is no way to disable thinking/reasoning in ollama, we can only filter out the reasoning part).

With the new streaming implementation, Code Completion feels more performant than calling the non-streaming Ollama API. The approach looks roughly like the sketch below.
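
For reference, the idea in simplified form (not the exact code in this PR; it assumes the streamed iterator returned by ollama-js exposes an abort() method, as in recent versions):

```ts
// Simplified sketch of the approach, not the exact code in this PR.
// Assumes the streamed iterator returned by ollama-js exposes an abort() method.
import { Ollama } from 'ollama';

const ollama = new Ollama();

async function nonStreamingChat(prompt: string, cancel: AbortSignal): Promise<string> {
    const stream = await ollama.chat({
        model: 'qwen3:14b',            // example model
        think: true,                   // thoughts arrive in message.thinking and are simply dropped
        stream: true,                  // stream under the hood, even for "non-streaming" requests
        messages: [{ role: 'user', content: prompt }]
    });

    // If the caller cancels (e.g. the completion takes too long), abort the
    // underlying stream immediately so ollama is free for the next request.
    cancel.addEventListener('abort', () => stream.abort());

    let content = '';
    try {
        for await (const chunk of stream) {
            content += chunk.message.content;   // collect content, ignore chunk.message.thinking
        }
    } catch (err) {
        if (cancel.aborted) {
            return content;                     // request was cancelled, return what we have
        }
        throw err;
    }
    return content;
}
```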

I didn't fully understand what exactly your requested changes are. If you have suggestions which I didn't cover, please tell me (again ... ;-))

xpomul · Jun 16 '25 14:06

@dhuebner Can I merge this? I assume you tested this?

JonasHelming · Jun 16 '25 14:06

@JonasHelming Code completion is working now, and the other functionality was tested already. So yes, I think we can merge it.

dhuebner · Jun 16 '25 14:06

See this nice blog post for more details: https://www.winklerweb.net/index.php/blog/12-eclipse/31-using-local-llms-in-theia-ai-with-ollama

JonasHelming · Jul 04 '25 14:07