[Feature]: Implement /api/generate for Continue.dev FIM / autocompletion with Ollama?
The Feature
I am using Ollama as a backend for my models. In Continue.dev I want to use Qwen2.5 1.5B to autocomplete my code. This works perfectly if I set up the config to talk directly to the Ollama API at http://ollamahostip:11434/api/generate.
I never got it to work when talking directly to the LiteLLM API (using the Mistral or OpenAI API format), so I tried the pass-through function, and that finally worked. However, I have two PCs running the same model for redundancy, and with a pass-through only one server would be utilized.
I also use Langfuse for monitoring the requests, and when using pass-through the API user is not visible. My question: are there any plans to implement /api/generate?
Thank you very much! Best regards, Robert
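For reference, a minimal sketch of a Continue.dev autocomplete entry for the direct-to-Ollama setup described above might look like the following (the exact model tag is an assumption, not taken from this report):

```yaml
# Sketch only: Continue.dev config.yaml entry that talks straight to Ollama.
# With provider: ollama, Continue calls <apiBase>/api/generate for FIM requests.
models:
  - name: qwen-autocomplete
    provider: ollama
    model: qwen2.5-coder:1.5b          # assumed FIM-capable model tag
    apiBase: http://ollamahostip:11434
    roles:
      - autocomplete
```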
Motivation, pitch
I always want to use LiteLLM for all my AI API requests, so it would be great if the /api/generate endpoint could be implemented.
ollama /api/generate is already supported - https://github.com/BerriAI/litellm/blob/8673f2541e88331ff576df36b21cb55ceecd0330/litellm/llms/ollama.py#L284
can you share a sample request to repro the issue? for FIM tasks we recommend using the /completions endpoint, not /chat/completions
Hi Krish, thank you for the quick reply. I was not able to find /api/generate in the Swagger of LiteLLM (https://litellm-api.up.railway.app/). Continue.dev tries to contact url:port/api/generate directly when ollama is selected as the provider (I added the LiteLLM url:4000 as the base URL to handle the requests). They do not support the OpenAI API because, as they mention, OpenAI does not support FIM, so only the Ollama and Mistral APIs are supported (see: https://docs.continue.dev/autocomplete/model-setup).
We have made some test setups and can confirm that LiteLLM breaks FIM access to the Ollama API.
Note the following test results using the latest Continue.dev plugin (also the pre-release):
- VS Code (continue.dev) --> Open WebUI --> Ollama: works!
- VS Code (continue.dev) --> Open WebUI --> LiteLLM --> Ollama: fail!
- VS Code (continue.dev) --> LiteLLM --> Ollama: fail!
- VS Code (continue.dev) --> Ollama: works!
This is unfortunate, as it prevents us from using LiteLLM as our central enterprise AI gateway.
Can confirm the same behavior: VS Code -> LiteLLM -> Ollama fails, whereas VS Code -> Ollama works.
Looking at the prompt output while it fails, it seems that the answer received is more of an instruct-model answer (e.g. "The given snippet is a python code that try to perform ...").
any workaround for this?
Similar issue with vLLM: only receiving <|fim_middle|> as the response (using Qwen Coder).
We can use Pass Through Endpoints to do this. https://docs.litellm.ai/docs/proxy/pass_through
Here is an example LiteLLM config file:
general_settings:
  pass_through_endpoints:
    - path: "/api/generate"
      target: "http://localhost:11434/api/generate"
      forward_headers: True
And in the Continue config file, the corresponding provider should be ollama, not openai.
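A rough sketch of what that Continue.dev entry might look like, assuming the LiteLLM proxy listens on port 4000 (host and model tag are placeholders, not taken from this thread):

```yaml
# Sketch only: Continue.dev entry for the pass-through workaround above.
# provider: ollama makes Continue call <apiBase>/api/generate, and the
# pass_through_endpoints rule forwards that to the real Ollama server.
models:
  - name: autocomplete-passthrough
    provider: ollama
    model: qwen2.5-coder:1.5b            # placeholder model tag
    apiBase: http://litellm-host:4000    # the LiteLLM proxy, not Ollama itself
    roles:
      - autocomplete
```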
But this approach doesn't feel elegant enough, because it bypasses a lot of the features provided by LiteLLM.
Hey all, what is the required change on our end to support this?
Our ollama/ route already calls /api/generate.
I guess the cause of the problem is that the prompt is different, but I don't know how to fix this. Maybe by adding an ollama_fim model prefix?
https://github.com/BerriAI/litellm/blob/97ed4d3a16414d22764e95b2cf39ed8672206734/litellm/litellm_core_utils/prompt_templates/factory.py#L215
When Continue's provider is set to openai, the beginning of the prompt among the parameters received by LiteLLM's /completions endpoint is as follows:
But the prompt field received by the Ollama /api/generate endpoint looks like this:
That causes the following issue.
Can confirm the same behavior, VS Code -> LiteLLM -> Ollama fails, whereas VS Code -> Ollama works. Looking at the prompt output while failing it seems that the answer received is more of an instruct-model answer (e.g. "The given snippet is a python code that try to perform ...").
For comparison, when using pass-through mode, the prompt received by Ollama does not have the "### User" prefix, so it works.
@krrishdholakia
Oh, I think I know the fix: we can just support the non-templated call via /completions.
It makes sense to apply the template on /chat/completions, since that's what that route is asking for.
Are you able to call /completions from VS Code?
The openai provider in Continue will call the LiteLLM /completions endpoint (so VS Code is just calling /completions directly), but then LiteLLM calls the Ollama /api/generate endpoint with a "### User:" prefix in the prompt field.
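For illustration only (the actual payloads from this thread were not captured here), the difference might look roughly like this, assuming qwen2.5-coder-style FIM tokens:

```yaml
# Hypothetical illustration, not the real payloads from this issue.
# What Continue sends to the LiteLLM /completions endpoint (raw FIM prompt):
continue_to_litellm:
  prompt: "<|fim_prefix|>def add(a, b):\n    <|fim_suffix|>\n<|fim_middle|>"
# What Ollama /api/generate receives after LiteLLM applies a chat-style template:
litellm_to_ollama:
  prompt: "### User:\n<|fim_prefix|>def add(a, b):\n    <|fim_suffix|>\n<|fim_middle|>"
```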
That's fixable - thanks for confirming that
Thanks for the fix, I've tested it with the latest stable version.
It is forwarding the prompt to Ollama now.
I get this warning before it crashes:
time=2025-04-09T11:16:22.074+02:00 level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-04-09T11:16:22.076+02:00 level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.value_length default=128
Do I need to configure some additional parameter manually on LiteLLM? On Continue.dev I have simply configured it the following way:
- name: autocomplete-prod
  provider: openai
  model: autocomplete-prod
  apiBase: http://<host>:4000
  apiKey:
  defaultCompletionOptions:
    contextLength: 8000
    maxTokens: 1500
  roles:
    - autocomplete
In the LiteLLM logs I see that the prompt is forwarded to Ollama and the call was successful, but it generated no response (token usage 592 (591+1)).
On LiteLLM:
- model_name: autocomplete-prod
  litellm_params:
    api_base: http://<host>:11434
    api_key: ollama
    model: ollama/qwen2.5-coder:3b
    drop_params: true
Thx for your help! best regards, Robert
Doesn't that sound like an ollama issue at this point? @deliciousbob
Please let me know if you see something we can do better here
@krrishdholakia Same issue here. After debugging, the first word of the model output is "```", and Continue shuts down the stream?
this sounds like an ollama issue - am i missing something?
You can see the request being sent by litellm to ollama by enabling debug logs - --detailed_debug and looking for the "POST Request Sent from LiteLLM:" string - https://docs.litellm.ai/docs/proxy/debugging#debug-logs
Something doesn't add up here, and I don't know where the problem is either.
Maybe LiteLLM is using the chat template instead of the completion template.
Or maybe Ollama itself is doing it, because LiteLLM is changing /generate to /chat?
@krrishdholakia
@mlibre what's the config and request to the proxy when you see /generate changing to /chat?
Sorry for the confusion, my mistake. I thought I saw it in the LiteLLM documentation, but it's not actually there!
Still, last month I tried everything to get it working. However, the responses came back as chat-like, not as completions. I'm pretty sure the model was interpreting the request as a chat request.
When I changed the settings to send the request directly to the ollama server, everything worked perfectly!
@krrishdholakia
Hi,
I am also facing the same issue with the LiteLLM integration via Continue. Is there any workaround/fix for it?
Thanks
@tarekabouzeid what issue do you see? Can you file a new GitHub issue with steps to repro?
Some of our users reported getting the warning below and are not getting any usable results compared to the GitHub Copilot models:
They tried several other models and got the same warning. But I am not 100% sure whether it's an integration issue due to prompting, or whether the models they tested just aren't compatible with this feature. What do you think?
Models tested: llama-33-70b-instruct, llama-31-8b-instruct, mistral-24b-instruct, salamandra-7b-instruct, qwen-32b-instruct
Only models that are trained for FIM are compatible. For autocomplete you want a small, fast model (qwen2.5-coder-3b works well), and for chat you use a bigger model, like Codestral or any model >30B (e.g. GPT-4.1).
I managed to get it to work with LiteLLM and vLLM as a backend. Use "server:port/v1" as the base URL and lmstudio as the provider (even if you use vLLM; at least that worked for me).
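A rough sketch of what that Continue.dev entry might look like, assuming the LiteLLM proxy is reachable at server:port (model name and host are placeholders, not confirmed by the poster):

```yaml
# Sketch only: Continue.dev entry using the lmstudio provider against a LiteLLM
# proxy that fronts a vLLM backend. Host, port, and model name are assumptions.
models:
  - name: autocomplete-vllm
    provider: lmstudio
    model: autocomplete-prod            # whatever model_name LiteLLM exposes
    apiBase: http://server:port/v1      # LiteLLM base URL with the /v1 suffix
    roles:
      - autocomplete
```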