
[Feature]: Implement /api/generate for Continue.dev FIM / autocompletion with Ollama?

Open deliciousbob opened this issue 1 year ago • 13 comments

The Feature

I am using Ollama as a backend for my models. In Continue.dev I want to use Qwen2.5 1.5B to autocomplete my code. This works perfectly if I set up the config to talk directly to the Ollama API at http://ollamahostip:11434/api/generate.

I never got it to work when talking directly to the LiteLLM API (using the mistral or openai API), so I tried the pass-through function, and that finally worked. However, I have two PCs running the same model for redundancy, so with a pass-through only one server would be utilized.

I also use Langfuse for monitoring the requests, and when using pass-through the API user is not visible. My question: are there any plans to implement /api/generate?

Thank you very much! Best regards, Robert

Motivation, pitch

I want to use LiteLLM for all my AI API requests, so it would be great if the /api/generate endpoint could be implemented.

Twitter / LinkedIn details

No response

deliciousbob avatar Nov 25 '24 21:11 deliciousbob

ollama /api/generate is already supported - https://github.com/BerriAI/litellm/blob/8673f2541e88331ff576df36b21cb55ceecd0330/litellm/llms/ollama.py#L284

can you share a sample request to repro the issue? for FIM tasks we recommend using the /completions endpoint, not /chat/completions

krrishdholakia avatar Nov 26 '24 09:11 krrishdholakia

Hi Krish, thank you for the quick reply. I was not able to find /api/generate in the Swagger of LiteLLM (https://litellm-api.up.railway.app/). Continue.dev tries to contact url:port/api/generate directly when Ollama is selected as the provider (I added the LiteLLM url:4000 as the base URL to handle the requests). They do not support the OpenAI API for autocomplete because, as they mention, OpenAI does not support FIM, so only the Ollama or Mistral API is supported (see: https://docs.continue.dev/autocomplete/model-setup).

deliciousbob avatar Nov 26 '24 14:11 deliciousbob

We have set up some tests and can confirm that LiteLLM breaks FIM access to the Ollama API.

Note the following test results using the latest Continue.dev plugin (also tested with the pre-release):

  • VS Code (continue.dev) --> Open WebUI --> Ollama: works!
  • VS Code (continue.dev) --> Open WebUI --> LiteLLM --> Ollama: fail!
  • VS Code (continue.dev) --> LiteLLM --> Ollama: fail!
  • VS Code (continue.dev) --> Ollama: works!

This is unfortunate, as it prevents us from using LiteLLM as the central enterprise AI gateway.

universam1 avatar Jan 09 '25 09:01 universam1

Can confirm the same behavior: VS Code -> LiteLLM -> Ollama fails, whereas VS Code -> Ollama works. Looking at the prompt output while it fails, the answer received looks more like an instruct-model answer (e.g. "The given snippet is a python code that try to perform ...").

wizche avatar Jan 09 '25 10:01 wizche

any workaround for this?

raihan0824 avatar Feb 28 '25 03:02 raihan0824

similar issue with vllm

only receiving <|fim_middle|> as response (using qwen coder)

JakubCerven avatar Mar 14 '25 06:03 JakubCerven

We can use Pass Through Endpoints to do this. https://docs.litellm.ai/docs/proxy/pass_through

Here is an example litellm config file:

general_settings:
  pass_through_endpoints:
    - path: "/api/generate"
      target: "http://localhost:11434/api/generate"
      forward_headers: True

And in the Continue config file, the corresponding provider should be ollama, not openai.

But this approach doesn't feel elegant, because it bypasses a lot of the features provided by litellm.
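
For reference, a minimal sketch of the matching Continue model entry when using this pass-through (assuming Continue's YAML config format; the model name, host, and port are placeholders):

models:
  - name: qwen-autocomplete
    provider: ollama                   # Continue then speaks the Ollama API
    model: qwen2.5-coder:3b            # placeholder model served by Ollama
    apiBase: http://litellm-host:4000  # LiteLLM proxy; the pass-through forwards /api/generate to Ollama
    roles:
      - autocomplete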

ggqshr avatar Mar 17 '25 03:03 ggqshr

Hey all, what is the required change on our end, to support this?

krrishdholakia avatar Mar 17 '25 03:03 krrishdholakia

Our ollama/ route already calls /api/generate

krrishdholakia avatar Mar 17 '25 03:03 krrishdholakia

I guess the cause of the problem is that the prompt is different, but I don't know how to fix this. Maybe by adding an ollama_fim model prefix? https://github.com/BerriAI/litellm/blob/97ed4d3a16414d22764e95b2cf39ed8672206734/litellm/litellm_core_utils/prompt_templates/factory.py#L215

When Continue's provider is set to openai, the prompt received by LiteLLM's /completions endpoint begins as follows:

[screenshot: prompt received by LiteLLM /completions]

But the prompt field received by the Ollama /api/generate endpoint looks like this:

[screenshot: prompt forwarded to Ollama /api/generate]

That causes the issue quoted below:

Can confirm the same behavior: VS Code -> LiteLLM -> Ollama fails, whereas VS Code -> Ollama works. Looking at the prompt output while it fails, the answer received looks more like an instruct-model answer (e.g. "The given snippet is a python code that try to perform ...").

For comparison, when using pass-through mode, the prompt received by Ollama does not have the ### User prefix, so it works. @krrishdholakia

ggqshr avatar Mar 17 '25 07:03 ggqshr

Oh - I think I know the fix - we can just support the non-templated call via /completions.

Applying the template makes sense on /chat/completions, since that's what that route is asking for, but /completions can pass the prompt through as-is.

Are you able to call /completions from VS Code?

krrishdholakia avatar Mar 17 '25 15:03 krrishdholakia

When Continue's provider is set to openai, the prompt received by LiteLLM's /completions endpoint begins as follows:

The openai provider in Continue already calls the LiteLLM /completions endpoint (so VS Code is calling /completions directly), but LiteLLM then calls the Ollama /api/generate endpoint with a ### User: prefix in the prompt field.

ggqshr avatar Mar 18 '25 01:03 ggqshr

That's fixable - thanks for confirming that

krrishdholakia avatar Mar 18 '25 01:03 krrishdholakia

Thanks for the fix, I've tested it with the latest stable version. It forwards the prompt to Ollama at :11434/api/generate, but it immediately crashes my Ollama instance with Qwen2.5-Coder-3B, after which Ollama restarts.

I get this warning before the crash:

time=2025-04-09T11:16:22.074+02:00 level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-04-09T11:16:22.076+02:00 level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.value_length default=128

Do I need to configure some additional parameter manually on LiteLLM? On Continue.dev I have simply configured it the following way:

- name: autocomplete-prod
  provider: openai
  model: autocomplete-prod
  apiBase: http://:4000
  apiKey:
  defaultCompletionOptions:
    contextLength: 8000
    maxTokens: 1500
  roles:
    - autocomplete

In the LiteLLM logs I see that the prompt is forwarded to Ollama and the call is marked successful, but it generates no response (token usage 592 (591 + 1)).

On LiteLLM:

- model_name: autocomplete-prod
  litellm_params:
    api_base: http://:11434
    api_key: ollama
    model: ollama/qwen2.5-coder:3b
    drop_params: true

Thanks for your help! Best regards, Robert

deliciousbob avatar Apr 09 '25 14:04 deliciousbob

Doesn't that sound like an ollama issue at this point? @deliciousbob

Please let me know if you see something we can do better here

krrishdholakia avatar Apr 09 '25 15:04 krrishdholakia

Additionally @krrishdholakia, same issue here: after debugging, I see the model's first output token is "```", and then Continue shuts down the stream?

aleccai8 avatar May 26 '25 12:05 aleccai8

this sounds like an ollama issue - am i missing something?

You can see the request being sent by litellm to ollama by enabling debug logs - --detailed_debug and looking for the "POST Request Sent from LiteLLM:" string - https://docs.litellm.ai/docs/proxy/debugging#debug-logs

krrishdholakia avatar May 26 '25 16:05 krrishdholakia

this sounds like an ollama issue - am i missing something?

You can see the request being sent by litellm to ollama by enabling debug logs - --detailed_debug and looking for the "POST Request Sent from LiteLLM:" string - https://docs.litellm.ai/docs/proxy/debugging#debug-logs

Something doesn't add up here. I don't know exactly where the problem is either. Maybe one of these:

  • Maybe LiteLLM is applying the chat template instead of the completion template.
  • Or maybe Ollama itself is doing it, because LiteLLM is changing /generate to /chat?


@krrishdholakia

mlibre avatar May 26 '25 18:05 mlibre

@mlibre what's the config and request to the proxy when you see /generate changing to /chat?

krrishdholakia avatar May 26 '25 18:05 krrishdholakia

@mlibre what's the config and request to the proxy when you see /generate changing to /chat?

Sorry for the confusion — my mistake. I thought I saw it in the LiteLLM documentation, but it’s not actually there!

Still, last month I tried everything to get it working. However, the responses came back as chat-like, not as completions. I'm pretty sure the model was interpreting the request as a chat request.

When I changed the settings to send the request directly to the ollama server, everything worked perfectly!

@krrishdholakia

mlibre avatar May 27 '25 17:05 mlibre

Hi,

I am also facing the same issue with the integration of LiteLLM models via Continue. Is there any workaround/fix for it?

Thanks

tarekabouzeid avatar Aug 15 '25 15:08 tarekabouzeid

@tarekabouzeid what issue do you see? Can you file a new GitHub issue with steps to repro?

ishaan-jaff avatar Aug 15 '25 15:08 ishaan-jaff

@tarekabouzeid what issue do you see? Can you file a new GitHub issue with steps to repro?

Some of our users reported getting the warning below and are not getting any usable results compared to GitHub Copilot models:

[screenshot: warning reported by users]

They tried several other models and got the same warning. But I am not 100% sure whether it's an integration issue due to prompting, or whether the models they tested just aren't compatible with this feature. What do you think?

models tested: llama-33-70b-instruct, llama-31-8b-instruct, mistral-24b-instruct, salamandra-7b-instruct, qwen-32b-instruct

tarekabouzeid avatar Aug 15 '25 15:08 tarekabouzeid

Only models that are trained for FIM are compatible. For autocomplete you want a small, fast model (qwen2.5-coder-3b works well), and for chat you use a bigger model, like Codestral or any model >30B (gpt-4.1).

I managed to get it to work with LiteLLM and vLLM as the backend. Use "server:port/v1" as the base_url and lmstudio as the provider (even if you use vLLM; at least that worked for me).
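
For anyone trying to reproduce this, a rough sketch of what that Continue config might look like (assuming Continue's YAML config format; the model name, host, and port are placeholders):

models:
  - name: autocomplete-prod
    provider: lmstudio                     # per the comment above, works even with vLLM/LiteLLM behind it
    model: autocomplete-prod               # placeholder; use the model name exposed by your LiteLLM proxy
    apiBase: http://litellm-host:4000/v1   # the "server:port/v1" base_url mentioned above
    roles:
      - autocomplete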

deliciousbob avatar Aug 15 '25 16:08 deliciousbob