Model providers should be community driven

Open · isidorn opened this issue

@ggerganov makes a good point here https://x.com/ggerganov/status/1925587681519579594

In short, the Ollama option should not be called "Ollama" but "Custom endpoint".

So ideally:

  1. We finalize the LanguageModelProvider API in June
  2. We decide which of the built-in providers should not be built-in (Cerebras, Groq)
  3. We reach out to all the built-in providers that should not be built-in and give them a heads up that they can now publish an extension
  4. We change the "Ollama" one to be called "Custom endpoint"
  5. We add some UI entry point to "Add more providers" -> shows extension view and filters on @tag:language-model-provider

fyi @sbatten @pierceboggan @jrieken

isidorn avatar May 23 '25 08:05 isidorn

Looking forward to having this feature supported in VS Code! We will try our best on our side - llama.cpp - to make the setup as frictionless as possible for end users, cc @ggerganov @slaren

ngxson avatar May 23 '25 09:05 ngxson

Note that the "Custom endpoint" should expect that the standard OpenAI APIs are being served (e.g. /v1/chat/completions). This could be configurable to support additional APIs, but the default should be the OAI API.

ggerganov avatar May 26 '25 18:05 ggerganov
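The default described above can be sketched concretely: the user supplies only a base URL, and the client appends the standard OpenAI-compatible path. The helper below is hypothetical illustration, not VS Code code.

```python
# Hypothetical resolver for a "Custom endpoint" provider: the user supplies
# a base URL and the client appends the standard OpenAI-compatible path.

def chat_completions_url(base_url: str) -> str:
    return base_url.rstrip("/") + "/v1/chat/completions"

# Works the same for llama.cpp's server, Ollama's OpenAI layer, etc.:
print(chat_completions_url("http://localhost:8000/"))
# → http://localhost:8000/v1/chat/completions
```

Making only the base URL configurable keeps the "Custom endpoint" option server-agnostic, since any OpenAI-compatible server exposes the same path.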

Adding that making it a Chat Completions endpoint will not only make the endpoint nearly universal across LLM applications (the vast majority of local and proprietary hosting solutions expose such an endpoint), but will also clear up confusion around which sampler settings the endpoint accepts.

In terms of usability, the community will get far more bang for the buck from compatibility with an industry-standard schema like Chat Completions than from focusing on a proprietary schema used by only a small handful of applications.

SomeOddCodeGuy avatar May 30 '25 19:05 SomeOddCodeGuy
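For reference, a minimal Chat Completions request body with the common sampler settings mentioned above looks roughly like this; field names follow the OpenAI Chat Completions schema, and the model name is a placeholder.

```python
import json

# Minimal OpenAI-compatible Chat Completions request body. temperature,
# top_p, and max_tokens are the sampler settings most servers agree on;
# "local-model" is a placeholder name.
payload = {
    "model": "local-model",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 256,
    "stream": True,
}

body = json.dumps(payload)
```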

If I go to set a BYOK provider but don't see what I'm looking for, there should be a More Providers option that populates based on what we see from extensions, similar to our theme configuration, which is populated from the Marketplace. Obviously not a P0, but it would be good UX to have.

pierceboggan avatar Jun 05 '25 06:06 pierceboggan

It is nice to support the standard OpenAI APIs (e.g. the good old /v1/chat/completions); it guarantees wide support. But I wonder if there will be future native support for OpenAI's Responses API? (cc @lramos15 @jrieken) The Responses API can improve tool-calling performance by 6-10% and cut token costs by raising cache hit rates by up to 80% (claims from OpenAI's cookbook that I can't find right now):

> For the best results with this change, we recommend using the Responses API with the store parameter set to true, and passing in all reasoning items from previous requests (either using previous_response_id, or by taking all the output items from an older request and passing them in as input items for a new one). OpenAI will automatically include any relevant reasoning items in the model's context and ignore any irrelevant ones. In more advanced use-cases where you'd like to manage what goes into the model's context more precisely, we recommend that you at least include all reasoning items between the latest function call and the previous user message. Doing this will ensure that the model doesn't have to restart its reasoning when you respond to a function call, resulting in better function-calling performance and lower overall token usage.

Source: https://platform.openai.com/docs/guides/reasoning-best-practices#how-to-keep-costs-low-and-accuracy-high

https://platform.openai.com/docs/guides/responses-vs-chat-completions

insilications avatar Jun 05 '25 07:06 insilications
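The chaining described in the quote can be sketched as two request bodies, where the follow-up request passes the first response's id back via previous_response_id. The model name and ids below are placeholders.

```python
# Sketch of Responses API request chaining. Field names follow OpenAI's
# documented Responses API; the model name and ids are placeholders.

first_request = {
    "model": "gpt-5",
    "input": "Plan the refactor; call tools as needed.",
    "store": True,  # let the server retain reasoning items for reuse
}

# Suppose the server's reply carried this id (placeholder):
first_response_id = "resp_abc123"

# The follow-up request points back at the previous response, so the
# model can reuse its earlier reasoning instead of re-deriving a plan.
second_request = {
    "model": "gpt-5",
    "previous_response_id": first_response_id,
    "input": [
        # Tool result being returned to the model.
        {"type": "function_call_output", "call_id": "call_1", "output": "ok"}
    ],
}
```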

It would be great if it were possible to have AI features without setting up a Copilot account, i.e. a completely offline mode.

solomatov avatar Jun 19 '25 15:06 solomatov

> But I wonder if there will be future native API support for OpenAI's Responses API? [...]

This is already supported via github.copilot.chat.byok.responsesApi

mubaidr avatar Aug 13 '25 15:08 mubaidr
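For anyone wanting to try the setting named above: it goes in settings.json. The boolean value shown here is an assumption; check the setting's description in the Settings editor before relying on it.

```json
{
  // Assumed to be a boolean toggle; verify in the Settings editor.
  "github.copilot.chat.byok.responsesApi": true
}
```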

> This is already supported via github.copilot.chat.byok.responsesApi

Amazing!

insilications avatar Aug 13 '25 21:08 insilications

@mubaidr Do you know whether the current Responses API implementation uses tools and/or reasoning and passes in all reasoning items from previous requests via previous_response_id? See https://platform.openai.com/docs/guides/latest-model#further-reading

> Reusing reasoning context with the Responses API
>
> We strongly recommend using the Responses API when using GPT-5 to unlock improved agentic flows, lower costs, and more efficient token usage in your applications.
>
> We've seen statistically significant improvements in evaluations when using the Responses API over Chat Completions—for example, we observed Tau-Bench Retail score increases from 73.9% to 78.2% just by switching to the Responses API and including previous_response_id to pass back previous reasoning items into subsequent requests. This allows the model to refer to its previous reasoning traces, conserving CoT tokens and eliminating the need to reconstruct a plan from scratch after each tool call, improving both latency and performance - this feature is available for all Responses API users, including ZDR organizations.

insilications avatar Aug 13 '25 22:08 insilications

We have finalized the language model provider API, and here is the extension guide https://code.visualstudio.com/api/extension-guides/ai/language-model-chat-provider

In the meantime we have added a Custom OpenAI-compatible endpoint. There is still work to be done (I have updated the initial comment).

Any feedback is very much welcome. Thank you 🙏

isidorn avatar Sep 01 '25 09:09 isidorn

> In the meantime we have added a Custom OpenAI compatible endpoint.

I checked the latest version of VS Code, but I don't see the custom endpoint:

Image

Is this expected?

ggerganov avatar Sep 03 '25 10:09 ggerganov

@ggerganov I guess you can set a custom OpenAI-compatible endpoint via this setting, though it's a bit hidden:

Image

julien-c avatar Sep 03 '25 10:09 julien-c

Ah, or this actually: https://github.com/microsoft/vscode-copilot-chat/pull/621 - but you'll need to define each model one by one

julien-c avatar Sep 03 '25 10:09 julien-c

@ggerganov thanks for giving it a try. It is in https://code.visualstudio.com/insiders/ and will be in the next VS Code stable release (in 10 days).

Image

isidorn avatar Sep 03 '25 12:09 isidorn

@isidorn Is GitHub Copilot registration still required to use these models?

solomatov avatar Sep 03 '25 13:09 solomatov

> @isidorn Is github copilot registration still required to use these models?

At this time, yes

lramos15 avatar Sep 03 '25 13:09 lramos15

@isidorn That is great! From Docker's side, we will add docs about how to wire up Docker Model Runner as a provider as soon as this is released 🙂

kiview avatar Sep 03 '25 15:09 kiview

> @isidorn That is great, from Docker's side, we will add docs about how to wire up Docker Model Runner as a provider as soon as this is released 🙂

All of this is already available in VS Code insiders + pre-release if you want to get started. It will come to stable next week :)

lramos15 avatar Sep 03 '25 16:09 lramos15

At the last minute we discovered some issues with the OpenAI-compatible provider and decided to push it out to the next Stable (start of November). It is available in Insiders, so you can use it and provide feedback. Sorry about this 🙏

The finalization of the BYOK API will be shipped as part of this stable (Sept 11th) as planned.

isidorn avatar Sep 08 '25 09:09 isidorn

@isidorn Thank you and all the VS Code team for making the Chat and Copilot experience more customizable

insilications avatar Sep 09 '25 08:09 insilications

> @isidorn Is github copilot registration still required to use these models?

@solomatov, see backlog issue:

  • #246551

That issue and:

  • #254473

are the last major impediments to us using VS Code as our agentic coding tool for research and production at our local institution.

bartlettroscoe avatar Sep 13 '25 13:09 bartlettroscoe

API is finalized, closing.

lramos15 avatar Sep 29 '25 18:09 lramos15

There are still a couple of follow-up items that we will do in the following (October) milestone:

  • [ ] Make discovery of extension language model providers dynamic and easy https://github.com/microsoft/vscode/issues/251267
  • [ ] Clean up the list of providers that are built-in

isidorn avatar Sep 30 '25 09:09 isidorn

I was trying this in VS Code Insiders, but it fails to connect to the remote machine where I'm running llama.cpp. I get this from GitHub Copilot Chat in the Output panel:

2025-11-03 15:08:48.881 [error] Error: net::ERR_TIMED_OUT
    at SimpleURLLoaderWrapper.<anonymous> (node:electron/js2c/utility_init:2:10610)
    at SimpleURLLoaderWrapper.emit (node:events:519:28)
    at SimpleURLLoaderWrapper.callbackTrampoline (node:internal/async_hooks:130:17): Error on conversation request
2025-11-03 15:08:48.881 [info] ccreq:e539bd27.copilotmd | networkError | GPT-OSS:120b | 30012ms | [copilotLanguageModelWrapper]
2025-11-03 15:08:48.887 [info] ccreq:62bfb744.copilotmd | failed | GPT-OSS:120b | 30047ms | [panel/unknown-external]

Earlier in the Output I see:

2025-11-03 15:12:27.697 [info] BYOK: Resolved URL for model CustomOAI/GPT-OSS:120b: http://redacted-hostname:8000/v1/chat/completions

The endpoint itself is correct, since a curl against it does reach the server:

$ curl http://redacted-hostname:8000/v1/chat/completions
{"error":{"code":401,"message":"Invalid API Key","type":"authentication_error"}}% 

If I proxy that endpoint locally and change the configured endpoint to http://localhost:8000, it works as expected.

For what it's worth: that endpoint is connected to my machine using Tailscale.

Insider version:

Version: 1.106.0-insider
Commit: 13330e1420d478afab7ce453f6d1e307abf7412e
Date: 2025-11-03T05:02:49.231Z
Electron: 37.7.0
ElectronBuildId: 12597478
Chromium: 138.0.7204.251
Node.js: 22.20.0
V8: 13.8.258.32-electron.0
OS: Linux x64 6.17.6-arch1-1

hgaiser avatar Nov 03 '25 14:11 hgaiser
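One way to narrow this kind of failure down (a sketch with a hypothetical helper, not part of VS Code) is to classify whether the endpoint is unreachable at the network level, versus reachable but rejecting the request, like the 401 in the curl output above:

```python
import urllib.request
import urllib.error

def probe(url: str, timeout: float = 5.0) -> str:
    """Classify endpoint reachability from this process's point of view."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return "ok"
    except urllib.error.HTTPError as e:
        # The server answered with an HTTP error (e.g. the 401 above):
        # the network path works, so the problem is auth or config.
        return f"http {e.code}"
    except (urllib.error.URLError, OSError):
        # Timeout, DNS failure, or connection refused: the network path
        # itself (e.g. the Tailscale interface) is the problem.
        return "unreachable"
```

Running this from a plain terminal and comparing against the behavior inside the editor can show whether the timeout is specific to how the VS Code process reaches the network.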

@hgaiser Please file a new issue so we can investigate

lramos15 avatar Nov 03 '25 16:11 lramos15