Model providers should be community driven
@ggerganov makes a good point here https://x.com/ggerganov/status/1925587681519579594
In short, the Ollama option should not be called "Ollama" but "Custom endpoint".
So ideally:
- We finalize the LanguageModelProvider API in June
- We decide which of the built-in providers should not be built-in (Cerebras, Groq)
- We reach out to all the built-in providers that should not be built-in and give them a heads up that they can now publish an extension
- We change the "Ollama" one to be called "Custom endpoint"
- We add some UI entry point to "Add more providers" -> shows extension view and filters on @tag:language-model-provider
fyi @sbatten @pierceboggan @jrieken
Looking forward to having this feature supported in VS Code! We will try our best on our side (llama.cpp) to make the setup as frictionless as possible for end users, cc @ggerganov @slaren
Note that the "Custom endpoint" should expect that the standard OpenAI APIs are being served (e.g. /v1/chat/completions). This could be configurable to support additional APIs, but the default should be the OAI API.
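To illustrate, a minimal sketch of the kind of request a "Custom endpoint" provider would issue against a standard OpenAI-compatible server. The base URL and model name here are assumptions (e.g. a local llama.cpp server); only the `/v1/chat/completions` path and the request shape come from the OpenAI API:

```python
import json

# Assumption: a local OpenAI-compatible server (e.g. llama.cpp) on this port.
BASE_URL = "http://localhost:8080"

def chat_completions_request(model, messages, stream=True):
    """Build the URL and JSON body for a standard OpenAI-style
    /v1/chat/completions request."""
    url = f"{BASE_URL}/v1/chat/completions"
    body = json.dumps({"model": model, "messages": messages, "stream": stream})
    return url, body

# Example request against the (hypothetical) local server.
url, body = chat_completions_request(
    "gpt-oss-120b",
    [{"role": "user", "content": "Hello"}],
)
```

Because this shape is shared by most local and hosted backends, a provider that emits it works against llama.cpp, Ollama, vLLM, and the hosted OpenAI API alike.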
Adding that making this a Chat Completions endpoint would not only make it nearly universal across LLM applications, since the vast majority of local and proprietary hosting solutions expose such an endpoint, but would also clear up confusion around which sampler settings the endpoint accepts.
In terms of usability, the community will get far more bang for the buck from compatibility with an industry-standard schema like Chat Completions than from a proprietary schema used by only a small handful of applications.
If I go to set a BYOK provider but don't see what I'm looking for, there should be a More Providers option that populates based on what we see from extensions, similar to our theme configuration, which is populated from the Marketplace. Obviously not a P0, but it would be good UX to have.
It is nice to support the standard OpenAI APIs (e.g. the good old /v1/chat/completions); it guarantees wide support. But I wonder if there will be future native support for OpenAI's Responses API? (cc @lramos15 @jrieken) The Responses API can improve tool-calling performance by 6-10% and reduce token costs by increasing cache hits by up to 80% (claims from OpenAI's cookbook that I can't find right now):
> For the best results with this change, we recommend using the Responses API with the store parameter set to true, and passing in all reasoning items from previous requests (either using previous_response_id, or by taking all the output items from an older request and passing them in as input items for a new one). OpenAI will automatically include any relevant reasoning items in the model's context and ignore any irrelevant ones. In more advanced use-cases where you'd like to manage what goes into the model's context more precisely, we recommend that you at least include all reasoning items between the latest function call and the previous user message. Doing this will ensure that the model doesn't have to restart its reasoning when you respond to a function call, resulting in better function-calling performance and lower overall token usage.

https://platform.openai.com/docs/guides/reasoning-best-practices#how-to-keep-costs-low-and-accuracy-high
https://platform.openai.com/docs/guides/responses-vs-chat-completions
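The chaining the quote describes can be sketched as follows. This is a hedged sketch of the request *body* only, not a client implementation; the model name and response id are hypothetical, while `input`, `store`, and `previous_response_id` are the documented Responses API parameters:

```python
def responses_request(model, user_input, previous_response_id=None):
    """Sketch of a Responses API request body. Passing previous_response_id
    lets the server reuse reasoning items from the prior turn, so the
    client does not have to resend them."""
    body = {"model": model, "input": user_input, "store": True}
    if previous_response_id is not None:
        body["previous_response_id"] = previous_response_id
    return body

# First turn: nothing to chain from yet.
first = responses_request("gpt-5", "Refactor this function")

# Follow-up turn: chain to the (hypothetical) id returned by the first call,
# so the model keeps its earlier reasoning trace instead of replanning.
follow_up = responses_request(
    "gpt-5", "Now add tests", previous_response_id="resp_abc123"
)
```

The cost savings come from that second call: reasoning items stay server-side and cached, rather than being reconstructed after every tool call.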
It would be great if it is possible to have AI features without setting up a copilot account. I.e. completely offline mode.
> It is nice to support standard OpenAI APIs (e.g. the good old /v1/chat/completions). [...] But I wonder if there will be future native API support for OpenAI's Response API?
This is already supported via `github.copilot.chat.byok.responsesApi`
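For reference, a minimal settings.json sketch. The setting name comes from the comment above; that it takes a boolean value is an assumption, so check the setting's description in the VS Code Settings editor:

```jsonc
{
  // Assumption: boolean flag; verify in the Settings editor before relying on it.
  "github.copilot.chat.byok.responsesApi": true
}
```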
> It is nice to support standard OpenAI APIs (e.g. the good old /v1/chat/completions). [...] But I wonder if there will be future native API support for OpenAI's Response API?
>
> This is already supported via `github.copilot.chat.byok.responsesApi`
Amazing!
@mubaidr Do you know if the current implementation of the Responses API is using tools and/or reasoning and passing in all reasoning items from previous requests by using `previous_response_id`? See https://platform.openai.com/docs/guides/latest-model#further-reading
> **Reusing reasoning context with the Responses API**
>
> We strongly recommend using the Responses API when using GPT-5 to unlock improved agentic flows, lower costs, and more efficient token usage in your applications.
>
> We've seen statistically significant improvements in evaluations when using the Responses API over Chat Completions—for example, we observed Tau-Bench Retail score increases from 73.9% to 78.2% just by switching to the Responses API and including previous_response_id to pass back previous reasoning items into subsequent requests. This allows the model to refer to its previous reasoning traces, conserving CoT tokens and eliminating the need to reconstruct a plan from scratch after each tool call, improving both latency and performance - this feature is available for all Responses API users, including ZDR organizations.
We have finalized the language model provider API; here is the extension guide: https://code.visualstudio.com/api/extension-guides/ai/language-model-chat-provider
In the meantime we have added a Custom OpenAI compatible endpoint. There is still work to be done (I have updated the initial comment).
Any feedback is very much welcome. Thank you 🙏
> In the meantime we have added a Custom OpenAI compatible endpoint.
I checked the latest version of VS Code, but I don't see the custom endpoint:
Is this expected?
@ggerganov I guess you can set a custom OpenAI compatible endpoint via this setting, though it's a bit hidden:
Ah, or this actually: https://github.com/microsoft/vscode-copilot-chat/pull/621, but you'll need to define the models one by one
@ggerganov thanks for giving it a try. It is in https://code.visualstudio.com/insiders/ and will be in the next VS Code stable (in 10 days).
@isidorn Is github copilot registration still required to use these models?
@isidorn That is great, from Docker's side, we will add docs about how to wire up Docker Model Runner as a provider as soon as this is released 🙂
All of this is already available in VS Code insiders + pre-release if you want to get started. It will come to stable next week :)
Last minute we discovered some issues with the OpenAI compatible provider and decided to push it out to the next Stable (start of November). It is available in Insiders, so you can use it and provide feedback. Sorry about this 🙏
The finalization of the BYOK API will be shipped as part of this stable (Sept 11th) as planned.
@isidorn Thank you and all the VS Code team for making the Chat and Copilot experience more customizable
> @isidorn Is github copilot registration still required to use these models?
@solomatov, see backlog issue:
- #246551
That and:
- #254473
are the last major impediments that we have to using VS Code as our agentic coding tool for research and production at our local institution.
API is finalized, closing.
There are still a couple of follow-up items that we will do in the following (October) milestone:
- [ ] Make discovery of extension language model providers dynamic and easy https://github.com/microsoft/vscode/issues/251267
- [ ] Clean up the list of providers that are built-in
I was trying this in VS Code Insiders, but it fails to connect to my remote machine where I'm running llama.cpp. I get this from GitHub Copilot Chat in Output:
```
2025-11-03 15:08:48.881 [error] Error: net::ERR_TIMED_OUT
    at SimpleURLLoaderWrapper.<anonymous> (node:electron/js2c/utility_init:2:10610)
    at SimpleURLLoaderWrapper.emit (node:events:519:28)
    at SimpleURLLoaderWrapper.callbackTrampoline (node:internal/async_hooks:130:17): Error on conversation request
2025-11-03 15:08:48.881 [info] ccreq:e539bd27.copilotmd | networkError | GPT-OSS:120b | 30012ms | [copilotLanguageModelWrapper]
2025-11-03 15:08:48.887 [info] ccreq:62bfb744.copilotmd | failed | GPT-OSS:120b | 30047ms | [panel/unknown-external]
```
Earlier in the Output I see:
```
2025-11-03 15:12:27.697 [info] BYOK: Resolved URL for model CustomOAI/GPT-OSS:120b: http://redacted-hostname:8000/v1/chat/completions
```
This endpoint is correct, since if I curl that endpoint:
```shell
$ curl http://redacted-hostname:8000/v1/chat/completions
{"error":{"code":401,"message":"Invalid API Key","type":"authentication_error"}}
```
If I proxy that endpoint locally and change the endpoint to http://localhost:8000, it does work as expected.
For what it's worth: that endpoint is connected to my machine using Tailscale.
Insider version:
```
Version: 1.106.0-insider
Commit: 13330e1420d478afab7ce453f6d1e307abf7412e
Date: 2025-11-03T05:02:49.231Z
Electron: 37.7.0
ElectronBuildId: 12597478
Chromium: 138.0.7204.251
Node.js: 22.20.0
V8: 13.8.258.32-electron.0
OS: Linux x64 6.17.6-arch1-1
```
@hgaiser Please file a new issue so we can investigate.