Stream-only model reports "simulated streaming failed during non-streaming call"
as below:

```
[root@k8s-test ~]# kubectl-ai --llm-provider=openai --model=qwen3-235b-a22b
Hey there, what can I help you with today?
show me all deployment in the deafult namespace
E0508 11:30:21.562983 32248 openai.go:249] OpenAI ChatCompletion API error: POST "https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions": 400 Bad Request {"code":"invalid_parameter_error","param":null,"message":"This model only support stream mode, please enable the stream parameter to access the model. ","type":"invalid_request_error"}
Error: simulated streaming failed during non-streaming call: OpenAI chat completion failed: POST "https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions": 400 Bad Request {"code":"invalid_parameter_error","param":null,"message":"This model only support stream mode, please enable the stream parameter to access the model. ","type":"invalid_request_error"}
exit
[root@k8s-test ~]# kubectl-ai --llm-provider=openai --model=qwen-plus
Hey there, what can I help you with today?
show me all deployment in the deafult namespace
Running: kubectl get deployments -n deafult
```
Thanks @limboys for reporting the issue.
I understand we are using the OpenAI-compatible llm-provider here, but what is the underlying provider? (Alibaba?)
Looks like we might have to provide a command-line option to toggle stream mode. I am not sure whether the models API exposes this capability; otherwise the smart thing would be to detect it from the model configuration.
/cc @tuannvm @hakman @justinsb
I am hitting the same problem:
`Error: simulated streaming failed during non-streaming call`
The issue is that this model supports streaming mode only:
https://www.alibabacloud.com/help/en/model-studio/stream
According to that doc, the open-source Qwen3 models (along with some other model families listed there) only support streaming output.
The openai.go implementation does not have streaming enabled:
https://github.com/GoogleCloudPlatform/kubectl-ai/blob/main/gollm/openai.go#L122-L129
Perhaps the best option is to move this into its own LLM file, as was done for Grok, and add streaming capabilities for newer models:
```go
chatReq := openai.ChatCompletionNewParams{
    Model: openai.ChatModel(req.Model),
    Messages: []openai.ChatCompletionMessageParamUnion{
        openai.UserMessage(req.Prompt),
    },
    Stream: true, // Required for Qwen3-235B and other stream-only models
}
```
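For what it's worth, as far as I can tell, in recent versions of the openai-go SDK the stream parameter is not a field on `ChatCompletionNewParams`; streaming is requested through the `NewStreaming` helper and the chunks can be re-assembled with `ChatCompletionAccumulator`. A minimal, self-contained sketch (the endpoint, model, and prompt are purely illustrative, and this is not the actual gollm code):

```go
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/openai/openai-go"
	"github.com/openai/openai-go/option"
)

func main() {
	ctx := context.Background()
	// Illustrative: point the OpenAI-compatible client at DashScope for Qwen.
	client := openai.NewClient(
		option.WithAPIKey(os.Getenv("OPENAI_API_KEY")),
		option.WithBaseURL("https://dashscope.aliyuncs.com/compatible-mode/v1"),
	)

	// NewStreaming sets the stream parameter on the request for us.
	stream := client.Chat.Completions.NewStreaming(ctx, openai.ChatCompletionNewParams{
		Model: openai.ChatModel("qwen3-235b-a22b"),
		Messages: []openai.ChatCompletionMessageParamUnion{
			openai.UserMessage("show me all deployments in the default namespace"),
		},
	})

	// Re-assemble the streamed chunks into a single completion.
	acc := openai.ChatCompletionAccumulator{}
	for stream.Next() {
		acc.AddChunk(stream.Current())
	}
	if err := stream.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "streaming chat completion failed:", err)
		os.Exit(1)
	}
	if len(acc.Choices) > 0 {
		fmt.Println(acc.Choices[0].Message.Content)
	}
}
```

Since the accumulator ends up looking like a regular ChatCompletion, the rest of the non-streaming response handling could probably stay largely unchanged.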
We should still allow non-streaming for older models. As a workaround in the meantime, if you can, switch to an older model and try that.
Hope this helps!
Hmm, I wonder if it would be possible to provide a flag to control the addition of extra parameters, or a separate stream flag to enable or disable streaming requests, e.g. `kubectl-ai --llm-provider=openai --model=qwen3-235b-a22b --stream=true`.
What if we could add logic to determine whether the models support streaming? The flag seems reasonable to me, but it wouldn’t provide the best user experience, since users would need to know which models support streaming. In my opinion, the tool should fail fast and notify the user before entering prompt mode.
@droot I can take this
> What if we could add logic to determine whether the models support streaming? The flag seems reasonable to me, but it wouldn’t provide the best user experience, since users would need to know which models support streaming. In my opinion, the tool should fail fast and notify the user before entering prompt mode.
While this would be interesting, it means we would need to keep track of the models ourselves(?). Also, some models, like the OpenAI ones, offer streaming as a feature (see their docs). Either way, streaming should be a feature in order to cover all models: either we implement automatic model detection or we allow a --streaming true/false flag.
Agree with @zvdy, streaming as a feature will be needed anyway.
Remember, our goal here is to unblock users so they can use whatever model meets their needs best (many factors go into why a model is best for them...).
Acceptable UX:
1. Add streaming as a feature in the gollm layer and add a command-line flag (--use-streaming) to enable/disable the feature (in both the kubectl-ai and k8s-bench CLIs). [Note we had to do this for tool use; that's why the --enable-tool-use flag exists today.]
2. `kubectl-ai` can detect invalid stream-mode use (based on the error signature) and suggest that the user re-invoke the tool with the command-line flag as a workaround. (We can also document this; a rough sketch of the detection follows below.)
Note: if supporting this flag for all the llm-providers turns out to be tricky, we can introduce an llm-specific flag instead if that keeps the implementation simple. It will depend on what the implementation looks like.
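Something along these lines (purely a sketch: the helper name and the --use-streaming flag are hypothetical, and the error text is copied from the report above) could cover the detect-and-suggest part:

```go
package gollm // illustrative placement only

import (
	"fmt"
	"strings"
)

// suggestStreamingWorkaround is a hypothetical helper: if the provider
// rejected a non-streaming call because the model is stream-only (the
// error signature seen with DashScope/Qwen above), wrap the error with a
// hint to retry with the proposed --use-streaming flag.
func suggestStreamingWorkaround(err error) error {
	if err == nil {
		return nil
	}
	if strings.Contains(err.Error(), "only support stream mode") {
		return fmt.Errorf("%w\nhint: this model only supports streaming responses; retry with --use-streaming", err)
	}
	return err
}
```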
Auto-detecting/tracking model capabilities ourselves is going to be a maintenance nightmare and will increase toil on us maintainers, so it is not desirable. I doubt that llm-providers expose APIs to discover this capability easily, but I could be wrong.
@droot @zvdy please take a look: https://github.com/GoogleCloudPlatform/kubectl-ai/pull/209
@limboys @zackertypical @tuannvm can you please help test this PR https://github.com/GoogleCloudPlatform/kubectl-ai/pull/215 and confirm whether it solves your issue?
Thank you so much.
How do I build it? @droot
Still working on this. For some reason, OpenAI tool_calls differ from the Grok ones: the arguments are split and chunked, so while streaming works for Grok models, it does not yet work for OpenAI models.
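For anyone following along, the behaviour being described is that each streamed chunk only carries a fragment of a tool call's JSON arguments, keyed by the tool call index, so the fragments have to be stitched back together before the tool can be executed. A rough sketch assuming the openai-go chunk types (illustrative, not the actual PR code):

```go
package gollm // illustrative placement only

import (
	"strings"

	"github.com/openai/openai-go"
)

// accumulateToolCallArgs stitches streamed tool-call argument fragments back
// into complete JSON strings, keyed by tool call index. Illustrative only;
// openai-go's ChatCompletionAccumulator can also do this stitching.
func accumulateToolCallArgs(chunks []openai.ChatCompletionChunk) map[int]string {
	builders := map[int]*strings.Builder{}
	for _, chunk := range chunks {
		if len(chunk.Choices) == 0 {
			continue
		}
		for _, tc := range chunk.Choices[0].Delta.ToolCalls {
			idx := int(tc.Index)
			if builders[idx] == nil {
				builders[idx] = &strings.Builder{}
			}
			// Arguments arrive as partial JSON; append the fragments in order.
			builders[idx].WriteString(tc.Function.Arguments)
		}
	}
	args := map[int]string{}
	for idx, b := range builders {
		args[idx] = b.String()
	}
	return args
}
```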
Regarding the build, you would need to `git clone` my branch referenced in #215 (the branch the PR was opened from) and `go build` from source there.
Anyhow, I will come back to this issue once I manage to fix and test the solution for that specifically!
Finished the implementation and tested the supported models (Qwen, OpenAI, etc.); all work. Feel free to test it: #215
There are a few caveats about Qwen and Alibaba Cloud: point to the correct endpoint and have a look at https://github.com/GoogleCloudPlatform/kubectl-ai/pull/215#issuecomment-2870598371
- @zackertypical @limboys
Awesome, thanks @zvdy! If everything looks good, we can cut a release today.