bug: kubectl-ai invokes incomplete commands (without the main command)
Sometimes kubectl-ai will invoke tools without the main command as a prefix, for example:
# without kubectl prefix
Running: get pods -n web
Even though we specify in the tool description that the command should be a complete command including the kubectl prefix, some models tend to be bad at instruction following. It also suggests we have a gap in our instructions for tool use.
I recently tried a custom tool and discovered that top-tier foundation models were running into this problem, so I suspect we are missing something basic here.
In this video, you can see a user encountering it repeatedly with the qwen-coder model on the ollama provider.
A few ideas for exploration:
- Tweak the main prompt
- Tweak the tool-use or function-calling instructions
- Additional checks in the code to ensure we never invoke a command without the main command; this could be tricky for composite commands (pipe syntax, etc.). A rough sketch of such a check is below.
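As a rough illustration only (not the project's actual implementation), the simplest version of that check could look at the first token of the generated command and prepend the expected binary when it is missing. The function name `ensureMainCommand` and the hard-coded `kubectl` default are assumptions for the example, and the caveat from the last bullet applies: a plain prefix check mishandles pipelines, `&&` chains, and other composite commands.

```go
package main

import (
	"fmt"
	"strings"
)

// ensureMainCommand is a hypothetical guard: if the model-generated command
// does not start with the expected binary (e.g. "kubectl"), prepend it.
// This naive version only inspects the first whitespace-separated token,
// so pipelines, "&&" chains, and env-var prefixes are not handled.
func ensureMainCommand(cmd, binary string) string {
	fields := strings.Fields(cmd)
	if len(fields) == 0 || fields[0] == binary {
		return cmd
	}
	return binary + " " + cmd
}

func main() {
	// The model emitted "get pods -n web" instead of "kubectl get pods -n web".
	fmt.Println(ensureMainCommand("get pods -n web", "kubectl"))
	// Already-complete commands pass through unchanged.
	fmt.Println(ensureMainCommand("kubectl get nodes", "kubectl"))
}
```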
@hakman @zvdy @tuannvm @selimacerbas wondering if you guys have run into this issue.
To be honest, while watching the video I told myself, what the heck is that command! I never experienced it during my tests (I was only using the gemini-flash 2.0 model). Your suggestions are on point. I could take a look at it. @droot
This seems familiar. I think there is a problem choosing the right tool there. If it had used kubectl, it would have worked. My guess is that the model generates the args for kubectl, but then tries to use bash to run them, which is odd.
An additional observation: the current display of the running command is confusing. I would like to see the exact tool that is being called.
I encountered a similar issue in another scenario.
I used the --custom-tools feature mentioned in the README. The command is as follows:
root@gpu3090:~/oss/kubectl-ai# kubectl-ai --v 10 --llm-provider openai --model qwen-plus --custom-tools-config docs/sample-tools.yaml
Hey there, what can I help you with today?
>>> deploy 2048 using helm
Running: install 2048-game ./2048
The Helm CMD was lost.
Did you add a custom helm tool?
And I am also wondering if this is happening at the first/initial prompt execution?
I've been trying to reproduce this without success; can we get a log? I cannot see any pattern. I would like to focus on the Running: get pods -n web case before we dig into the helm and custom-tool issues.
Cross-posting my comment from the PR:
We will see this pattern again and again: frontier models will be more accurate than smaller or older models. Our natural instinct will be to try to patch edge cases like this. That may work in the short term, but it will be irrelevant in the long term. AI researchers and practitioners refer to this as the bitter lesson. So frontier-model performance will be the leading indicator, and it is probably safe to assume that open models will eventually follow. If this is the reality, then how should we go about it? A few things that will help:
- Comprehensive evals (k8s-bench) that show different models' performance on k8s tasks, tracked over time. When end users are choosing a model, they can look at k8s-bench and at least rule out some models, avoiding this frustration or the conclusion that kubectl-ai, or AI in general, is useless. And if we have to decide on patching one of these edge cases, evals can inform that decision as well.
- For the rare cases where we do have to patch an edge case, we can reduce the pain for us maintainers with good engineering. In this specific case, good engineering would mean, for example, a proper parser for the commands we are invoking. We have had instances where we had to filter a few commands (for example, detecting interactive or streaming commands). I was looking into it and found there is a robust bash parsing library, mvdan/sh, that can be used to inspect the tool call better and then patch it more robustly (see the sketch at the end of this comment).
- Can we make smaller models perform better for our (k8s and infra) domain? I think this is where a partnership with AI researchers will help. There are techniques such as distilling knowledge from frontier models as synthetic data and then fine-tuning the smaller models on that synthetic data. Evals will help here to verify the results.

@selimacerbas It may appear that the PR is going nowhere, but this has already helped the project in a big way IMO by helping us build a better understanding. When you build stuff at the edge (frontier), you see gaps before others see them, and these gaps are what make the difference and guide how to move the field forward (progress).
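To make the mvdan/sh idea concrete, here is a minimal sketch (not code from this repo) of parsing a generated command into a syntax tree and collecting the first word of every call, so pipelines and chained commands are inspected rather than just the first token. The helper name `commandNames` and what you do with the result (flag, fix, or reject the command) are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"

	"mvdan.cc/sh/v3/syntax"
)

// commandNames parses a shell command with mvdan/sh and returns the first
// literal word of every call expression, so "kubectl get pods | grep web"
// yields ["kubectl", "grep"]. Hypothetical helper for illustration.
func commandNames(cmd string) ([]string, error) {
	file, err := syntax.NewParser().Parse(strings.NewReader(cmd), "")
	if err != nil {
		return nil, err
	}
	var names []string
	syntax.Walk(file, func(node syntax.Node) bool {
		if call, ok := node.(*syntax.CallExpr); ok && len(call.Args) > 0 {
			if lit := call.Args[0].Lit(); lit != "" {
				names = append(names, lit)
			}
		}
		return true
	})
	return names, nil
}

func main() {
	// The truncated command from the bug report: no "kubectl" anywhere.
	names, _ := commandNames("get pods -n web | grep Running")
	fmt.Println(names) // [get grep] -> main command missing, flag or repair it
}
```

With the call names in hand, the invoker could verify that the leading command is one of the configured tool binaries (kubectl, helm, a custom tool) before executing, instead of relying on a simple string-prefix check.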
This reminds me of an essay by Paul Graham: "Live in the future, then build what's missing."