
Requirement for a read-only mode

Open RaiAnandKr opened this issue 7 months ago • 7 comments

Hi folks, First of all, congratulations on building a cool and useful tool. I have been playing with it for a couple of weeks now (thanks to @hakman for introducing this to me).

I want to run it in a security sensitive context and hence was wondering about a read-only mode. This (I think) also ties with the --quiet flag we have.

  • Currently, even with the quiet flag, the tool ends up asking the interactive yes/no question when it wants to run a command that can edit resources. This isn't surprising, given we aren't leveraging this flag in the core conversation logic or system prompt.
  • Slightly unrelated, but I also think the quiet mode should prompt the LLM to provide a more comprehensive/exhaustive analysis, since it's a one-shot invocation of the tool. Thoughts?

So if we want quiet to be truly non-interactive, the options are:

  • the tool never performs edits (which becomes the read-only mode). At best, it can "suggest" the edit command.
  • the tool performs edit operations "assuming" the user's consent. If we go with this, we might need another flag for a read-only mode.
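The first option implies the tool must classify each proposed command before executing it. A minimal sketch of such a check, using a plain verb allow-list (illustrative only, not kubectl-ai's actual implementation):

```shell
# Illustrative only: the kind of verb allow-list a read-only mode would
# need before executing a proposed kubectl command. Not kubectl-ai's code.
classify() {
  case "$1" in
    get|describe|logs|top|events|explain|api-resources|version)
      echo "read-only" ;;
    *)
      echo "mutating" ;;   # apply, patch, delete, scale, edit, ...
  esac
}

classify get    # → read-only
classify patch  # → mutating
```

A real implementation would also need to handle subcommands like `kubectl exec` or `kubectl cp`, which are neither cleanly read-only nor simple edits.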

cc @droot @hakman

RaiAnandKr avatar May 29 '25 11:05 RaiAnandKr

Thanks @RaiAnandKr, good feedback.

You are right that the key idea of quiet mode is that you can run kubectl-ai non-interactively. That is only useful if the task doesn't require modifying any resources (read-only); if it does require modifying resources, you must run it with the --skip-permissions flag.

In cases where you are sure the task can be done with read-only operations alone, the agent somehow opting for write operations is essentially a bug from the end user's perspective. We can definitely improve the instructions (prompt) when the session is non-interactive (and more generally as well) to steer toward that behavior. Can you share concrete examples where you observed the odd behavior? I would like to add a few evals for these scenarios and track them as we tweak the prompt.

Also, can you share more details about the use-case behind "I want to run it in a security sensitive context" (if it is okay to share)?

droot avatar May 30 '25 02:05 droot

And thanks @hakman for spreading the love :)

droot avatar May 30 '25 02:05 droot

In cases where you are sure the task can be done with read-only operations alone, the agent somehow opting for write operations

No, that's not the case. The agent/LLM is correct in its own right. For example:

 kubectl-ai --llm-provider azopenai --model gpt-4o --quiet   "My service - 'broken-pod-service' is returning 502 Bad Gateway for clients"
  Running: kubectl get svc broken-pod-service -o json
......
Key findings:

  1. Both Pods belonging to  broken-pod-deployment  are reporting the reason:
  "0/1 nodes are available: 1 node(s) didn't match Pod's node
  affinity/selector" . This indicates the Pods can't be scheduled because
  their node selectors ( nonexistent-label-key=nonexistent-label-value ) don't
  match any node.
  2. The Deployment has zero available replicas ( 2 desired | 2 updated | 2
  total | 0 available | 2 unavailable ), which is why the Service returns a
  502 Bad Gateway .

  To fix this issue:

  1. The node selector needs to be corrected or removed. By removing the nonexistent node selector, the Pods will be scheduled on any available node.

  I will proceed to remove the invalid node selector from broken-pod-deployment.


  Running: kubectl patch deployment broken-pod-deployment --patch '{"spec":{"template":{"spec":{"nodeSelector":null}}}}'

  Do you want to proceed ?
  1) Yes
  2) Yes, and don't ask me again
  3) No
  Enter your choice (1,2,3):

The LLM is correct in judging that this patch command is a potential fix. It's just that users don't expect this interactive behavior with the --quiet flag. That's why I said I am not surprised by this behavior after looking at the source code, but we should probably improve it.

I can reproduce the same issue with gpt-4.1 as well, although gpt-4.1 was more conservative and didn't go all the way with its analysis (I could reproduce this much more frequently with gpt-4.1 than with gpt-4o):

  Running: kubectl get service broken-pod-service -o yaml

  Running: kubectl get endpoints broken-pod-service -o yaml

  Running: kubectl get pods -l app=broken-pod -o wide


  Your Service (broken-pod-service) is returning 502 Bad Gateway because:

  • No endpoints are available for the Service. This means there are no
  running pods backing the service.
  • The pods (broken-pod-deployment-6946dd66cb-42zd4 and broken-pod-deployment-6946dd66cb-8bjrq) are in Pending state and have not started.
  • Both pods have no IP address (not scheduled and running), so the service
  endpoints list is empty.

  502 Bad Gateway is typical when a Service can't find any healthy backend pod
  to route traffic to.

  Next Steps to resolve:

  1. Investigate why your pods are stuck in Pending. The common reasons:
    • No available nodes to schedule on
    • Insufficient resources on nodes (CPU/memory)
    • Node selectors, affinity, or taints/tolerations issues
    • Problems with persistent volumes (if used)
  2. Fix the root cause so pods move to Running state; once one becomes Ready,
  the Service will be able to route traffic.

  Would you like me to investigate why your pods are stuck in Pending?

and this made me raise the other point: Slightly unrelated, but I also think the quiet mode should prompt the LLM to provide a more comprehensive/exhaustive analysis since it's a one-shot invocation of the tool. Thoughts?

RaiAnandKr avatar May 30 '25 06:05 RaiAnandKr

You are right that the key idea of quiet mode is that you can run kubectl-ai non-interactively. That is only useful if the task doesn't require modifying any resources (read-only); if it does require modifying resources, you must run it with the --skip-permissions flag.

umm, if I run with --quiet, the tool should run in non-interactive mode in all scenarios IMO. If the task requires modifying a resource (and I haven't passed the --skip-permissions flag), the tool can at best suggest the edit/patch command (I won't mind if it doesn't suggest the exact command either), but it shouldn't run any edit/patch command. If I have used the --skip-permissions flag with --quiet, then sure, the tool can run the edit/patch command without asking. Should it be on the user to preempt that a debugging task might require modifying resources down the line, and hence invoke with --skip-permissions? (I don't think it's trivial for the user to preempt.) What do you think? Or, the other option is that --quiet implies --skip-permissions by default, in which case we would need something new for a read-only mode?
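To summarize the proposal (hypothetical semantics, not what the tool currently does):

```shell
# Hypothetical flag semantics proposed in this thread, NOT current behavior:

kubectl-ai --quiet "why is broken-pod-service returning 502s?"
#   runs read-only commands only; suggests, but never executes, edits

kubectl-ai --quiet --skip-permissions "fix broken-pod-service"
#   non-interactive AND may execute edit/patch commands without asking
```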

RaiAnandKr avatar May 30 '25 06:05 RaiAnandKr

Also, can you share more details about the use-case behind "I want to run it in a security sensitive context" (if it is okay to share)?

Not sure I can divulge much, but essentially we want to enhance diagnostics in our clusters, and we want to start with a read-only mode. Also, we will be letting the users of our platform use this, and they aren't k8s experts by any means, so we want to be careful and provide a read-only mode.
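Independent of any kubectl-ai flag, a read-only posture can also be enforced at the cluster level with RBAC, so even a misbehaving agent cannot mutate anything. A sketch using Kubernetes' built-in view ClusterRole (the diagnostics-bot service account name is hypothetical):

```shell
# Enforce read-only at the API-server level via RBAC.
# "diagnostics-bot" is a hypothetical service account name.
kubectl create serviceaccount diagnostics-bot -n default
kubectl create clusterrolebinding diagnostics-bot-view \
  --clusterrole=view \
  --serviceaccount=default:diagnostics-bot
# Any mutating request made with this account's credentials is rejected by
# the API server, regardless of what the agent tries to run.
```

Note that the built-in `view` role also withholds Secret contents, which may or may not suit a diagnostics use-case.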

RaiAnandKr avatar May 30 '25 07:05 RaiAnandKr

Now, I get it :) thanks for sharing the agent chats, these are very insightful.

  • Some models are too eager to fix the issue even when asked to just root-cause it.
  • Some models stop early even when asked to root-cause the issue; like you said, being too conservative and waiting for permission to go ahead.

I have also noticed a behavior where, when I ask them to fix something, they sometimes go ahead with the first idea they get instead of exploring 3-4 possibilities, gathering more information, reflecting on it, and then taking the action. For context, this is a scenario in one of our evals (fix-service-routing), where we have a deployment whose labels don't match the service object's selector. Some models decide to launch a new deployment matching the service's label selector, some decide to update the service's label selector, and some decide to update the label in the deployment.
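For reference, the mismatch in that scenario is visible with two read-only commands (resource names here are illustrative, not the eval's actual fixtures):

```shell
# Compare the Service's selector with the Deployment's pod-template labels.
# Resource names are illustrative.
kubectl get service my-service -o jsonpath='{.spec.selector}'
kubectl get deployment my-deployment \
  -o jsonpath='{.spec.template.metadata.labels}'
# If the two maps don't overlap, the Service matches no pods, its endpoints
# list stays empty, and any of the three fixes above restores the match.
```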

I have also noticed that the thinking models (e.g. gemini 2.5 pro) seem to get this right more often than the non-thinking ones. Would it be possible for you to also run the above scenario with o3?

I also think the quiet mode should prompt the LLM to provide a more comprehensive/exhaustive analysis since it's an one-shot invocation of the tool.

I get the reasoning now. I think we probably want a deep think (deep research) mode. I would argue we need that not just for --quiet mode, but by default, because in our domain (infra/ops) we want the agent to be reliable and robust all the time, and the benefits of deep thinking outweigh its cost.

umm, if I run with --quiet, the tool should run in non-interactive mode in all scenarios IMO. If the task requires modifying a resource (and I haven't passed the --skip-permissions flag), the tool can at best suggest the edit/patch command (I won't mind if it doesn't suggest the exact command either), but it shouldn't run any edit/patch command. If I have used the --skip-permissions flag with --quiet, then sure, the tool can run the edit/patch command without asking.

I think this is very reasonable expected behavior. Users invoking the agent in non-interactive mode and then being forced to interact is inconsistent and confusing.

Should it be on the user to preempt that a debugging task might require modifying resources down the line, and hence invoke with --skip-permissions? (I don't think it's trivial for the user to preempt.) What do you think? Or, the other option is that --quiet implies --skip-permissions by default, in which case we would need something new for a read-only mode?

To be honest, I am still a bit confused about the non-interactive and approval-mode (--skip-permissions) workflow.

Not sure I can divulge much, but essentially we want to enhance diagnostics in our clusters, and we want to start with a read-only mode. Also, we will be letting the users of our platform use this, and they aren't k8s experts by any means, so we want to be careful and provide a read-only mode.

Diagnostics is a perfect use-case for the agent, and I think we have barely scratched the surface here, so I would like to go deeper on the CUJs for troubleshooting and nail those down. As a first step, I would like to capture these in a set of evals and especially track the model failures before we tweak any prompts. Would you be interested in contributing to the troubleshooting evals? We added an --additional-prompts option, and I think we can use that to test out some quick ideas around a deep think mode.

droot avatar May 30 '25 16:05 droot

Now, I get it :) thanks for sharing the agent chats, these are very insightful.

No worries. I could only share the ones I am running in my local demo cluster. I have been running the agent on actual incidents too, and it's been interesting to see the agent function (sometimes exceeding my expectations and sometimes falling short). Our k8s footprint is huge, and I am expecting to hit a lot more issues and requirements with kubectl-ai, so I am hoping to collaborate and contribute to this awesome tool.

To be honest, I am still a bit confused about the non-interactive and approval-mode (--skip-permissions) workflow.

yeah, I also got more confused as the discussion progressed. I will think this through and come back on this thread with a proposal.

Would you be interested in contributing to the troubleshooting evals ?

Sure thing.

RaiAnandKr avatar May 31 '25 05:05 RaiAnandKr