holmesgpt
holmesgpt copied to clipboard
`holmes ask` does not follow strictly the tsg provided as prompt
What happened?
I created TSG prompt as expect the holmes ask to follow the TSG to debug the DNS resolution issue in my cluster by
holmes ask "detect why the k8s pod client under namespace test-ns cannot resolve dns" -f /home/azureuser/llm/demo/dns_troubleshooting_instructions.md
......
Running tool kubectl_get_by_kind_in_namespace: kubectl get --show-labels -o wide pods -n kube-system tools.py:143
Running tool kubectl_get_by_kind_in_namespace: kubectl get --show-labels -o wide svc -n kube-system tools.py:143
Running tool trace_dns_gadget: fetch DNS queries and responses of pod test-ns/client for 30 seconds tools.py:143
tool_calling_llm.call - completed in 2 iterations - 83741ms performance_timing.py:41
AI: DNS resolution from pod client in namespace test-ns is failing due to queries not receiving responses from the DNS service (kube-dns) at ClusterIP
10.0.0.10. The DNS trace shows repeated outgoing queries for nginx-service.test-ns.test-ns.svc.cluster.local. without any responses.
Key findings:
• CoreDNS pods (coredns-57d886c994-8h9gt, coredns-57d886c994-vqmrz) are running normally in the kube-system namespace.
• DNS service (kube-dns) is correctly defined with ClusterIP 10.0.0.10 and ports 53/UDP and 53/TCP.
• DNS queries from the pod client are sent correctly to the DNS service IP but no responses are received.
Possible causes:
1 NetworkPolicy blocking DNS traffic to/from the DNS pods.
2 Misconfiguration in CoreDNS causing it not to respond.
Next steps:
• Check NetworkPolicies in the cluster to ensure DNS traffic (port 53 UDP/TCP) is allowed.
• Review CoreDNS configuration for potential misconfigurations.
Refer to the official Kubernetes DNS debugging guide for detailed troubleshooting:
• Main guide: Kubernetes DNS Debugging
• CoreDNS specific: CoreDNS Customization
the command stops after several iterations before finding the root case which can be detected if following the TSG.
Besides the TSG as prompt, I also created a toolset inspect gadget to collect DNS trace
What did you expect to happen?
holmes ask does not follow the TSG to detect the asked error and report how to fix the issue
How can we reproduce it (as minimally and precisely as possible)?
I tried the command on gpt-40 and gpt-4.5, both failed as described earlier.
holmes ask "detect why the k8s pod client under namespace test-ns cannot resolve dns" -f /home/azureuser/llm/demo/dns_troubleshooting_instructions.md
Anything else we need to know?
No response
Hi @mainred, thank you very much for reporting. We're looking into it! Is there an easy way to reproduce the DNS error on my own cluster to test this?
basically I create a network policy to block the dns request from the client pod. the specs can be found here.
cat block-dns.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-egress
namespace: test-ns
spec:
podSelector: {}
policyTypes:
- Egress
egress: []
---
apiVersion: v1
kind: Pod
metadata:
name: client
namespace: test-ns
labels:
app.kubernetes.io/name: client
spec:
containers:
- image: mainred/client:v2
imagePullPolicy: Always
name: test-client
---
apiVersion: v1
kind: Pod
metadata:
name: nginx
namespace: test-ns
labels:
app.kubernetes.io/name: server
spec:
containers:
- name: nginx
image: nginx:stable
ports:
- containerPort: 80
name: http-web-svc
---
apiVersion: v1
kind: Service
metadata:
name: nginx-service
namespace: test-ns
spec:
selector:
app.kubernetes.io/name: server
ports:
- name: name-of-service-port
protocol: TCP
port: 80
targetPort: http-web-svc