Unable to intercept a service on AWS EKS
I'm able to connect to the cluster in the current context, but when I try to intercept an existing service running in a namespace I get an error and the intercept session cannot be established.
**To Reproduce**
- Deploy the traffic manager: `telepresence helm install`
- Connect to the cluster: `telepresence connect`
- Intercept a service: `telepresence intercept directory --port 8084:80 --namespace afe01 --env-file directory.env`

I see this error:
Error: rpc error: code = DeadlineExceeded desc = request timed out while waiting for agent directory.afe01 to arrive
telepresence: error: rpc error: code = DeadlineExceeded desc = request timed out while waiting for agent directory.afe01 to arrive
**Expected behavior**: I expect the intercept session to be created.
**Versions**
- Output of `telepresence version`:
Client: v2.7.1 (api v3)
Root Daemon: v2.7.1 (api v3)
User Daemon: v2.7.1 (api v3)
- Operating system of workstation running `telepresence` commands: macOS Monterey
- Kubernetes environment and version: AWS EKS
VPN-related bugs: No VPN is installed
**Additional context**: The Kubernetes API is reachable through an SSH tunnel.
+1, getting the same issue and not having any luck finding an answer on this. I did not set up any of the optional RBAC for Telepresence.
I enabled debug log level, downloaded the logs, and see the following (some non-useful lines removed for readability):
Intercept command
telepresence intercept backend --port 8080:80 --docker-run -- ealen/echo-server
Telepresence status
Root Daemon: Running
Version : v2.7.1 (api 3)
DNS :
Remote IP : 127.0.0.1
Exclude suffixes: [.arpa .com .io .net .org .ru]
Include suffixes: []
Timeout : 8s
Also Proxy : (0 subnets)
Never Proxy: (2 subnets)
- 54.201.219.52/32
- 54.200.13.75/32
User Daemon: Running
Version : v2.7.1 (api 3)
Executable : /Users/admin/Library/Application Support/telepresence/telepresence-pro
Install ID : <REDACTED>
Ambassador Cloud:
Status : Logged in
User ID : <REDACTED>
Account ID: <REDACTED>
Email : <REDACTED>
Status : Connected
Kubernetes server : https://<REDACTED>.gr7.us-west-2.eks.amazonaws.com
Kubernetes context: hubofhubs-release-us-west-2
Intercepts : 0 total
telepresence status
backend: ready to intercept (traffic-agent not yet installed)
traffic-manager.log
Tailing the logs of the traffic-manager pod, this is all I see:
2022-08-21 19:25:24.9142 debug httpd/conn=127.0.0.1:8081 : PrepareIntercept called : session_id="aa80ba34-78ea-456f-a2b3-08061107bc57"
There are also a lot of lines like this, but I'm not sure what to make of it... I'm on the free unlimited trial?
2022-08-21 19:41:51.6443 warning agent-configs : Issuing a systema request without ApiKey or InstallID may result in an error
2022-08-21 19:41:51.9408 warning agent-configs : Issuing a systema request without ApiKey or InstallID may result in an error
2022-08-21 19:41:52.2485 warning agent-configs : Issuing a systema request without ApiKey or InstallID may result in an error
2022-08-21 19:41:52.5482 warning agent-configs : Issuing a systema request without ApiKey or InstallID may result in an error
2022-08-21 19:41:52.8688 warning agent-configs : Issuing a systema request without ApiKey or InstallID may result in an error
2022-08-21 19:41:53.1885 warning agent-configs : Issuing a systema request without ApiKey or InstallID may result in an error
daemon.log
2022-08-21 12:06:54.0663 debug Logging at this level "debug"
2022-08-21 12:07:49.4877 debug daemon/session/dns/Server : LookupHost "kubernetes.default"
2022-08-21 12:07:49.5152 debug daemon/session/dns/Server : AAAA kubernetes.default. -> NOERROR EMPTY
2022-08-21 12:07:49.5158 debug daemon/session/dns/Server : A kubernetes.default. -> NOERROR kubernetes.default. 4 IN A 172.20.0.1
2022-08-21 12:07:49.5199 debug daemon/session/router : ++ POOL tcp 172.20.0.0:57254 -> 172.20.0.1:443, count now is 1
2022-08-21 12:07:49.5203 debug daemon/session/router : CON tcp 172.20.0.0:57254 -> 172.20.0.1:443, TCP window 65535 after window update
2022-08-21 12:07:49.5206 debug daemon/session/router : CON tcp 172.20.0.0:57254 -> 172.20.0.1:443, state IDLE -> SYN RECEIVED
2022-08-21 12:07:49.5207 debug daemon/session/router : Opening tunnel for id tcp 172.20.0.0:57254 -> 172.20.0.1:443
2022-08-21 12:07:49.5820 debug daemon/session/router : CON tcp 172.20.0.0:57254 -> 172.20.0.1:443, TCP window 262144 after ack
2022-08-21 12:07:49.5821 debug daemon/session/router : CLI tcp 172.20.0.0:57254 -> 172.20.0.1:443, ReadLoop starting
2022-08-21 12:07:49.5822 debug daemon/session/router : CON tcp 172.20.0.0:57254 -> 172.20.0.1:443, state SYN RECEIVED -> ESTABLISHED
2022-08-21 12:07:49.5825 debug daemon/session/router : CLI tcp 172.20.0.0:57254 -> 172.20.0.1:443, WriteLoop starting
2022-08-21 12:07:49.6924 debug daemon/session/router : CON tcp 172.20.0.0:57254 -> 172.20.0.1:443, state ESTABLISHED -> TIMED WAIT
2022-08-21 12:07:49.7087 debug daemon/session/router : CLI tcp 172.20.0.0:57254 -> 172.20.0.1:443, ReadLoop ended
2022-08-21 12:07:49.7087 debug daemon/session/router : CON tcp 172.20.0.0:57254 -> 172.20.0.1:443 TCP window is zero
2022-08-21 12:07:50.6938 debug daemon/session/router : CON tcp 172.20.0.0:57254 -> 172.20.0.1:443, state TIMED WAIT -> IDLE
2022-08-21 12:07:50.6939 debug daemon/session/router : CLI tcp 172.20.0.0:57254 -> 172.20.0.1:443, WriteLoop ended
connector.log
2022-08-21 12:14:01.8814 debug connector/session/List-0 : replacing ReplicaSet backend-5fccb8d949.ted5b32-telepresence-matwerber1-main, with owner Deployment backend
2022-08-21 12:14:01.8815 debug connector/session/List-0 : replacing ReplicaSet backend-5fd65c7df.ted5b32-telepresence-matwerber1-main, with owner Deployment backend
2022-08-21 12:14:01.8816 debug connector/session/List-0 : replacing ReplicaSet backend-7969bcdcb6.ted5b32-telepresence-matwerber1-main, with owner Deployment backend
2022-08-21 12:14:01.8816 debug connector/session/List-0 : Getting info for Deployment backend.ted5b32-telepresence-matwerber1-main, matching service backend.ted5b32-telepresence-matwerber1-main
2022-08-21 12:14:29.2198 error connector/session/reverse-conn/client : goroutine "/connector/session/reverse-conn/client" exited with error: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: NO_ERROR
2022-08-21 12:14:29.2212 info connector/session/reverse-conn:shutdown_logger : shutting down (gracefully)...
2022-08-21 12:14:29.2225 debug connector/session/reverse-conn/server : goroutine "/connector/session/reverse-conn/server" exited without error
2022-08-21 12:14:29.2226 info connector/session/reverse-conn:shutdown_status : final goroutine statuses:
2022-08-21 12:14:29.2226 info connector/session/reverse-conn:shutdown_status : /connector/session/reverse-conn/client: exited with error
2022-08-21 12:14:29.2227 info connector/session/reverse-conn:shutdown_status : /connector/session/reverse-conn/server: exited without error
2022-08-21 12:14:29.2227 debug connector/session/reverse-conn : reverse-conn waiting 200ms before retrying after error: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: NO_ERROR
Also, this raises a feature request idea: I would really like to see logs and troubleshooting tips in the Ambassador UI.
So, two things I noticed:
- I'll try adjusting timeouts using the Telepresence config file described here: https://github.com/telepresenceio/telepresence.io/blob/master/docs/v2.4/reference/config.md (a sketch of that timeouts section follows after this list).
- Despite explicitly configuring `~/.kube/config` with the "extensions" section described in these docs, I checked the file and noticed that section of the config had disappeared. It's the second time I've observed this; the first time I thought "huh, I must not have saved it". I'm not sure if it's Telepresence, some sort of VS Code extension, or something else changing it, but I'm going to try fswatch and see if I can find out what's going on.
extensions:
- name: telepresence.io
  extension:
    manager:
      namespace: ambassador
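A minimal sketch of what that timeouts section could look like, assuming the macOS config path shown later in this thread; the values here are examples only, not recommendations:

```yaml
# ~/Library/Application Support/telepresence/config.yml
timeouts:
  agentInstall: 2m   # how long to wait for the traffic-agent to be installed
  intercept: 30s     # how long to wait for the intercept itself to be established
```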
We do have a weekly live help session for Telepresence with Support on Tuesdays at 2:30pm ET. If you'd like to attend please let me know. I need an email for the calendar invite which you can email to me at [email protected].
Based on some advice provided in this Slack thread: https://datawire-oss.slack.com/archives/CAUBBJSQZ/p1661549637194529
That thread, which describes this problem pretty much exactly except on a GCP cluster, includes a reference to some GCP documentation for modifying a firewall within the cluster to allow for some additional traffic to flow between the control plane and the nodes.
I tried to apply the instructions in the linked article to our EKS configuration but without any luck. According to the EKS docs when setting up the cluster security:
The default rules allow all traffic to flow freely between your cluster and nodes, and allows all outbound traffic to any destination
But I tried adding an additional rule for port 8443 of the cluster CIDR (as described in the GCP instructions) to all of the security groups involved, and it didn't make a difference in the outcome.
It would be helpful to know exactly what request or message is not getting through. Are there any diagnostic tests we can run to help narrow down where the problem is?
@matwerber1 @sgilhooly it would be interesting to see the complete logs from the traffic-manager after doing `telepresence loglevel debug` and while trying to intercept. The "Issuing a systema request without ApiKey or InstallID may result in an error" warning shouldn't matter at all (I have a PR up to remove that warning).
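For anyone gathering the same information, the sequence would look roughly like this (a sketch; the ambassador namespace is the default install location and may differ in your setup):

```shell
# Raise the log level on the client daemons and the traffic-manager
telepresence loglevel debug

# Tail the traffic-manager log while reproducing the failing intercept
kubectl -n ambassador logs deploy/traffic-manager -f

# Or bundle the client and cluster logs into a zip to attach to the issue
telepresence gather-logs
```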
I restarted the traffic-manager deployment (to start with a clean log file). I then ran:
telepresence connect
telepresence list
telepresence intercept notify --port 9190 --env-file tele-notify.env
When I remove all of the "issuing a systema request" lines from the traffic-manager log file, here's what's left:
2022-09-01 19:01:37.7648 info Traffic Manager v2.7.1 [pid:1]
2022-09-01 19:01:37.7654 info unable to load license: error reading license: open /home/telepresence/license: no such file or directory
2022-09-01 19:01:37.7842 info Using DNS IP from kube-dns.kube-system
2022-09-01 19:01:37.7904 info Using cluster domain "cluster.local."
2022-09-01 19:01:37.7936 info Extracting service subnet 172.20.0.0/16 from create service error message
2022-09-01 19:01:37.7937 info Using podCIDRStrategy: auto
2022-09-01 19:01:37.7952 info prometheus : Prometheus metrics server not started
2022-09-01 19:01:37.7956 info agent-injector : Loading ConfigMaps from []
2022-09-01 19:01:37.7957 info agent-injector : Mutating webhook service is listening on :8443
2022-09-01 19:01:37.8940 info Scanning 25 nodes
2022-09-01 19:01:37.8940 info No subnets found
2022-09-01 19:01:37.8940 error Unable to derive subnets from nodes
2022-09-01 19:01:38.0952 info Deriving subnets from IPs of pods
2022-09-01 19:01:38.7964 info agent-configs : Started watcher for ConfigMap telepresence-agents cluster wide
2022-09-01 19:01:38.7964 info agent-configs : Started watcher for Services cluster wide
2022-09-01 19:01:38.8193 info agent-configs : Successfully rolled out notify.arosa
2022-09-01 19:01:38.8387 info agent-configs : Successfully rolled out directory.afe01
2022-09-01 19:02:17.4460 debug Logging at this level "debug"
2022-09-01 19:02:18.4823 debug httpd/conn=127.0.0.1:8081 : LookupHost called tel2-recursion-check.kube-system : session_id="b5634c06-cccf-4801-9e09-3a56801526c9"
2022-09-01 19:02:18.4903 debug httpd/conn=127.0.0.1:8081 : LookupHost on traffic-manager: tel2-recursion-check.kube-system -> NOT FOUND : session_id="b5634c06-cccf-4801-9e09-3a56801526c9"
2022-09-01 19:02:25.8426 debug httpd/conn=127.0.0.1:8081 : WatchAgentsNS called : session_id="b5634c06-cccf-4801-9e09-3a56801526c9"
2022-09-01 19:02:27.2097 debug agent-configs : Regenerating config entry for Deployment notify.arosa
2022-09-01 19:02:28.0958 debug podWatcher calling updateSubnets with [10.19.128.0/17]
2022-09-01 19:03:03.0957 debug podWatcher calling updateSubnets with [10.19.128.0/17]
2022-09-01 19:03:13.7914 debug httpd/conn=127.0.0.1:8081 : PrepareIntercept called : session_id="b5634c06-cccf-4801-9e09-3a56801526c9"
My guess is that something prevents our mutating webhook from calling the agent-injector service. That service is served by the traffic-manager and the agent injection should be logged just after the "PrepareIntercept called...". A normal debug log looks like this:
2022-09-02 21:33:07.3904 debug httpd/conn=127.0.0.1:8081 : PrepareIntercept called : session_id="efb7c300-23de-4394-8fa8-3380dbd3a611"
2022-09-02 21:33:07.4126 debug agent-configs : MODIFIED telepresence-agents.default
2022-09-02 21:33:07.4127 debug agent-configs : add my-service.my-namespace
2022-09-02 21:33:07.4214 info agent-configs : Successfully rolled out my-service.my-namespace
2022-09-02 21:33:07.4454 debug agent-injector/conn=10.42.0.179:8443 : Received webhook request...
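A rough way to check that the webhook is registered and that the agent-injector service is reachable (a hedged sketch; the agent-injector name and the ambassador namespace are taken from the logs above and may differ per install):

```shell
# Confirm a mutating webhook registration pointing at the agent-injector exists
kubectl get mutatingwebhookconfigurations | grep -i agent-injector

# Confirm the agent-injector service exists and has endpoints on the traffic-manager (port 8443)
kubectl -n ambassador get svc agent-injector -o wide
kubectl -n ambassador get endpoints agent-injector
```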
Logs
Re: logs, I'll see if I can dig them up; I'm not sure I kept them. But FWIW, I think I looked carefully through each file with debug-level logging and don't think I saw anything else relevant.
Resolution
In any case, I did find the problem: my namespace had a memory quota, but the sidecar injected by Telepresence into the new deployment did not specify a limit.
I first used k9s to poke around the target namespace and saw that Telepresence had created a new deployment (as expected) with the traffic-agent sidecar, but noted that the deployment was failing (stuck at 0/1 desired pods).
I ran the following command to watch a real-time list of events happening within the namespace: `kubectl get events --namespace SOME_NAMESPACE -w`
and saw: `Error creating: pods api-xxxx is forbidden: failed quota: quota: must specify memory`
We use a SaaS environment management platform (our own) to, among other things, abstract and simplify K8s deployments. I had missed the fact that we apply a default resource quota to namespaces created by our platform.
I grabbed the list of quotas in the namespace via `kubectl get resourcequota --namespace YOUR_NAMESPACE`, saw there was only one quota (named `quota`), and deleted it with `kubectl delete resourcequota NAME_OF_QUOTA --namespace SOME_NAMESPACE`.
After that, things started working :)
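If the quota has to stay, a LimitRange that gives containers in the namespace default memory requests/limits should also let the injected sidecar satisfy the quota (a hedged sketch; the name, namespace, and values are placeholders):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-memory-limits    # placeholder name
  namespace: SOME_NAMESPACE      # replace with the intercepted workload's namespace
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: 128Mi            # default memory request for containers that omit one
      default:
        memory: 256Mi            # default memory limit for containers that omit one
```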
Feature suggestion
It seems like the CLI will return success as long as the new deployment is created, but (correct me if wrong) it doesn't seem like there's a way to check whether the pods are up and running with `telepresence status` or similar? I also didn't see any indication of this in the Ambassador UI; everything seemed ready to go. If that's correct, adding some sort of check around pod availability for the Telepresence-modified deployment might help the user experience.
Possible for us to customize the deployment manifest?
I was wondering... how does the template for the modified deployment with the traffic agent get defined? Is there an easy way I can customize that template to include a memory resource limit?
It's easy enough to manually remove the namespace quota or add a limit to a deployment after the fact, but ideally we could bake that in somewhere when deploying Telepresence?
Oh, that's interesting. I like your suggestion. We should verify that the pods are created OK and warn when they are not, and if possible, try to figure out why that is.
I think we should also detect the presence of quotas and tell the user if they block pod creation, and provide a way to specify the limit for injected sidecars.
@thallgren: I would have thought that the mutating webhook would trigger after something has been mutated. But I don't see any activity in Kubernetes indicating any mutations -- of course, I don't know what to look for either, so it's very likely that it exists. When I run the `telepresence intercept` command, what does that do that would trigger the webhook?
The quota issue is addressed in #2773.
@sgilhooly The mutating webhook is triggered when a workload is rolled out (typically after modifying or creating a deployment). Its task is to make the pods comply with the modification. Our agent-injector takes part in that task. When you perform an intercept, a check is made to see whether the traffic-agent config exists in the telepresence-agents config-map. If it doesn't, an entry is generated and stored in that config-map. This, in turn, triggers a watcher that discovers the new entry and performs a roll-out of the associated workload. That triggers the mutating webhook, and hence, the agent injection.
If you don't see any events reported from `kubectl get events` when you perform an intercept, then that's probably because the entry already exists in the telepresence-agents config-map (you can check this with `kubectl describe configmap telepresence-agents`). A simple way to remove the entry, if that's the case, is to run `telepresence uninstall --agent <name of workload>`. That will force a new injection attempt.
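Putting those steps together (a sketch; the workload name, namespace, and port are placeholders):

```shell
# See whether the workload already has an entry in the telepresence-agents config-map
kubectl -n <namespace> describe configmap telepresence-agents

# If it does, remove the agent entry to force a fresh injection attempt
telepresence uninstall --agent <name-of-workload>

# Then retry the intercept and watch for rollout events in the namespace
telepresence intercept <name-of-workload> --port 8080:80
kubectl -n <namespace> get events -w
```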
I did notice today that running the telepresence intercept command causes some activity on the deployment I'm trying to intercept. Basically, it re-deploys the workload as if it did a "rollout restart deploy".
I also tried upgrading from 2.7.2 to 2.7.3 and I do now see some additional logs in the traffic-manager pod:
2022-09-09 02:50:16.6336 debug httpd/conn=127.0.0.1:8081 : PrepareIntercept called : session_id="b2e3d767-47e8-4425-bff7-021079959985"
2022-09-09 02:50:16.8589 debug agent-configs : MODIFIED telepresence-agents.mynspace
2022-09-09 02:50:16.8589 debug agent-configs : add login.mynspace
2022-09-09 02:50:16.8735 info agent-configs : Successfully rolled out login.mynspace
But on the client side I'm still getting the same error:
telepresence: error: rpc error: code = DeadlineExceeded desc = request timed out while waiting for agent login.mynspace to arrive
+1, same issue here for a private GKE cluster. Following the instructions to allow traffic to tcp/8443 did not seem to make any difference. The memory quota issue does not apply (I tested in a new namespace with no quotas, on a basic echo app). A new deployment or pod is never created, although I do see some volume changes around kube-api-access, and the telepresence annotations are added/removed during the intercept and uninstall commands. However, the traffic-agent sidecar is never installed and the deadline exceeded error happens.
Note that if I manually inject the sidecar and update the service, intercepts do work. Something is just not working correctly with automatic injection.
Lastly, I tried increasing all documented timeouts to 2m per the Telepresence docs. However, the timeout still happens after about 30 seconds, which I found odd. I don't know if this particular "waiting for agent" timeout is covered, but agentInstall didn't seem to do it. Probably unrelated to the root cause, but just something I noticed.
+1, got the same issue with v2.7.1 & v2.7.6 clients against a v2.7.1 server.
Note: I had to downgrade both the client- and server-side components to v2.5.8 to get it working.
I've pretty much hit a dead end here. I don't know what additional information to gather, what else to try, or where else to look. It seems like something here is swallowing up an error but I'm probably going to have to spend a great deal of time getting familiar with the source code to figure anything else out.
Please try to add a firewall rule on the cluster and on the corresponding network so tcp traffic is allowed on port 8443. That's the port that the traffic-manager container uses to serve webhooks. This solved the issue for me on GKE, and I was getting exactly the same error!
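For GKE private clusters, that firewall rule can be created roughly like this (a hedged sketch; the cluster, zone, network, CIDR, and node tag values are placeholders you need to look up for your own cluster):

```shell
# Look up the control-plane CIDR of the private cluster
gcloud container clusters describe CLUSTER_NAME --zone ZONE \
  --format='value(privateClusterConfig.masterIpv4CidrBlock)'

# Allow the control plane to reach the webhook served by the traffic-manager on tcp/8443
gcloud compute firewall-rules create allow-telepresence-webhook \
  --network NETWORK_NAME \
  --direction INGRESS \
  --action ALLOW \
  --rules tcp:8443 \
  --source-ranges MASTER_CIDR \
  --target-tags NODE_TAG
```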
@ipeychev: Thanks for the tip, but this was the first thing we looked into when we ran into this issue (after hearing others' success with it). On AWS/EKS, the default security group configuration already lets all traffic pass freely between the control plane and the nodes, so it doesn't appear to be the exact same issue. Based on what I can decipher from the logs, it also does appear that the webhook is being received by the traffic manager -- so the remaining mystery is why the sidecar is not getting installed.
Telepresence 2.7.6 has been released. We'd love for users who encountered the "Deadline Exceeded" error to try it and provide feedback.
I get the same outcome on 2.7.6. The logs have a few additional entries, but otherwise the result is the same:
Command:
$> telepresence version
Client: v2.7.6 (api v3)
Root Daemon: v2.7.6 (api v3)
User Daemon: v2.7.6 (api v3)
$> telepresence intercept login --port 9190 --env-file tele-login.env
telepresence: error: rpc error: code = DeadlineExceeded desc = request timed out while waiting for agent login.mynamespace to arrive
See logs for details (1 error found): "/Users/seangilhooly/Library/Logs/telepresence/connector.log"
If you think you have encountered a bug, please run `telepresence gather-logs` and attach the telepresence_logs.zip to your github issue or create a new one: https://github.com/telepresenceio/telepresence/issues/new?template=Bug_report.md .
Traffic-manager debug log:
2022-09-23 18:33:50.8599 info agent-configs : Successfully rolled out login.mynamespace
2022-09-23 18:33:54.1502 debug httpd/conn=127.0.0.1:8081 : PrepareIntercept called : session_id="dffd2c77-c3f1-43ec-aff6-bafdf5757676"
2022-09-23 18:33:54.1767 debug agent-configs : MODIFIED telepresence-agents.mynamespace
2022-09-23 18:33:54.1767 debug agent-configs : add login.mynamespace
2022-09-23 18:33:54.1902 info agent-configs : Successfully rolled out login.mynamespace
2022-09-23 18:33:58.0473 error httpd/conn=127.0.0.1:8081 : Readiness probe failed: Get "http://10.19.188.130:80/status/": dial tcp 10.19.188.130:80: connect: connection refused : session_id="dffd2c77-c3f1-43ec-aff6-bafdf5757676"
2022-09-23 18:34:01.1096 error httpd/conn=127.0.0.1:8081 : Readiness probe failed: Get "http://10.19.223.58:80/status/": dial tcp 10.19.223.58:80: connect: connection refused : session_id="dffd2c77-c3f1-43ec-aff6-bafdf5757676"
2022-09-23 18:34:03.7923 debug podWatcher calling updateSubnets with [10.19.128.0/17]
2022-09-23 18:34:06.8101 error httpd/conn=127.0.0.1:8081 : Liveness probe failed: Get "http://10.19.188.130:80/status/": EOF : session_id="dffd2c77-c3f1-43ec-aff6-bafdf5757676"
2022-09-23 18:34:06.8193 error httpd/conn=127.0.0.1:8081 : Readiness probe failed: Get "http://10.19.188.130:80/status/": EOF : session_id="dffd2c77-c3f1-43ec-aff6-bafdf5757676"
And, as before, the “deadline exceeded” error returns after 30 seconds despite the timeout settings I put in my config.yml:
$> cat ~/Library/Application\ Support/telepresence/config.yml
timeouts:
  agentInstall: 2m
  intercept: 120s
logLevels:
  userDaemon: trace
  rootDaemon: debug
Also, as before, I see the deployment I'm trying to intercept gets a rollout restart, but no sidecar is injected.
@sgilhooly, thanks for testing! It seems your traffic manager is failing liveness and readiness probes. The Traffic Manager will be injecting the Traffic Agent, so it seems there's something up with the traffic manager. Can you share the EVENTS of the Traffic Manager pod - does it show these same errors or any additional information?
@cindymullins-dw: thanks for looking into it. I upgraded to the latest telepresence version and tried again to gather the information you requested. I don't see any complaints or interesting info from the traffic manager (but perhaps I misunderstood your request). After reproducing the problem, here are all the events I get from the traffic manager:
$> kubectl get event --namespace ambassador --field-selector involvedObject.name=traffic-manager-577f4c895c-r8695
LAST SEEN TYPE REASON OBJECT MESSAGE
12m Normal Scheduled pod/traffic-manager-577f4c895c-r8695 Successfully assigned ambassador/traffic-manager-577f4c895c-r8695 to ip-10-19-253-217.ec2.internal
12m Normal Pulling pod/traffic-manager-577f4c895c-r8695 Pulling image "docker.io/datawire/tel2:2.8.2"
12m Normal Pulled pod/traffic-manager-577f4c895c-r8695 Successfully pulled image "docker.io/datawire/tel2:2.8.2" in 1.331321165s
12m Normal Created pod/traffic-manager-577f4c895c-r8695 Created container traffic-manager
12m Normal Started pod/traffic-manager-577f4c895c-r8695 Started container traffic-manager
The new version of telepresence has additional information in the error message:
$> telepresence intercept login --port 9190 --env-file tele-login.env
Launching Telepresence Root Daemon
Need root privileges to run: ~/tools/telep/bin/telepresence daemon-foreground ~/Library/Logs/telepresence '~/Library/Application Support/telepresence'
Password:
telepresence: error: request timed out while waiting for agent login.mynamespace to arrive: Events that may be relevant:
AGE TYPE REASON OBJECT MESSAGE
21s Warning Unhealthy pod/login-78d9df5746-v9mr8 Readiness probe failed: Get "http://10.19.233.42:80/status/": dial tcp 10.19.233.42:80: connect: connection refused
14s Warning Unhealthy pod/login-78d9df5746-v9mr8 Readiness probe failed: Get "http://10.19.233.42:80/status/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
13s Warning Unhealthy pod/login-78d9df5746-v9mr8 Liveness probe failed: Get "http://10.19.233.42:80/status/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
I had the same issue:
Error: rpc error: code = DeadlineExceeded desc = request timed out while waiting for agent xxxx.xxx to arrive
Was able to solve by:
- Changing the namespace of the traffic manager (under `~/.kube/config`):
extensions:
- name: telepresence.io
  extension:
    manager:
      namespace: ambassador-new
- `telepresence helm install`
- `telepresence intercept my-service --port 7654:7654 --env-file my-service_env.txt`
Happy to report that after upgrading to version v2.9.2, this problem seems to have been fixed. Intercepting a service was successful (although I had a follow-on problem getting Telepresence to recognize/support macFUSE, which I had to work through separately). I no longer see any timeout issue, and the intercept completes successfully and rather quickly now.
I'm facing the same issue with version 2.9.4. When I try to intercept a service I'm getting this:
$ telepresence intercept security-authorization -n flex --port 8080:api
telepresence intercept: error: connector.CreateIntercept: request timed out while waiting for agent security-authorization.flex to arrive: Events that may be relevant:
AGE TYPE REASON OBJECT MESSAGE
30s Warning Unhealthy pod/security-authorization-57c84fdbb6-4dcp2 Readiness probe failed: HTTP probe failed with statuscode: 503
30s Warning Unhealthy pod/security-authorization-57c84fdbb6-4dcp2 Readiness probe failed: HTTP probe failed with statuscode: 503
27s Warning Unhealthy pod/security-authorization-57c84fdbb6-4dcp2 Readiness probe failed: Get "http://192.168.183.179:15020/app-health/security-authorization/readyz": dial tcp 192.168.183.179:15020: connect: connection refused
30s Warning Unhealthy pod/security-authorization-57c84fdbb6-4dcp2 Readiness probe failed: HTTP probe failed with statuscode: 503
14s Warning Unhealthy pod/security-authorization-6d6c5d6d8c-gkkmz Readiness probe failed: HTTP probe failed with statuscode: 503
14s Warning Unhealthy pod/security-authorization-6d6c5d6d8c-gkkmz Readiness probe failed: HTTP probe failed with statuscode: 503
14s Warning Unhealthy pod/security-authorization-6d6c5d6d8c-gkkmz Readiness probe failed: HTTP probe failed with statuscode: 503
An interesting thing: if I change my traffic-manager deployment to have `hostNetwork: true`, the intercept starts to work (see the patch sketch at the end of this comment). But with this, if I try to intercept with `--local-only`, DNS doesn't resolve cluster services. This is the log printed in the traffic-manager with `hostNetwork: true` while intercepting local-only:
2022-12-06 20:25:01.7112 debug httpd/conn=127.0.0.1:8081 : LookupDNS legacy-gateway.flex.svc.cluster.local. A : session_id="5f51dec8-838f-43a0-baa3-8c54bfcf6801"
2022-12-06 20:25:01.7118 debug httpd/conn=127.0.0.1:8081 : LookupDNS on traffic-manager: legacy-gateway.flex.svc.cluster.local. A -> NXDOMAIN : session_id="5f51dec8-838f-43a0-baa3-8c54bfcf6801"
2022-12-06 20:25:01.7121 debug httpd/conn=127.0.0.1:8081 : LookupDNS legacy-gateway.flex.svc.cluster.local. AAAA : session_id="5f51dec8-838f-43a0-baa3-8c54bfcf6801"
2022-12-06 20:25:01.7124 debug httpd/conn=127.0.0.1:8081 : LookupDNS on traffic-manager: legacy-gateway.flex.svc.cluster.local. AAAA -> NXDOMAIN : session_id="5f51dec8-838f-43a0-baa3-8c54bfcf6801"
If I remove the `hostNetwork` setting, `--local-only` starts to work again, but the service intercept stops working again.
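For reference, the `hostNetwork` toggle described above can be applied and reverted with patches like these (a hedged sketch; the traffic-manager deployment name and the ambassador namespace are assumed from a default install):

```shell
# Run the traffic-manager with hostNetwork: true
kubectl -n ambassador patch deployment traffic-manager --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/hostNetwork","value":true}]'

# Revert to the default pod network
kubectl -n ambassador patch deployment traffic-manager --type=json \
  -p='[{"op":"remove","path":"/spec/template/spec/hostNetwork"}]'
```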
I think this ticket is all over the place now. It seems to act as a bin for all problems related to timeouts when intercepting, some of which have been successfully resolved. I'm therefore closing this ticket. If you still have problems with "request timed out while waiting for agent xxx.xxx to arrive", please open a new ticket. Please let us judge whether the tickets are duplicates.