
Unable to intercept a service on AWS EKS

Open albferrari opened this issue 2 years ago • 16 comments

I'm able to connect to a cluster in the current context, but when I try to intercept an existing service running in a namespace, I get an error and the intercept session cannot be established.

To Reproduce

  1. deploy the traffic manager
telepresence helm install
  2. connect to the cluster
telepresence connect
  3. intercept a service
telepresence intercept directory --port 8084:80 --namespace afe01 --env-file directory.env

I see this error

Error: rpc error: code = DeadlineExceeded desc = request timed out while waiting for agent directory.afe01 to arrive
telepresence: error: rpc error: code = DeadlineExceeded desc = request timed out while waiting for agent directory.afe01 to arrive

telepresence_logs.zip

Expected behavior: I expect the intercept session to be created.

**Versions**

  • Output of telepresence version
telepresence version
Client: v2.7.1 (api v3)
Root Daemon: v2.7.1 (api v3)
User Daemon: v2.7.1 (api v3)
  • Operating system of workstation running telepresence commands
macOS Monterey
  • Kubernetes environment and Version [e.g. Minikube, bare metal, Google Kubernetes Engine]
Kubernetes AWS EKS

VPN-related bugs: No VPN is installed

Additional context: The Kubernetes API is reachable through an SSH tunnel.

albferrari avatar Aug 11 '22 20:08 albferrari

+1, getting the same issue and not having any luck finding an answer on this. I did not set up any of the optional RBAC for Telepresence.

I enabled debug log level, downloaded the logs, and see the following (some non-useful lines removed for readability):

Intercept command

telepresence intercept backend --port 8080:80  --docker-run -- ealen/echo-server

Telepresence status

Root Daemon: Running
  Version   : v2.7.1 (api 3)
  DNS       :
    Remote IP       : 127.0.0.1
    Exclude suffixes: [.arpa .com .io .net .org .ru]
    Include suffixes: []
    Timeout         : 8s
  Also Proxy : (0 subnets)
  Never Proxy: (2 subnets)
    - 54.201.219.52/32
    - 54.200.13.75/32
User Daemon: Running
  Version         : v2.7.1 (api 3)
  Executable      : /Users/admin/Library/Application Support/telepresence/telepresence-pro
  Install ID      : <REDACTED>
  Ambassador Cloud:
    Status    : Logged in
    User ID   : <REDACTED>
    Account ID: <REDACTED>
    Email     : <REDACTED>
  Status            : Connected
  Kubernetes server : https://<REDACTED>.gr7.us-west-2.eks.amazonaws.com
  Kubernetes context: hubofhubs-release-us-west-2
  Intercepts        : 0 total

telepresence list

backend: ready to intercept (traffic-agent not yet installed)

traffic-manager.log

Tailing the logs of the traffic-manager pod, this is all I see:

2022-08-21 19:25:24.9142 debug   httpd/conn=127.0.0.1:8081 : PrepareIntercept called : session_id="aa80ba34-78ea-456f-a2b3-08061107bc57"

There are also a lot of lines like this, but I'm not sure what to make of them... I'm on the free unlimited trial?

2022-08-21 19:41:51.6443 warning agent-configs : Issuing a systema request without ApiKey or InstallID may result in an error
2022-08-21 19:41:51.9408 warning agent-configs : Issuing a systema request without ApiKey or InstallID may result in an error
2022-08-21 19:41:52.2485 warning agent-configs : Issuing a systema request without ApiKey or InstallID may result in an error
2022-08-21 19:41:52.5482 warning agent-configs : Issuing a systema request without ApiKey or InstallID may result in an error
2022-08-21 19:41:52.8688 warning agent-configs : Issuing a systema request without ApiKey or InstallID may result in an error
2022-08-21 19:41:53.1885 warning agent-configs : Issuing a systema request without ApiKey or InstallID may result in an error

daemon.log

2022-08-21 12:06:54.0663 debug   Logging at this level "debug"
2022-08-21 12:07:49.4877 debug   daemon/session/dns/Server : LookupHost "kubernetes.default"
2022-08-21 12:07:49.5152 debug   daemon/session/dns/Server : AAAA   kubernetes.default. -> NOERROR EMPTY
2022-08-21 12:07:49.5158 debug   daemon/session/dns/Server : A      kubernetes.default. -> NOERROR kubernetes.default.	4	IN	A	172.20.0.1
2022-08-21 12:07:49.5199 debug   daemon/session/router : ++ POOL tcp 172.20.0.0:57254 -> 172.20.0.1:443, count now is 1
2022-08-21 12:07:49.5203 debug   daemon/session/router :    CON tcp 172.20.0.0:57254 -> 172.20.0.1:443, TCP window 65535 after window update
2022-08-21 12:07:49.5206 debug   daemon/session/router :    CON tcp 172.20.0.0:57254 -> 172.20.0.1:443, state IDLE -> SYN RECEIVED
2022-08-21 12:07:49.5207 debug   daemon/session/router : Opening tunnel for id tcp 172.20.0.0:57254 -> 172.20.0.1:443
2022-08-21 12:07:49.5820 debug   daemon/session/router :    CON tcp 172.20.0.0:57254 -> 172.20.0.1:443, TCP window 262144 after ack
2022-08-21 12:07:49.5821 debug   daemon/session/router :    CLI tcp 172.20.0.0:57254 -> 172.20.0.1:443, ReadLoop starting
2022-08-21 12:07:49.5822 debug   daemon/session/router :    CON tcp 172.20.0.0:57254 -> 172.20.0.1:443, state SYN RECEIVED -> ESTABLISHED
2022-08-21 12:07:49.5825 debug   daemon/session/router :    CLI tcp 172.20.0.0:57254 -> 172.20.0.1:443, WriteLoop starting
2022-08-21 12:07:49.6924 debug   daemon/session/router :    CON tcp 172.20.0.0:57254 -> 172.20.0.1:443, state ESTABLISHED -> TIMED WAIT
2022-08-21 12:07:49.7087 debug   daemon/session/router :    CLI tcp 172.20.0.0:57254 -> 172.20.0.1:443, ReadLoop ended
2022-08-21 12:07:49.7087 debug   daemon/session/router :    CON tcp 172.20.0.0:57254 -> 172.20.0.1:443 TCP window is zero
2022-08-21 12:07:50.6938 debug   daemon/session/router :    CON tcp 172.20.0.0:57254 -> 172.20.0.1:443, state TIMED WAIT -> IDLE
2022-08-21 12:07:50.6939 debug   daemon/session/router :    CLI tcp 172.20.0.0:57254 -> 172.20.0.1:443, WriteLoop ended

connector.log

2022-08-21 12:14:01.8814 debug   connector/session/List-0 : replacing ReplicaSet backend-5fccb8d949.ted5b32-telepresence-matwerber1-main, with owner Deployment backend
2022-08-21 12:14:01.8815 debug   connector/session/List-0 : replacing ReplicaSet backend-5fd65c7df.ted5b32-telepresence-matwerber1-main, with owner Deployment backend
2022-08-21 12:14:01.8816 debug   connector/session/List-0 : replacing ReplicaSet backend-7969bcdcb6.ted5b32-telepresence-matwerber1-main, with owner Deployment backend
2022-08-21 12:14:01.8816 debug   connector/session/List-0 : Getting info for Deployment backend.ted5b32-telepresence-matwerber1-main, matching service backend.ted5b32-telepresence-matwerber1-main
2022-08-21 12:14:29.2198 error   connector/session/reverse-conn/client : goroutine "/connector/session/reverse-conn/client" exited with error: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: NO_ERROR
2022-08-21 12:14:29.2212 info    connector/session/reverse-conn:shutdown_logger : shutting down (gracefully)...
2022-08-21 12:14:29.2225 debug   connector/session/reverse-conn/server : goroutine "/connector/session/reverse-conn/server" exited without error
2022-08-21 12:14:29.2226 info    connector/session/reverse-conn:shutdown_status :   final goroutine statuses:
2022-08-21 12:14:29.2226 info    connector/session/reverse-conn:shutdown_status :     /connector/session/reverse-conn/client: exited with error
2022-08-21 12:14:29.2227 info    connector/session/reverse-conn:shutdown_status :     /connector/session/reverse-conn/server: exited without error
2022-08-21 12:14:29.2227 debug   connector/session/reverse-conn : reverse-conn waiting 200ms before retrying after error: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: NO_ERROR

matwerber1 avatar Aug 21 '22 19:08 matwerber1

Also, this raises a feature request idea: I would really like to see logs and troubleshooting tips in the Ambassador UI.

matwerber1 avatar Aug 21 '22 19:08 matwerber1

So two things I noticed:

  1. I'll try adjusting timeouts using the Telepresence config file described here: https://github.com/telepresenceio/telepresence.io/blob/master/docs/v2.4/reference/config.md

  2. Despite explicitly configuring ~/.kube/config with the "extensions" section described in these docs, I checked the file and noticed that section of the config had disappeared. It's the second time I've observed this; the first time I thought "huh, I must not have saved it". I'm not sure whether it's Telepresence, some sort of VS Code extension, or something else changing it... but I'm going to try fswatch and see if I can find out what's going on (see the sketch after the config snippet below).

  extensions:
  - name: telepresence.io
    extension:
      manager:
        namespace: ambassador
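
A minimal fswatch invocation for that kind of check might look like this (a sketch, assuming fswatch is installed, e.g. via Homebrew; it just prints a timestamped line with the event flags every time the kubeconfig changes):

# print a timestamped line with event flags whenever ~/.kube/config changes
fswatch -x ~/.kube/config | while read path events; do
  echo "$(date)  $path  $events"
done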

matwerber1 avatar Aug 21 '22 20:08 matwerber1

We do have a weekly live help session for Telepresence with Support on Tuesdays at 2:30pm ET. If you'd like to attend please let me know. I need an email for the calendar invite which you can email to me at [email protected].

cindymullins-dw avatar Aug 22 '22 22:08 cindymullins-dw

Based on some advice provided in this slack thread: https://datawire-oss.slack.com/archives/CAUBBJSQZ/p1661549637194529

That thread, which describes this problem pretty much exactly except on a GCP cluster, includes a reference to some GCP documentation for modifying a firewall within the cluster to allow for some additional traffic to flow between the control plane and the nodes.

I tried to apply the instructions in the linked article to our EKS configuration but without any luck. According to the EKS docs when setting up the cluster security:

The default rules allow all traffic to flow freely between your cluster and nodes, and allows all outbound traffic to any destination

But I tried adding an additional rule for port 8443 of the cluster CIDR (as described in the GCP instructions) to all of the security groups involved, and it didn't make a difference in the outcome.
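
For reference, the kind of rule I tried can be expressed with the AWS CLI roughly like this (a sketch only; the security-group ID and CIDR are placeholders that vary per EKS setup):

# allow TCP 8443 from the cluster CIDR into each security group involved
aws ec2 authorize-security-group-ingress \
  --group-id sg-PLACEHOLDER \
  --protocol tcp \
  --port 8443 \
  --cidr CLUSTER_CIDR_PLACEHOLDER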

It would be helpful to know exactly what request or message is not getting through. Are there any diagnostic tests we can do to help narrow down where the problem is?

sgilhooly avatar Sep 01 '22 02:09 sgilhooly

@matwerber1 @sgilhooly it would be interesting to see the complete logs from the traffic-manager after doing telepresence loglevel debug and while trying to intercept. The "Issuing a systema request without ApiKey or InstallID may result in an error" warning shouldn't matter at all (I have a PR up to remove that warning).

thallgren avatar Sep 01 '22 10:09 thallgren

I restarted the traffic-manager deployment (to restart the log file clean). I then ran

telepresence connect
telepresence list
telepresence intercept notify --port 9190 --env-file tele-notify.env

When I remove all of the "issuing a systema request" lines from the traffic-manager log file, here's what's left:

2022-09-01 19:01:37.7648 info    Traffic Manager v2.7.1 [pid:1]
2022-09-01 19:01:37.7654 info    unable to load license: error reading license: open /home/telepresence/license: no such file or directory
2022-09-01 19:01:37.7842 info    Using DNS IP from kube-dns.kube-system
2022-09-01 19:01:37.7904 info    Using cluster domain "cluster.local."
2022-09-01 19:01:37.7936 info    Extracting service subnet 172.20.0.0/16 from create service error message
2022-09-01 19:01:37.7937 info    Using podCIDRStrategy: auto
2022-09-01 19:01:37.7952 info    prometheus : Prometheus metrics server not started
2022-09-01 19:01:37.7956 info    agent-injector : Loading ConfigMaps from []
2022-09-01 19:01:37.7957 info    agent-injector : Mutating webhook service is listening on :8443
2022-09-01 19:01:37.8940 info    Scanning 25 nodes
2022-09-01 19:01:37.8940 info    No subnets found
2022-09-01 19:01:37.8940 error   Unable to derive subnets from nodes
2022-09-01 19:01:38.0952 info    Deriving subnets from IPs of pods
2022-09-01 19:01:38.7964 info    agent-configs : Started watcher for ConfigMap telepresence-agents cluster wide
2022-09-01 19:01:38.7964 info    agent-configs : Started watcher for Services cluster wide
2022-09-01 19:01:38.8193 info    agent-configs : Successfully rolled out notify.arosa
2022-09-01 19:01:38.8387 info    agent-configs : Successfully rolled out directory.afe01
2022-09-01 19:02:17.4460 debug   Logging at this level "debug"
2022-09-01 19:02:18.4823 debug   httpd/conn=127.0.0.1:8081 : LookupHost called tel2-recursion-check.kube-system : session_id="b5634c06-cccf-4801-9e09-3a56801526c9"
2022-09-01 19:02:18.4903 debug   httpd/conn=127.0.0.1:8081 : LookupHost on traffic-manager: tel2-recursion-check.kube-system -> NOT FOUND : session_id="b5634c06-cccf-4801-9e09-3a56801526c9"
2022-09-01 19:02:25.8426 debug   httpd/conn=127.0.0.1:8081 : WatchAgentsNS called : session_id="b5634c06-cccf-4801-9e09-3a56801526c9"
2022-09-01 19:02:27.2097 debug   agent-configs : Regenerating config entry for Deployment notify.arosa
2022-09-01 19:02:28.0958 debug   podWatcher calling updateSubnets with [10.19.128.0/17]
2022-09-01 19:03:03.0957 debug   podWatcher calling updateSubnets with [10.19.128.0/17]
2022-09-01 19:03:13.7914 debug   httpd/conn=127.0.0.1:8081 : PrepareIntercept called : session_id="b5634c06-cccf-4801-9e09-3a56801526c9"

sgilhooly avatar Sep 01 '22 19:09 sgilhooly

My guess is that something prevents our mutating webhook from calling the agent-injector service. That service is served by the traffic-manager and the agent injection should be logged just after the "PrepareIntercept called...". A normal debug log looks like this:

2022-09-02 21:33:07.3904 debug   httpd/conn=127.0.0.1:8081 : PrepareIntercept called : session_id="efb7c300-23de-4394-8fa8-3380dbd3a611"
2022-09-02 21:33:07.4126 debug   agent-configs : MODIFIED telepresence-agents.default
2022-09-02 21:33:07.4127 debug   agent-configs : add my-service.my-namespace
2022-09-02 21:33:07.4214 info    agent-configs : Successfully rolled out my-service.my-namespace
2022-09-02 21:33:07.4454 debug   agent-injector/conn=10.42.0.179:8443 : Received webhook request...
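
If that webhook request never shows up after "PrepareIntercept called", a couple of generic checks can help confirm whether the API server can reach the agent-injector at all (a sketch; the ambassador namespace below comes from this thread and may differ in your install):

# is the agent-injector service present next to the traffic-manager?
kubectl get svc -n ambassador
# is the mutating webhook configuration registered?
kubectl get mutatingwebhookconfigurations | grep -i agent-injector
# any webhook or FailedCreate errors while the workload rolls out?
kubectl get events -n YOUR_NAMESPACE --sort-by=.lastTimestamp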

thallgren avatar Sep 02 '22 21:09 thallgren

Logs

Re: logs, I'll see if I can dig them up (not sure I kept them), but FWIW, I think I looked carefully through each file with debug-level logging and don't think I saw anything else relevant.

Resolution

In any case, I did find the problem: my namespace had a memory quota, but the sidecar injected by Telepresence into the new deployment did not specify a limit.

I first used k9s to poke around the target namespace and saw that Telepresence had created a new deployment (as expected) with the traffic-agent sidecar, but noted that the deployment was failing (stuck at 0/1 desired pods).

I ran the following command to watch a real-time list of events happening within the namespace: kubectl get events --namespace SOME_NAMESPACE -w

and saw: Error creating: pods api-xxxx is forbidden: failed quota: quota: must specify memory


We use a SaaS environment management platform (our own) to, among other things, abstract and simplify K8s deployments. I had missed the fact that we apply a default resource quota to namespaces created by our platform.

I grabbed the list of quotas in the namespace via: kubectl get resourcequota --namespace YOUR_NAMESPACE, saw there was only one quota named quota, and deleted it with kubectl delete resourcequota NAME_OF_QUOTA --namespace SOME_NAMESPACE.

After that, things started working :)
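
(If deleting the quota isn't an option, a possible alternative, sketched here under the assumption that the quota only enforces memory limits and with a placeholder namespace, is a LimitRange that gives containers default memory settings, so an injected sidecar that omits resources still satisfies the quota:)

cat <<'EOF' | kubectl apply -n SOME_NAMESPACE -f -
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits
spec:
  limits:
  - type: Container
    default:           # limit applied to containers that don't set one
      memory: 256Mi
    defaultRequest:    # request applied to containers that don't set one
      memory: 128Mi
EOF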

Feature suggestion

It seems like the CLI will return success as long as the new deployment is created, but (correct me if I'm wrong) there doesn't seem to be a way to check whether the pods are up and running with telepresence status or similar. I also didn't see any indication of this in the Ambassador UI; everything seemed ready to go. If that's correct, adding some sort of check around pod availability for the telepresence-ready deployment might help the user experience (a manual version of that check is sketched below).
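
A manual version of that check, assuming the intercepted workload is a Deployment named backend carrying an app=backend label (both placeholders), would be something like:

# wait for the rolled-out pods to become ready, or fail within the timeout
kubectl rollout status deployment/backend -n SOME_NAMESPACE --timeout=60s
# inspect the pods the rollout produced
kubectl get pods -n SOME_NAMESPACE -l app=backend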

Possible for us to customize the deployment manifest?

I was wondering: how does the template for the modified deployment with the traffic agent get defined? And is there an easy way I can customize that template to include a memory resource limit?

It's easy enough to manually remove the namespace quota or add a limit to a deployment after the fact, but ideally we could bake that in somewhere when deploying Telepresence?

matwerber1 avatar Sep 02 '22 21:09 matwerber1

Oh, that's interesting. I like your suggestion. We should verify that the pods are created OK and warn when they are not, and if possible, try to figure out why that is.

I think we should also detect the presence of quotas and tell the user if they block pod creation, and provide a way to specify the limit for injected sidecars.

thallgren avatar Sep 04 '22 02:09 thallgren

@thallgren: I would have thought that the mutating webhook would trigger after something has been mutated, but I don't see any activity in Kubernetes indicating any mutations; of course, I don't know exactly what to look for either, so it may well exist. When I run the telepresence intercept command, what does it do that would trigger the webhook?

sgilhooly avatar Sep 06 '22 16:09 sgilhooly

The quota issue is addressed in #2773.

thallgren avatar Sep 08 '22 09:09 thallgren

@sgilhooly The mutating webhook is triggered when a workload is rolled out (typically after modifying or creating a deployment). Its task is to make the pods comply with the modification, and our agent-injector takes part in that task. When you perform an intercept, a check is made whether the traffic-agent config exists in the telepresence-agents config-map. If it doesn't, an entry is generated and stored in that config-map. This, in turn, triggers a watcher that discovers the new entry and performs a roll-out of the associated workload. That triggers the mutating webhook, and hence, the agent injection.

If you don't see any events reported from kubectl get events when you perform an intercept, then that's probably because the entry already exists in the telepresence-agents config-map (you can check this by doing kubectl describe configmap telepresence-agents). A simple way to remove the entry if that's the case, is to do telepresence uninstall --agent <name of workload>. That will force a new injection attempt.
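
A minimal way to walk that chain from the outside (workload and namespace names are placeholders) might be:

# does the config-map already contain an entry for the workload?
kubectl describe configmap telepresence-agents -n YOUR_NAMESPACE
# if it does, remove the entry to force a fresh injection attempt
telepresence uninstall --agent YOUR_WORKLOAD
# then watch the roll-out and any webhook-related events while intercepting
kubectl get events -n YOUR_NAMESPACE -w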

thallgren avatar Sep 08 '22 09:09 thallgren

I did notice today that running the telepresence intercept command causes some activity on the deployment I'm trying to intercept. Basically, it re-deploys the workload as if it did a "rollout restart deploy", with no new sidecar or any other noticeable difference in the newly restarted workload.

I also tried upgrading from 2.7.2 to 2.7.3 and I do now see some additional logs in the traffic-manager pod:

2022-09-09 02:50:16.6336 debug   httpd/conn=127.0.0.1:8081 : PrepareIntercept called : session_id="b2e3d767-47e8-4425-bff7-021079959985"
2022-09-09 02:50:16.8589 debug   agent-configs : MODIFIED telepresence-agents.mynspace
2022-09-09 02:50:16.8589 debug   agent-configs : add login.mynspace
2022-09-09 02:50:16.8735 info    agent-configs : Successfully rolled out login.mynspace

But on the client side I'm still getting the same error:

telepresence: error: rpc error: code = DeadlineExceeded desc = request timed out while waiting for agent login.mynspace to arrive

sgilhooly avatar Sep 09 '22 03:09 sgilhooly

+1, same issue here on a private GKE cluster. Following the instructions to allow traffic to tcp/8443 did not seem to make any difference. The memory quota issue does not apply (I tested in a new namespace with no quotas and a basic echo app). A new deployment or pod is never created, although I do see some volume activity around kube-api-access, and Telepresence annotations are added/removed during the intercept and uninstall commands. However, the traffic-agent sidecar is never installed and the deadline-exceeded error still happens.

Note that if I manually inject the sidecar and update the service, intercepts do work. Something is just not working correctly when the sidecar is injected automatically.

Lastly, I tried increasing all of the timeouts documented in the Telepresence docs to 2m. However, the timeout still happens after about 30 seconds, which I found odd. I don't know whether this particular wait-for-agent timeout is covered, but agentInstall didn't seem to do it. Probably unrelated to the root cause, but just something I noticed.

rholloway avatar Sep 10 '22 12:09 rholloway

+1, got the same issue with v2.7.1 & v2.7.6 clients against a v2.7.1 server.

Note: I had to downgrade both the client- and server-side components to v2.5.8 to get it to work.

nurhun avatar Sep 21 '22 16:09 nurhun

I've pretty much hit a dead end here. I don't know what additional information to gather, what else to try, or where else to look. It seems like something here is swallowing an error, but I'm probably going to have to spend a great deal of time getting familiar with the source code to figure anything else out.

sgilhooly avatar Sep 21 '22 18:09 sgilhooly

Please try to add a firewall rule on the cluster and on the corresponding network so TCP traffic is allowed on port 8443. That's the port that the traffic-manager container uses to serve webhooks. This solved the issue for me on GKE, and I was getting exactly the same error!
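
On GKE a rule along those lines can be created like this (a sketch; the rule name, network, and control-plane CIDR are placeholders, and the master CIDR is shown on the cluster details page):

# allow the GKE control plane to reach the nodes on TCP 8443 (the webhook port)
gcloud compute firewall-rules create allow-traffic-manager-webhook \
  --network NETWORK_NAME \
  --allow tcp:8443 \
  --source-ranges MASTER_IPV4_CIDR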

ipeychev avatar Sep 22 '22 12:09 ipeychev

@ipeychev: Thanks for the tip, but this was the first thing we looked into when we ran into this issue (after hearing others' success with it). On AWS/EKS, though, the default security group configuration already lets all traffic pass freely between the control plane and the nodes, so it doesn't appear to be the exact same issue. Based on what I can decipher from the logs, it also does appear that the webhook is being received by the traffic manager, so the remaining mystery is why the sidecar is not getting installed.

sgilhooly avatar Sep 22 '22 13:09 sgilhooly

Telepresence 2.7.6 has been released. We'd love for users who encountered the "Deadline Exceeded" error to try it and provide feedback.

cindymullins-dw avatar Sep 23 '22 16:09 cindymullins-dw

I get the same outcome on 2.7.6. The logs have a few additional entries, but otherwise the result is the same:

Command:
$>  telepresence version
Client: v2.7.6 (api v3)
Root Daemon: v2.7.6 (api v3)
User Daemon: v2.7.6 (api v3)

$> telepresence intercept login --port 9190 --env-file tele-login.env
telepresence: error: rpc error: code = DeadlineExceeded desc = request timed out while waiting for agent login.mynamespace to arrive

See logs for details (1 error found): "/Users/seangilhooly/Library/Logs/telepresence/connector.log"
If you think you have encountered a bug, please run `telepresence gather-logs` and attach the telepresence_logs.zip to your github issue or create a new one: https://github.com/telepresenceio/telepresence/issues/new?template=Bug_report.md .
Traffic-manager debug log:
2022-09-23 18:33:50.8599 info    agent-configs : Successfully rolled out login.mynamespace
2022-09-23 18:33:54.1502 debug   httpd/conn=127.0.0.1:8081 : PrepareIntercept called : session_id="dffd2c77-c3f1-43ec-aff6-bafdf5757676"
2022-09-23 18:33:54.1767 debug   agent-configs : MODIFIED telepresence-agents.mynamespace
2022-09-23 18:33:54.1767 debug   agent-configs : add login.mynamespace
2022-09-23 18:33:54.1902 info    agent-configs : Successfully rolled out login.mynamespace
2022-09-23 18:33:58.0473 error   httpd/conn=127.0.0.1:8081 : Readiness probe failed: Get "http://10.19.188.130:80/status/": dial tcp 10.19.188.130:80: connect: connection refused : session_id="dffd2c77-c3f1-43ec-aff6-bafdf5757676"
2022-09-23 18:34:01.1096 error   httpd/conn=127.0.0.1:8081 : Readiness probe failed: Get "http://10.19.223.58:80/status/": dial tcp 10.19.223.58:80: connect: connection refused : session_id="dffd2c77-c3f1-43ec-aff6-bafdf5757676"
2022-09-23 18:34:03.7923 debug   podWatcher calling updateSubnets with [10.19.128.0/17]
2022-09-23 18:34:06.8101 error   httpd/conn=127.0.0.1:8081 : Liveness probe failed: Get "http://10.19.188.130:80/status/": EOF : session_id="dffd2c77-c3f1-43ec-aff6-bafdf5757676"
2022-09-23 18:34:06.8193 error   httpd/conn=127.0.0.1:8081 : Readiness probe failed: Get "http://10.19.188.130:80/status/": EOF : session_id="dffd2c77-c3f1-43ec-aff6-bafdf5757676"

And, as before, the "deadline exceeded" error returns after 30 seconds despite the timeout settings I put in my config.yml:

$> cat  ~/Library/Application\ Support/telepresence/config.yml
timeouts:
  agentInstall: 2m
  intercept: 120s
logLevels:
  userDaemon: trace
  rootDaemon: debug

Also, as before, I see the deployment I'm trying to intercept gets a rollout restart, but no sidecar is injected.

sgilhooly avatar Sep 23 '22 19:09 sgilhooly

@sgilhooly, thanks for testing! It seems your traffic manager is failing liveness and readiness probes. The traffic manager is what injects the traffic agent, so it seems there's something up with the traffic manager. Can you share the EVENTS of the traffic-manager pod? Does it show these same errors or any additional information?

cindymullins-dw avatar Oct 19 '22 17:10 cindymullins-dw

@cindymullins-dw: thanks for looking into it. I upgraded to the latest telepresence version and tried again to gather the information you requested. I don't see any complaints or interesting info from the traffic manager (but perhaps I misunderstood your request). After reproducing the problem, here are all the events I get from the traffic manager:

$> kubectl get event --namespace ambassador --field-selector involvedObject.name=traffic-manager-577f4c895c-r8695
LAST SEEN   TYPE     REASON      OBJECT                                 MESSAGE
12m         Normal   Scheduled   pod/traffic-manager-577f4c895c-r8695   Successfully assigned ambassador/traffic-manager-577f4c895c-r8695 to ip-10-19-253-217.ec2.internal
12m         Normal   Pulling     pod/traffic-manager-577f4c895c-r8695   Pulling image "docker.io/datawire/tel2:2.8.2"
12m         Normal   Pulled      pod/traffic-manager-577f4c895c-r8695   Successfully pulled image "docker.io/datawire/tel2:2.8.2" in 1.331321165s
12m         Normal   Created     pod/traffic-manager-577f4c895c-r8695   Created container traffic-manager
12m         Normal   Started     pod/traffic-manager-577f4c895c-r8695   Started container traffic-manager

The new version of telepresence has additional information in the error message:

$> telepresence intercept login --port 9190 --env-file tele-login.env
Launching Telepresence Root Daemon
Need root privileges to run: ~/tools/telep/bin/telepresence daemon-foreground ~/Library/Logs/telepresence '~/Library/Application Support/telepresence'
Password:
telepresence: error: request timed out while waiting for agent login.mynamespace to arrive: Events that may be relevant:
AGE   TYPE      REASON      OBJECT                       MESSAGE
21s   Warning   Unhealthy   pod/login-78d9df5746-v9mr8   Readiness probe failed: Get "http://10.19.233.42:80/status/": dial tcp 10.19.233.42:80: connect: connection refused
14s   Warning   Unhealthy   pod/login-78d9df5746-v9mr8   Readiness probe failed: Get "http://10.19.233.42:80/status/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
13s   Warning   Unhealthy   pod/login-78d9df5746-v9mr8   Liveness probe failed: Get "http://10.19.233.42:80/status/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

sgilhooly avatar Oct 21 '22 03:10 sgilhooly

I had the same issue:

Error: rpc error: code = DeadlineExceeded desc = request timed out while waiting for agent xxxx.xxx to arrive

I was able to solve it by:

  1. changing the namespace of the traffic manager (under ~/.kube/config):
    extensions:
    - name: telepresence.io
      extension:
        manager:
          namespace: ambassador-new
  2. telepresence helm install
  3. telepresence intercept my-service --port 7654:7654 --env-file my-service_env.txt

kobybm avatar Nov 09 '22 22:11 kobybm

Happy to report that after upgrading to version v2.9.2 this problem seems to have been fixed. Intercepting a service was successful (although I had a follow-on problem getting Telepresence to recognize/support macFUSE, which I had to work through separately). I no longer see any timeout issue, and the intercept completes successfully and rather quickly now.

sgilhooly avatar Nov 21 '22 15:11 sgilhooly

I'm facing the same issue with version 2.9.4. When I try to intercept a service I'm getting this:

$ telepresence intercept security-authorization -n flex --port 8080:api
telepresence intercept: error: connector.CreateIntercept: request timed out while waiting for agent security-authorization.flex to arrive: Events that may be relevant:
AGE   TYPE      REASON      OBJECT                                        MESSAGE
30s   Warning   Unhealthy   pod/security-authorization-57c84fdbb6-4dcp2   Readiness probe failed: HTTP probe failed with statuscode: 503
30s   Warning   Unhealthy   pod/security-authorization-57c84fdbb6-4dcp2   Readiness probe failed: HTTP probe failed with statuscode: 503
27s   Warning   Unhealthy   pod/security-authorization-57c84fdbb6-4dcp2   Readiness probe failed: Get "http://192.168.183.179:15020/app-health/security-authorization/readyz": dial tcp 192.168.183.179:15020: connect: connection refused
30s   Warning   Unhealthy   pod/security-authorization-57c84fdbb6-4dcp2   Readiness probe failed: HTTP probe failed with statuscode: 503
14s   Warning   Unhealthy   pod/security-authorization-6d6c5d6d8c-gkkmz   Readiness probe failed: HTTP probe failed with statuscode: 503
14s   Warning   Unhealthy   pod/security-authorization-6d6c5d6d8c-gkkmz   Readiness probe failed: HTTP probe failed with statuscode: 503
14s   Warning   Unhealthy   pod/security-authorization-6d6c5d6d8c-gkkmz   Readiness probe failed: HTTP probe failed with statuscode: 503

An interesting thing: if I change my traffic-manager deployment to have hostNetwork: true (a sketch of that patch is at the end of this comment), intercepting starts to work. But with this, if I try to intercept with just --local-only, DNS doesn't resolve cluster services. This is the log printed by the traffic-manager with hostNetwork: true while intercepting local-only:

2022-12-06 20:25:01.7112 debug   httpd/conn=127.0.0.1:8081 : LookupDNS legacy-gateway.flex.svc.cluster.local. A : session_id="5f51dec8-838f-43a0-baa3-8c54bfcf6801"
2022-12-06 20:25:01.7118 debug   httpd/conn=127.0.0.1:8081 : LookupDNS on traffic-manager: legacy-gateway.flex.svc.cluster.local. A -> NXDOMAIN : session_id="5f51dec8-838f-43a0-baa3-8c54bfcf6801"
2022-12-06 20:25:01.7121 debug   httpd/conn=127.0.0.1:8081 : LookupDNS legacy-gateway.flex.svc.cluster.local. AAAA : session_id="5f51dec8-838f-43a0-baa3-8c54bfcf6801"
2022-12-06 20:25:01.7124 debug   httpd/conn=127.0.0.1:8081 : LookupDNS on traffic-manager: legacy-gateway.flex.svc.cluster.local. AAAA -> NXDOMAIN : session_id="5f51dec8-838f-43a0-baa3-8c54bfcf6801"

If I remove hostNetwork, --local-only starts to work again, but the service intercept stops working again.
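
For reference, a minimal sketch of the hostNetwork change described above, assuming the traffic-manager is deployed in the ambassador namespace (adjust to wherever yours lives):

# run the traffic-manager pods on the host network
kubectl -n ambassador patch deployment traffic-manager --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/hostNetwork", "value": true}]'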

betonetotbo avatar Dec 06 '22 20:12 betonetotbo

I think this ticket is all over the place now. It seems to act as a bin for all problems related to timeouts when intercepting, some of which have been successfully resolved. I'm therefore closing this ticket. If you still have problems with "request timed out while waiting for agent xxx.xxx to arrive", please open a new ticket, and let us judge whether the tickets are duplicates.

thallgren avatar Dec 06 '22 20:12 thallgren