
TELEPRESENCE_ROOT doesn't contain any of the mounted secrets

Open miquella opened this issue 3 years ago • 7 comments

Describe the bug

When attempting to intercept a service, the $TELEPRESENCE_ROOT directory is empty and doesn't contain any of the pod's mounts.

During a previous debugging session, the question came up whether this could be related to the pod's securityContext, since it was failing to create symlinks for the secrets (v2.4.3 or v2.4.4 was in use at the time). However, those log messages no longer appear in v2.4.5.

In case it's relevant, the securityContext used in our pods is as follows:

  securityContext:
    runAsUser: 1000
    fsGroup: 1000

telepresence_logs.zip

To Reproduce

Steps to reproduce the behavior:

  1. telepresence2 connect
  2. telepresence2 intercept <SVC> -- /bin/fish
  3. ls $TELEPRESENCE_ROOT
  4. Nothing is listed

Expected behavior

I would expect to see the secret mounts listed, as was true with the original version of Telepresence.

Versions (please complete the following information):

  • Output of telepresence version
    $ telepresence2 version
    Client: v2.4.5 (api v3)
    Root Daemon: v2.4.5 (api v3)
    User Daemon: v2.4.5 (api v3)
    
  • Operating system: Pop!_OS 21.04 x86_64
  • Kubernetes environment and Version: AWS EKS, v1.21.4 (have also run against GKE and KinD)

Additional context

Happy to do additional troubleshooting, if needed. I've been trying to get Telepresence2 working since ~v2.0.1–v2.0.3 or so!

miquella avatar Oct 22 '21 18:10 miquella

Solution

After spending some time digging through this today, I've been able to track down the issues that were preventing this from working.

  1. In order to use the allow_root mount option, the user_allow_other setting has to be enabled in /etc/fuse.conf; my distro disables this setting by default.
  2. Our pod uses a securityContext, as mentioned above. This caused the traffic-agent container to run as uid 1000, but a user with that id doesn't exist in the docker.io/datawire/tel2 image.

To solve (1), I simply added user_allow_other to my /etc/fuse.conf, but it was challenging to know this needed to be done (thanks to A for pointing this out on Slack!).
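For anyone who wants to check their own machine before intercepting, here is a minimal sketch of a test for the setting. The helper name has_user_allow_other is made up for illustration and is not part of Telepresence; it assumes the option has to appear on a line of its own, uncommented.

```shell
# Hypothetical helper (not part of Telepresence): succeeds when the given
# fuse.conf file contains an active, uncommented user_allow_other line.
has_user_allow_other() {
  conf="${1:-/etc/fuse.conf}"
  # Only match the bare option; a line like "#user_allow_other" is ignored.
  grep -Eq '^[[:space:]]*user_allow_other[[:space:]]*$' "$conf" 2>/dev/null
}

if has_user_allow_other; then
  echo "user_allow_other is enabled"
else
  echo "user_allow_other is not enabled in /etc/fuse.conf"
fi
```

If the check fails, the line has to be added to /etc/fuse.conf as root, as described above.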

To solve (2), I added a securityContext to the traffic-agent container that matched the built-in telepresence user (uid 7777):

securityContext:
  runAsUser: 7777

Debugging The Connector

My first clue came from tailing ~/.cache/telepresence/logs/connector.log. This showed that sshfs was starting and then exiting almost immediately, roughly every 5 seconds.
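A rough sketch of how to pull the relevant lines out of that log is below. The log path is the one mentioned above; the helper name sshfs_log_lines and the default of 200 lines are assumptions for illustration, and the exact log wording may differ between versions.

```shell
# Hypothetical helper: print the most recent connector log lines that
# mention sshfs. Both arguments are optional.
sshfs_log_lines() {
  log="${1:-$HOME/.cache/telepresence/logs/connector.log}"
  count="${2:-200}"
  # Look only at the tail of the log and keep lines mentioning sshfs.
  tail -n "$count" "$log" | grep -i 'sshfs'
}
```

Run `sshfs_log_lines` after starting an intercept; for a live view, `tail -f` on the same file piped into `grep -i sshfs` should work as well.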

I dug through enough to be able to run the commands manually and see their output. The sshfs invocation is where I saw the allow_root error:

$ sshfs -o allow_root …
fusermount: option allow_other only allowed if 'user_allow_other' is set in /etc/fuse.conf

Debugging SFTP Server

After resolving this, the same behavior persisted, so I checked the traffic-agent logs. There I saw corresponding sftp-server invocations with the same behavior, where the process ends immediately.

Opening a shell into the traffic-agent container and invoking sftp-server revealed the issue there:

$ sftp-server
No user found for uid 1000

And sure enough /etc/passwd showed no user with id 1000, but did show telepresence with id 7777, so that's what I used to get it to work, as noted above.
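The same lookup can be scripted. This is only a sketch: uid_in_passwd is a hypothetical helper name, and it assumes awk is available in the container you run it in.

```shell
# Hypothetical helper: print the user name for a numeric uid and succeed,
# or fail silently if the passwd file has no entry for that uid.
uid_in_passwd() {
  uid="$1"
  passwd_file="${2:-/etc/passwd}"
  # Field 3 of each colon-separated passwd entry is the numeric uid.
  awk -F: -v uid="$uid" '$3 == uid { print $1; found = 1 } END { exit found ? 0 : 1 }' "$passwd_file"
}

# Check the uid the current shell is running as.
uid_in_passwd "$(id -u)" || echo "no passwd entry for uid $(id -u)"
```

Run inside the traffic-agent container, a missing entry for the uid set by runAsUser would point at the same problem that sftp-server reported.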


Notes

Both of these issues were quite difficult to track down due to the connector and traffic-agent not attaching stderr. If the errors from sshfs and sftp-server were captured somewhere, this would have been much simpler to figure out :slightly_smiling_face:

miquella avatar Jan 05 '22 21:01 miquella

We should add the information conveyed in this ticket to our troubleshooting guide.

thallgren avatar Jan 07 '22 06:01 thallgren

I created #2285 to ensure that any output to stderr from sshfs and the sftp-server is logged as errors.

thallgren avatar Jan 07 '22 08:01 thallgren

@thallgren: Respectfully, I believe this is more than simply friction :slightly_smiling_face:

Although there was significant friction in debugging the issue (and thank you for addressing that!), there does still seem to be a bug in the way the traffic-agent container is injected into the pod when a securityContext is being used in the pod. It seems that the injected container would need to adjust the securityContext to match a user from the corresponding image, wouldn't it?

To put it another way: it seems like it shouldn't be necessary to tweak the traffic-agent container definition every time it's injected in order for the container to work properly.

miquella avatar Feb 10 '22 17:02 miquella

@miquella to us, the friction label signifies that this is a problem that creates friction in the community and needs to be resolved a.s.a.p. because it inhibits adoption. So it's pretty serious. I can see how it can be perceived differently though.

I'm investigating the best way to solve this. A contributing factor is that OpenShift insists on assigning random UIDs, so we can't really specify a UID to use in the container.

thallgren avatar Feb 23 '22 09:02 thallgren

Ah, I misunderstood :smile:

I actually wondered if OpenShift would cause an issue resolving this due to the security context. Happy to help test things, if needed (although I'm not sure if I can help much on the OpenShift side).

miquella avatar Feb 23 '22 17:02 miquella

I had the same volume mount issue running Telepresence v2.5.4 on Ubuntu 20.04, and @miquella's solution of enabling user_allow_other in /etc/fuse.conf was also needed for me to get volume mounts to work. I think the documentation needs some updating (unless a validation step can be added to the code that automatically detects the absence of this setting).

petergardfjall avatar Apr 05 '22 13:04 petergardfjall