TELEPRESENCE_ROOT doesn't contain any of the mounted secrets
Describe the bug
When attempting to intercept a service, the $TELEPRESENCE_ROOT directory is empty and doesn't contain any of the pod's mounts.
During a previous debugging session, a question was raised around whether it could be related to the pod's securityContext, as it was failing to create symlinks for the secrets (v2.4.3 or v2.4.4 was being used at the time). However, those log messages no longer appear in v2.4.5.
In case it's relevant, the securityContext used in our pods is as follows:
securityContext:
  runAsUser: 1000
  fsGroup: 1000
To Reproduce
Steps to reproduce the behavior:
- telepresence2 connect
- telepresence2 intercept <SVC> -- /bin/fish
- ls $TELEPRESENCE_ROOT
- Nothing is listed
Expected behavior
I would expect to see the secret mounts listed, as was true with the original version of Telepresence.
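For reference, the quickest check from inside the intercept shell is to list one of the standard mount paths; the service-account directory below is just the default Kubernetes location and is only an example:
$ ls "$TELEPRESENCE_ROOT/var/run/secrets/kubernetes.io/serviceaccount"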
Versions (please complete the following information):
- Output of telepresence version:
  $ telepresence2 version
  Client: v2.4.5 (api v3)
  Root Daemon: v2.4.5 (api v3)
  User Daemon: v2.4.5 (api v3)
- Operating system: Pop!_OS 21.04 x86_64
- Kubernetes environment and Version: AWS EKS, v1.21.4 (have also run against GKE and KinD)
Additional context
Happy to do additional troubleshooting, if needed. I've been trying to get Telepresence2 working since ~v2.0.1–v2.0.3 or so!
Solution
After spending some time digging through this today, I've been able to track down the issues that were preventing this from working.
1. In order to use allow_root (here), the user_allow_other setting has to be enabled in /etc/fuse.conf; my distro disables this setting by default.
2. Our pod uses a securityContext, as mentioned above. This caused the traffic-agent container to run as uid 1000, but a user with that id doesn't exist in the docker.io/datawire/tel2 image.
To solve (1), I simply added user_allow_other to my /etc/fuse.conf, but it was challenging to know this needed to be done (thanks to A for pointing this out on Slack!).
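For anyone else hitting this, the change itself is a one-liner (assuming the standard fuse configuration file location of /etc/fuse.conf):
$ echo 'user_allow_other' | sudo tee -a /etc/fuse.conf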
To solve (2), I added a securityContext to the traffic-agent container that matched the built-in telepresence user (uid 7777):
securityContext:
  runAsUser: 7777
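To illustrate where that override lands, here is a rough sketch of the intercepted workload after injection; only the runAsUser on the traffic-agent container matters, and the image tag and surrounding fields are illustrative:
spec:
  template:
    spec:
      containers:
        - name: traffic-agent
          image: docker.io/datawire/tel2:2.4.5   # tag shown only as an example
          securityContext:
            runAsUser: 7777   # matches the built-in telepresence user in the tel2 image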
Debugging The Connector
My clue was tailing ~/.cache/telepresence/logs/connector.log. This showed that sshfs was starting and then exiting almost immediately, roughly every 5 seconds.
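To reproduce that observation, the log can simply be followed while retrying the intercept:
$ tail -f ~/.cache/telepresence/logs/connector.log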
I dug through enough to be able to run the commands manually and see the output. The sshfs invocation is where I saw the allow_root error:
$ sshfs -o allow_user …
fusermount: option allow_other only allowed if 'user_allow_other' is set in /etc/fuse.conf
Debugging SFTP Server
After resolving this, the same behavior persisted, so I checked the traffic-agent logs, where I saw corresponding sftp-server invocations with the same behavior: the process ending immediately.
Opening a shell into the traffic-agent container and invoking sftp-server revealed the issue there:
$ sftp-server
No user found for uid 1000
And sure enough, /etc/passwd showed no user with id 1000, but did show telepresence with id 7777, so that's what I used to get it to work, as noted above.
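For anyone wanting to verify the same thing in their own cluster, the check can also be run without opening a full shell (the pod name is a placeholder, and this assumes cat is available in the tel2 image):
$ kubectl exec <intercepted-pod> -c traffic-agent -- cat /etc/passwd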
Notes
Both of these issues were quite difficult to track down due to the connector and traffic-agent not attaching stderr. If the errors from sshfs and sftp-server were captured somewhere, this would have been much simpler to figure out :slightly_smiling_face:
We should add the information conveyed in this ticket to our troubleshooting guide.
I created #2285 to ensure that any output to stderr from sshfs and the sftp-server is logged as errors.
@thallgren: Respectfully, I believe this is more than simply friction :slightly_smiling_face:
Although there was significant friction in debugging the issue (and thank you for addressing that!), there does still seem to be a bug in the way the traffic-agent container is injected when a securityContext is being used in the pod. It seems that the injected container would need to adjust the securityContext to match a user from the corresponding image, wouldn't it?
To put it another way: it seems like it shouldn't be necessary to tweak the traffic-agent container definition every time it's injected in order for the container to work properly.
@miquella: to us, the friction label signifies that this is a problem that creates friction in the community and needs to be resolved a.s.a.p. because it inhibits adoption, so it's pretty serious. I can see how it could be perceived differently, though.
I'm investigating the best way to solve this. A contributing factor is that OpenShift insists on assigning random UIDs, so we can't really specify a UID to use in the container.
Ah, I misunderstood :smile:
I actually wondered whether OpenShift would complicate resolving this, due to the security context. Happy to help test things, if needed (although I'm not sure I can help much on the OpenShift side).
I had the same volume mount issue running Telepresence v2.5.4 on Ubuntu 20.04, and @miquella's solution of enabling user_allow_other in /etc/fuse.conf was also needed for me to get volume mounts to work. I think the documentation needs some updating (unless a validation step can be added to the code that automatically detects the absence of this setting).
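Something along these lines could serve as that validation step; this is only a sketch, not actual Telepresence code:
$ grep -Eq '^[[:space:]]*user_allow_other' /etc/fuse.conf 2>/dev/null \
    || echo 'user_allow_other is not set in /etc/fuse.conf; remote volume mounts will fail' >&2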