v2.6.x crashes on tel-agent-init container start
Using telepresence in a k3s cluster that is set up with colima on a Mac with an ARM processor.
Colima uses alpine-lima for the VM and k3s for the Kubernetes cluster.
Running telepresence intercept ... adds the traffic-manager, which starts normally. The target deployment is also updated to contain tel-agent-init and traffic-agent. Then tel-agent-init fails to start with the following log:
info Traffic Agent Init v2.6.2
error failed to clear chain TEL_INBOUND_TCP: running [/sbin/iptables -t nat -N TEL_INBOUND_TCP --wait]: exit status 3: iptables v1.8.7 (legacy): can't initialize iptables table `nat': iptables who? (do you need to insmod?) Perhaps iptables or your kernel needs to be upgraded.
error quit: failed to clear chain TEL_INBOUND_TCP: running [/sbin/iptables -t nat -N TEL_INBOUND_TCP --wait]: exit status 3: iptables v1.8.7 (legacy): can't initialize iptables table `nat': iptables who? (do you need to insmod?) Perhaps iptables or your kernel needs to be upgraded.
I am not sure whether it is related to an update of the underlying VM, k3s, or telepresence. I've tried running the following commands in the VM:
sudo modprobe ip_tables
sudo echo 'ip_tables' >> /etc/modules
But it didn't help.
Any clue what the missing piece is here for the above error?
FYI, v2.5.8 seems to work without issues:
info Traffic Agent v2.5.8 [pid:1]
info {Name:account-data-service Namespace:default PodIP:172.17.0.34 AgentPort:9900 AppMounts:/tel_app_mounts AppPort:8080 ManagerHost:traffic-manager.ambassador ManagerPort:8081 APIPort:0}
info new agent secrets mount path: /var/run/secrets/kubernetes.io
info client : Connected to Manager 2.5.8
info client : Setting intercept "3e1be60b-5ad8-4351-8955-7ec448933832:account-data-service" as ACTIVE
The reason 2.5.8 works is probably that it modifies the actual service to point at the traffic-agent's port instead of the intercepted container's port. In 2.6.x, we no longer modify services and deployments. Instead, we always use the mutating webhook injector. Since the service then cannot be modified, we instead inject an init container that alters the iptables rules. That should work, but obviously doesn't in your case.
I'll try setting up colima on my M1 and see if I can reproduce.
Depending on the use-case, a workaround until this is fixed may be to use a symbolic targetPort in the service ports instead of a numeric one. This is not possible for a headless service though.
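For example, a minimal sketch with hypothetical names; the important part is that the Service's targetPort refers to the containerPort by name rather than by number:

apiVersion: v1
kind: Service
metadata:
  name: my-service            # hypothetical name
spec:
  ports:
    - port: 80
      targetPort: http        # symbolic targetPort, matches the named containerPort below

and, in the Deployment's pod template:

containers:
  - name: app
    ports:
      - name: http
        containerPort: 8080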
I'm able to reproduce this on my M1, but not on any other platform.
I also have a similar issue on M1 (Apple silicon):
2022-05-24 20:18:01.5877 info Traffic Agent Init v2.6.4
2022-05-24 20:18:01.6511 error failed to clear chain TEL_INBOUND_TCP: running [/sbin/iptables -t nat -N TEL_INBOUND_TCP --wait]: exit status 4: Fatal: can't open lock file /run/xtables.lock: Permission denied
2022-05-24 20:18:01.6533 error quit: failed to clear chain TEL_INBOUND_TCP: running [/sbin/iptables -t nat -N TEL_INBOUND_TCP --wait]: exit status 4: Fatal: can't open lock file /run/xtables.lock: Permission denied
We've seen a similar issue with Ubuntu:
2022-06-22 20:39:49.2879 info Traffic Agent Init v1.12.6
2022-06-22 20:39:49.2920 error failed to clear chain TEL_INBOUND_TCP: running [/sbin/iptables -t nat -N TEL_INBOUND_TCP --wait]: exit status 4: Fatal: can't open lock file /run/xtables.lock: Permission denied
error: failed to clear chain TEL_INBOUND_TCP: running [/sbin/iptables -t nat -N TEL_INBOUND_TCP --wait]: exit status 4: Fatal: can't open lock file /run/xtables.lock: Permission denied
Does the init container need to run as root?
Unless you are using headless services, it's likely that there's a way around this. There's often no need for an init-container. Please read "Injected init-container doesn't function properly" in our Troubleshooting guide.
Thanks, that did fix the init container issue. The sidecar now crashes due to:
error: mkdir /tel_app_exports/app/var/run: read-only file system
Looks like we have to set readOnlyRootFilesystem to false?
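I.e. something like this on the affected container, if I'm reading it right:

securityContext:
  readOnlyRootFilesystem: false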
Update: After setting it to false, a new problem:
error: mkdir /tel_app_exports/app/var/run: permission denied
hmm, can it run as non-root?
It should be able to run as any user. We inject this volume:
- emptyDir: {}
name: export-volume
and the traffic-agent has the following volumeMount:
- mountPath: /tel_app_exports
name: export-volume
Hard to see why it wouldn't be able to write to that volume.
Looks like, due to a hostPath mount of the pod under /var/xyz, /tel_app_mounts/app/var becomes owned by root.
ls -la /tel_app_exports/app
lrwxrwxrwx 1 1001 1001 24 Jun 23 16:00 var -> /tel_app_mounts/app/var
ls -la /tel_app_mounts/app/var
drwxr-xr-x 3 root root 4096 Jun 23 16:15 .
drwxr-xr-x 3 root root 4096 Jun 23 16:15 xyz
@thallgren do you think it would be fine to add a way to let the webhook ignore certain volume mounts?
This thread is likely the wrong place to discuss this, hence I created https://github.com/telepresenceio/telepresence/pull/2665
@thallgren the symbolic targetPort workaround doesn't work for sidecar containers making requests to localhost. If iptables isn't so reliable, maybe the old approach with the deployment and service modification could be kept as a method of interception?
Well, the old approach did just that. It made the targetPort symbolic, so if that doesn't work now, it didn't work before either. Did you try using the --to-pod flag for the localhost port?
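Something along these lines (the workload name and port numbers are just placeholders):

telepresence intercept my-service --port 8080 --to-pod 9090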
The --to-pod flag is for making the intercepted service able to call a sidecar in the cluster. However, my issue is the opposite: a sidecar makes calls to localhost for posting work to the intercepted service. This doesn't seem to be forwarded, and you may be correct that it didn't work in <=2.5.8 either. It worked properly in v0.109.
Aha, then I misunderstood, sorry. And no, that has never worked. I guess we could make it work, but the solution would indeed require iptables. It would also require that the container that listens to the port declares it with a containerPort, so that we know about it.
The only viable solution, if you want to get rid of iptables, is to expose the port with a service that uses a symbolic targetPort and have the container call that instead of the port on localhost.
I actually tried the second one and it didn't work. The request made from a sidecar to a service targeting the same pod was hanging. Not sure what is wrong there.
Given that we fix iptables on colima, would it be hard to implement what you described above?
It's not trivial. If we do it, then it will probably be in conjunction with fixing this ticket.
@thallgren regarding this:
The only viable solution if you want to get rid of iptables, is to expose the port with a service that uses a symbolic targetPort and have the container call that instead of the port on localhost.
Maybe it is a limitation of Kubernetes.
Trying another workaround. It is an option for me to set up the sidecar container with an env var, pointing it to make its request to the traffic-agent container directly on port 9900. This way the request is proxied to the local process, and it works fine for containers listening on a single port. However, when the container is listening on two separate ports, how would I make the request to a specific port?
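Roughly like this in the sidecar's container spec (the env var name is a made-up one that my sidecar would read):

env:
  - name: TARGET_URL          # hypothetical
    value: http://localhost:9900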
If you're using telepresence 2.6.x, you can take a look at the mappings that Telepresence finds for each workload by doing a kubectl describe configmap telepresence-agents. It will show you which agent port is mapped to which container port.
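For example, against the namespace the workload lives in:

kubectl describe configmap telepresence-agents -n default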
Tested and confirming it's working. One last thing: can you point me to the code that does the mutation of the manifests during an agent inject? I am thinking of making a change there to add the env vars for the intercepted ports to the sidecar container.
Edit: found it in cmd/traffic/cmd/manager/internal/mutator/agent_injector.go
One thing just struck me. If you have a problem with another sidecar that is injected, and using symbolic ports just doesn't work, then that's probably because the other sidecar is injected before the telepresence traffic-agent and never becomes aware of its presence. If it did, it would just work. That situation can be corrected by a Helm chart setting.
Try setting agentInjector.webhook.reinvocationPolicy to IfNeeded. That will reinvoke the injection of all other containers and should make that sidecar realize that it must redirect to the traffic-agent, because it's the traffic-agent that has the correctly named port at that time.
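In the Helm values that looks roughly like this:

agentInjector:
  webhook:
    reinvocationPolicy: IfNeeded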
You'll find the starting point of the injection code here.
The other sidecar isn't injected, but is part of the pod template.
Trying to build the tel2 image fails:
make tel2 TELEPRESENCE_VERSION=v2.6.8
[make] TELEPRESENCE_VERSION=v2.6.8
mkdir -p build-output
printf v2.6.8 > build-output/version.txt ## Pass version in a file instead of a --build-arg to maximize cache usage
docker build --target tel2 --tag tel2 --tag docker.io/datawire/tel2:2.6.8 -f base-image/Dockerfile .
[+] Building 22.6s (20/23)
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 1.99kB 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 35B 0.0s
=> [internal] load metadata for docker.io/library/alpine:3.15 0.6s
=> [internal] load metadata for docker.io/library/golang:alpine3.15 0.6s
=> [tel2-base 1/8] FROM docker.io/library/golang:alpine3.15@sha256:f9181168749690bddb6751b004e976bf5d427425e0cfb50522e92c06f761def7 0.0s
=> [tel2 1/4] FROM docker.io/library/alpine:3.15@sha256:4edbd2beb5f78b1014028f4fbb99f3237d9561100b6881aabbf5acce2c4f9454 0.0s
=> [internal] load build context 0.1s
=> => transferring context: 265.81kB 0.1s
=> CACHED [tel2 2/4] RUN apk add --no-cache ca-certificates iptables 0.0s
=> CACHED [tel2-base 2/8] RUN apk add --no-cache gcc musl-dev 0.0s
=> CACHED [tel2-base 3/8] WORKDIR telepresence 0.0s
=> CACHED [tel2-base 4/8] COPY go.mod . 0.0s
=> CACHED [tel2-base 5/8] COPY go.sum . 0.0s
=> CACHED [tel2-base 6/8] COPY rpc/go.mod rpc/ 0.0s
=> CACHED [tel2-base 7/8] COPY rpc/go.sum rpc/ 0.0s
=> CACHED [tel2-base 8/8] RUN go mod download 0.0s
=> CACHED [tel2-build 1/6] COPY cmd/ cmd/ 0.0s
=> CACHED [tel2-build 2/6] COPY pkg/ pkg/ 0.0s
=> CACHED [tel2-build 3/6] COPY rpc/ rpc/ 0.0s
=> CACHED [tel2-build 4/6] COPY build-output/version.txt . 0.0s
=> ERROR [tel2-build 5/6] RUN go install -trimpath -ldflags=-X=$(go list ./pkg/version).Version=$(cat version.txt) ./cmd/traffic/... 21.8s
------
> [tel2-build 5/6] RUN go install -trimpath -ldflags=-X=$(go list ./pkg/version).Version=$(cat version.txt) ./cmd/traffic/...:
#20 21.47 # github.com/telepresenceio/telepresence/v2/cmd/traffic
#20 21.47 /usr/local/go/pkg/tool/linux_arm64/link: running gcc failed: exit status 1
#20 21.47 collect2: fatal error: cannot find 'ld'
#20 21.47 compilation terminated.
#20 21.47
------
executor failed running [/bin/sh -c go install -trimpath -ldflags=-X=$(go list ./pkg/version).Version=$(cat version.txt) ./cmd/traffic/...]: exit code: 2
make: *** [tel2] Error 1
Any idea?
Disabling CGO here fixes it. The image doesn't appear to be any bigger. Why was the gcc build preferred?
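What I changed is, roughly, the install step in base-image/Dockerfile:

RUN CGO_ENABLED=0 go install -trimpath -ldflags=-X=$(go list ./pkg/version).Version=$(cat version.txt) ./cmd/traffic/...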
Not sure why the build fails for you. We don't disable CGO in our CI builds, nor do we disable it during development. The reason it's preferred is that you get proper DNS resolution, not the Go variant, which has given us some grief in the past.
I usually just do make push-image with TELEPRESENCE_REGISTRY set to a local registry.
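E.g. something like (the registry value is just an example):

make push-image TELEPRESENCE_REGISTRY=localhost:5000 TELEPRESENCE_VERSION=v2.6.8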
Probably something specific to arm64: /usr/local/go/pkg/tool/linux_arm64/link: running gcc failed
I've noticed that telepresence doesn't have ARM images. The tel2 and ambassador-telepresence-agent images are amd64 and run through emulation on ARM.
Maybe this is related to this issue.
I defined a securityContext at the spec.template.spec level, which is also applied to the init containers. After I removed the securityContext, the init-container was not crashing anymore.
apiVersion: apps/v1
kind: StatefulSet
spec:
template:
spec:
securityContext:
runAsUser: 1000
runAsGroup: 1000
fsGroup: 1000
Maybe you can define the securityContext on the init container to run as root.
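I.e. something like this on the tel-agent-init container (just a sketch):

securityContext:
  runAsUser: 0
  runAsGroup: 0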
Yeah, I'm running into the same error on OpenShift 4.10 with Telepresence 2.7.2. Tried granting all the permissions possible with no change.
2022-08-25 23:14:45.8931 info Traffic Agent Init v1.12.10
2022-08-25 23:14:45.8957 error failed to clear chain TEL_INBOUND_TCP: running [/sbin/iptables -t nat -N TEL_INBOUND_TCP --wait]: exit status 3: iptables v1.8.7 (legacy): can't initialize iptables table `nat': Permission denied (you must be root)
Perhaps iptables or your kernel needs to be upgraded.
error: failed to clear chain TEL_INBOUND_TCP: running [/sbin/iptables -t nat -N TEL_INBOUND_TCP --wait]: exit status 3: iptables v1.8.7 (legacy): can't initialize iptables table `nat': Permission denied (you must be root)
Perhaps iptables or your kernel needs to be upgraded.
Looks like the iptables kernel module isn't loaded because they use nftables.
I found an old discussion from Istio about it.
https://github.com/istio/istio/issues/13986
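One way to check which backend the node actually uses, from inside the VM:

lsmod | grep -E 'ip_tables|nf_tables'
iptables --version    # the banner shows (legacy) or (nf_tables)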
Same here. I use Rancher Desktop on my M2 MacBook Air and the telepresence version is v2.7.2.
Any solution for this?