v2.6.x crashes on tel-agent-init container start
Using telepresence in a k3s cluster that is set up with colima on a Mac with an ARM processor.
Colima uses alpine-lima for the VM and k3s for the Kubernetes cluster.
Running telepresence intercept ... adds the traffic-manager, which starts normally. The target deployment is also updated to contain tel-agent-init and traffic-agent. Then tel-agent-init fails to start with the following log:
info Traffic Agent Init v2.6.2
error failed to clear chain TEL_INBOUND_TCP: running [/sbin/iptables -t nat -N TEL_INBOUND_TCP --wait]: exit status 3: iptables v1.8.7 (legacy): can't initialize iptables table `nat': iptables who? (do you need to insmod?) Perhaps iptables or your kernel needs to be upgraded.
error quit: failed to clear chain TEL_INBOUND_TCP: running [/sbin/iptables -t nat -N TEL_INBOUND_TCP --wait]: exit status 3: iptables v1.8.7 (legacy): can't initialize iptables table `nat': iptables who? (do you need to insmod?) Perhaps iptables or your kernel needs to be upgraded.
I am not sure whether it is related to an update of the underlying VM, k3s, or telepresence. I've tried running the following commands in the VM:
sudo modprobe ip_tables
sudo echo 'ip_tables' >> /etc/modules
But it didn't help.
Any clue what the missing piece is here for the above error?
FYI, v2.5.8 seems to work without issues:
info Traffic Agent v2.5.8 [pid:1]
info {Name:account-data-service Namespace:default PodIP:172.17.0.34 AgentPort:9900 AppMounts:/tel_app_mounts AppPort:8080 ManagerHost:traffic-manager.ambassador ManagerPort:8081 APIPort:0}
info new agent secrets mount path: /var/run/secrets/kubernetes.io
info client : Connected to Manager 2.5.8
info client : Setting intercept "3e1be60b-5ad8-4351-8955-7ec448933832:account-data-service" as ACTIVE
The reason 2.5.8 works is probably that it modifies the actual service to point at the traffic-agent's port instead of the intercepted container's port. In 2.6.x, we no longer modify services and deployments. Instead, we always use the mutating webhook injector. Since the service then cannot be modified, we instead inject an init container that alters the iptables rules. That should work, but obviously doesn't in your case.
I'll try setting up colima on my M1 and see if I can reproduce.
Depending on the use-case, a workaround until this is fixed may be to use a symbolic targetPort in the service ports instead of a numeric one. This is not possible for a headless service though.
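For example, a minimal sketch with hypothetical names; the important part is that the Service's targetPort refers to the containerPort by name rather than by number:

apiVersion: v1
kind: Service
metadata:
  name: my-service            # hypothetical name
spec:
  ports:
    - port: 80
      targetPort: http        # symbolic targetPort, matches the named containerPort below

and, in the Deployment's pod template:

containers:
  - name: app
    ports:
      - name: http
        containerPort: 8080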
I'm able to reproduce this on my M1, but not on any other platform.
I also have a similar issue on M1 (Apple silicon):
2022-05-24 20:18:01.5877 info Traffic Agent Init v2.6.4
2022-05-24 20:18:01.6511 error failed to clear chain TEL_INBOUND_TCP: running [/sbin/iptables -t nat -N TEL_INBOUND_TCP --wait]: exit status 4: Fatal: can't open lock file /run/xtables.lock: Permission denied
2022-05-24 20:18:01.6533 error quit: failed to clear chain TEL_INBOUND_TCP: running [/sbin/iptables -t nat -N TEL_INBOUND_TCP --wait]: exit status 4: Fatal: can't open lock file /run/xtables.lock: Permission denied
We've seen a similar issue with Ubuntu:
2022-06-22 20:39:49.2879 info Traffic Agent Init v1.12.6
2022-06-22 20:39:49.2920 error failed to clear chain TEL_INBOUND_TCP: running [/sbin/iptables -t nat -N TEL_INBOUND_TCP --wait]: exit status 4: Fatal: can't open lock file /run/xtables.lock: Permission denied
error: failed to clear chain TEL_INBOUND_TCP: running [/sbin/iptables -t nat -N TEL_INBOUND_TCP --wait]: exit status 4: Fatal: can't open lock file /run/xtables.lock: Permission denied
Does the init container need to run as root?
Unless you are using headless services, it's likely that there's a way around this. There's often no need for an init-container. Please read "Injected init-container doesn't function properly" in our Troubleshooting guide.
Thanks, that did fix the init container issue. The sidecar now crashes due to:
error: mkdir /tel_app_exports/app/var/run: read-only file system
Looks like we have to set readOnlyRootFilesystem to false?
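I.e. something like this on the affected container, if I'm reading it right:

securityContext:
  readOnlyRootFilesystem: false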
Update: After setting it to false, a new problem:
error: mkdir /tel_app_exports/app/var/run: permission denied
hmm, can it run as non-root?
It should be able to run as any user. We inject this volume:
- emptyDir: {}
name: export-volume
and the traffic-agent has the following volumeMount:
- mountPath: /tel_app_exports
name: export-volume
Hard to see why it wouldn't be able to write to that volume.
Looks like, due to a hostPath mount of the pod under /var/xyz, /tel_app_mounts/app/var becomes owned by root.
ls -la /tel_app_exports/app
lrwxrwxrwx 1 1001 1001 24 Jun 23 16:00 var -> /tel_app_mounts/app/var
ls -la /tel_app_mounts/app/var
drwxr-xr-x 3 root root 4096 Jun 23 16:15 .
drwxr-xr-x 3 root root 4096 Jun 23 16:15 xyz
@thallgren do you think it would be fine to add a way to let the webhook ignore certain volume mounts?
This thread is likely the wrong place to discuss this, hence I created https://github.com/telepresenceio/telepresence/pull/2665
@thallgren the symbolic targetPort workaround doesn't work for sidecar containers making requests to localhost. If iptables isn't so reliable, maybe the old approach with the deployment and service modification could be kept as a method of interception?
Well, the old approach did just that. It made the targetPort symbolic, so if that doesn't work now, it didn't work before either. Did you try using the --to-pod flag for the localhost port?
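Something along these lines (the workload name and port numbers are just placeholders):

telepresence intercept my-service --port 8080 --to-pod 9090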
The --to-pod flag is for making the intercepted service able to call a sidecar in the cluster. However, my issue is the opposite: a sidecar makes calls to localhost for posting work to the intercepted service. This doesn't seem to be forwarded, and you may be correct that it didn't work in <=2.5.8 either. It worked properly in v0.109.
Aha, then I misunderstood, sorry. And no, that has never worked. I guess we could make it work, but the solution would indeed require iptables. It would also require that the container that listens to the port declares it with a containerPort, so that we know about it.
The only viable solution, if you want to get rid of iptables, is to expose the port with a service that uses a symbolic targetPort and have the container call that instead of the port on localhost.
I actually tried the second one and it didn't work. The request made from a sidecar to a service targeting the same pod was hanging. Not sure what is wrong there.
Given that we fix iptables on colima, would it be hard to implement what you described above?
It's not trivial. If we do it, then it will probably be in conjunction with fixing this ticket.
@thallgren regarding this:
The only viable solution if you want to get rid of iptables, is to expose the port with a service that uses a symbolic targetPort and have the container call that instead of the port on localhost.
Maybe it is a limitation of Kubernetes.
Trying another workaround. It is an option for me to set up the sidecar container with an env var, pointing it to make its request to the traffic-agent container directly on port 9900. This way the request is proxied to the local process, and it works fine for containers listening on a single port. However, when the container is listening on two separate ports, how would I make the request to a specific port?
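Roughly like this in the sidecar's container spec (the env var name is a made-up one that my sidecar would read):

env:
  - name: TARGET_URL          # hypothetical
    value: http://localhost:9900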
If you're using telepresence 2.6.x, you can take a look at the mappings that Telepresence finds for each workload by doing a kubectl describe configmap telepresence-agents. It will show you which agent port is mapped to which container port.
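For example, against the namespace the workload lives in:

kubectl describe configmap telepresence-agents -n default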
Tested and confirming it's working. One last thing: can you point me to the code that does the mutation of the manifests during an agent inject? I am thinking of making a change there to add the env vars for the intercepted ports to the sidecar container.
Edit: found it in cmd/traffic/cmd/manager/internal/mutator/agent_injector.go
One thing just struck me. If you have a problem with another sidecar that is injected, and using symbolic ports just doesn't work, then that's probably because the other sidecar is injected before the telepresence traffic-agent and never becomes aware of its presence. If it did, it would just work. That situation can be corrected by a Helm chart setting.
Try setting agentInjector.webhook.reinvocationPolicy to IfNeeded. That will reinvoke the injection of all other containers and should make that sidecar realize that it must redirect to the traffic-agent, because it's the traffic-agent that has the correctly named port at that time.
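In the Helm values that looks roughly like this:

agentInjector:
  webhook:
    reinvocationPolicy: IfNeeded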
You'll find the starting point of the injection code here.
The other sidecar isn't injected, but is part of the pod template.
Trying to build the tel2 image fails:
make tel2 TELEPRESENCE_VERSION=v2.6.8
[make] TELEPRESENCE_VERSION=v2.6.8
mkdir -p build-output
printf v2.6.8 > build-output/version.txt ## Pass version in a file instead of a --build-arg to maximize cache usage
docker build --target tel2 --tag tel2 --tag docker.io/datawire/tel2:2.6.8 -f base-image/Dockerfile .
[+] Building 22.6s (20/23)
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 1.99kB 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 35B 0.0s
=> [internal] load metadata for docker.io/library/alpine:3.15 0.6s
=> [internal] load metadata for docker.io/library/golang:alpine3.15 0.6s
=> [tel2-base 1/8] FROM docker.io/library/golang:alpine3.15@sha256:f9181168749690bddb6751b004e976bf5d427425e0cfb50522e92c06f761def7 0.0s
=> [tel2 1/4] FROM docker.io/library/alpine:3.15@sha256:4edbd2beb5f78b1014028f4fbb99f3237d9561100b6881aabbf5acce2c4f9454 0.0s
=> [internal] load build context 0.1s
=> => transferring context: 265.81kB 0.1s
=> CACHED [tel2 2/4] RUN apk add --no-cache ca-certificates iptables 0.0s
=> CACHED [tel2-base 2/8] RUN apk add --no-cache gcc musl-dev 0.0s
=> CACHED [tel2-base 3/8] WORKDIR telepresence 0.0s
=> CACHED [tel2-base 4/8] COPY go.mod . 0.0s
=> CACHED [tel2-base 5/8] COPY go.sum . 0.0s
=> CACHED [tel2-base 6/8] COPY rpc/go.mod rpc/ 0.0s
=> CACHED [tel2-base 7/8] COPY rpc/go.sum rpc/ 0.0s
=> CACHED [tel2-base 8/8] RUN go mod download 0.0s
=> CACHED [tel2-build 1/6] COPY cmd/ cmd/ 0.0s
=> CACHED [tel2-build 2/6] COPY pkg/ pkg/ 0.0s
=> CACHED [tel2-build 3/6] COPY rpc/ rpc/ 0.0s
=> CACHED [tel2-build 4/6] COPY build-output/version.txt . 0.0s
=> ERROR [tel2-build 5/6] RUN go install -trimpath -ldflags=-X=$(go list ./pkg/version).Version=$(cat version.txt) ./cmd/traffic/... 21.8s
------
> [tel2-build 5/6] RUN go install -trimpath -ldflags=-X=$(go list ./pkg/version).Version=$(cat version.txt) ./cmd/traffic/...:
#20 21.47 # github.com/telepresenceio/telepresence/v2/cmd/traffic
#20 21.47 /usr/local/go/pkg/tool/linux_arm64/link: running gcc failed: exit status 1
#20 21.47 collect2: fatal error: cannot find 'ld'
#20 21.47 compilation terminated.
#20 21.47
------
executor failed running [/bin/sh -c go install -trimpath -ldflags=-X=$(go list ./pkg/version).Version=$(cat version.txt) ./cmd/traffic/...]: exit code: 2
make: *** [tel2] Error 1
Any idea?
Disabling CGO here fixes it. The image doesn't appear to be any bigger. Why was the gcc build preferred?
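What I changed is, roughly, the install step in base-image/Dockerfile:

RUN CGO_ENABLED=0 go install -trimpath -ldflags=-X=$(go list ./pkg/version).Version=$(cat version.txt) ./cmd/traffic/...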
Not sure why the build fails for you. We don't disable CGO in our CI builds, nor do we disable it during development. The reason it's preferred is that you get proper DNS resolution, not the Go variant, which has given us some grief in the past.
I usually just do make push-image with TELEPRESENCE_REGISTRY set to a local registry.
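E.g. something like (the registry value is just an example):

make push-image TELEPRESENCE_REGISTRY=localhost:5000 TELEPRESENCE_VERSION=v2.6.8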
Probably something specific to arm64: /usr/local/go/pkg/tool/linux_arm64/link: running gcc failed
I've noticed that telepresence doesn't have ARM images. The tel2 and ambassador-telepresence-agent images are amd64 and run through emulation on ARM.
Maybe this is related to this issue.
I defined a securityContext at the spec.template.spec level, which is also applied to the init containers. After I removed the securityContext, the init-container was not crashing anymore.
apiVersion: apps/v1
kind: StatefulSet
spec:
template:
spec:
securityContext:
runAsUser: 1000
runAsGroup: 1000
fsGroup: 1000
Maybe you can define the securityContext on the init container to run as root.
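I.e. something like this on the tel-agent-init container (just a sketch):

securityContext:
  runAsUser: 0
  runAsGroup: 0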
Yeah, I'm running into the same error on OpenShift 4.10 with Telepresence 2.7.2. Tried granting all the permissions possible with no change.
2022-08-25 23:14:45.8931 info Traffic Agent Init v1.12.10
2022-08-25 23:14:45.8957 error failed to clear chain TEL_INBOUND_TCP: running [/sbin/iptables -t nat -N TEL_INBOUND_TCP --wait]: exit status 3: iptables v1.8.7 (legacy): can't initialize iptables table `nat': Permission denied (you must be root)
Perhaps iptables or your kernel needs to be upgraded.
error: failed to clear chain TEL_INBOUND_TCP: running [/sbin/iptables -t nat -N TEL_INBOUND_TCP --wait]: exit status 3: iptables v1.8.7 (legacy): can't initialize iptables table `nat': Permission denied (you must be root)
Perhaps iptables or your kernel needs to be upgraded.
Looks like the iptables kernel module isn't loaded because they use nftables.
I found an old discussion from Istio about it.
https://github.com/istio/istio/issues/13986
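One way to check which backend the node actually uses, from inside the VM:

lsmod | grep -E 'ip_tables|nf_tables'
iptables --version    # the banner shows (legacy) or (nf_tables)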
Same here. I use Rancher Desktop on my M2 MacBook Air and the telepresence version is v2.7.2.
Any solution for this?