apiserver-network-proxy
konnectivity fails on kubectl cp
We are seeing issues with the kubectl cp functionality with version 0.0.23 (and earlier).
In my test, I'm copying 3M files to a pod, but the third or fourth copy always fails, and then nothing else works with that konnectivity agent.
Command:
kubectl cp /tmp/testfile-3M.bin test-k8s-cp-ds-mzlbb:/tmp/testfile-3M.bin -n my-namespace
When a successful copy ends, I see the following logs in the agent:
I0826 13:53:51.986610 1 client.go:482] connection EOF
I0826 13:53:51.986665 1 client.go:400] "close connection" connectionID=6
I0826 13:53:52.021504 1 client.go:444] "received CLOSE_REQ" connectionID=6
I0826 13:53:52.021568 1 client.go:450] "Failed to find connection context for close" connectionID=6
When it fails, I don't see the connection EOF message, and the agent still writes some logs (for different connectionIDs), but the kubectl cp hangs.
Then when I try to get some logs, I get the following error:
➜ kubectl logs -n kube-system konnectivity-agent-bvzf6 --tail 10
Error from server: Get "https://10.188.233.5:10250/containerLogs/kube-system/konnectivity-agent-bvzf6/konnectivity-agent?tailLines=10": net/http: TLS handshake timeout
Once I restart the agent, everything goes back to normal.
I think it is easy to reproduce the problem, and I can share more configuration options and pprof data if necessary. Please give me some suggestions on how to tackle this issue.
We have customers who are hitting this issue, and we need to provide a fix.
Thank you! Adam
The problem also occurs when I try to copy a 20M file to a container.
user@virtualserver01:~/home/tests/pvg# head -c "20M" </dev/urandom > 20M
user@virtualserver01:~/home/tests/pvg# kubectl cp 20M test-k8s-cp-ds-mzlbb:/tmp/20M -npvg-armada-network-k8s-cp
E0826 17:08:02.772597 6087 v2.go:105] write tcp 169.x.x.x:42940->169.y.y.y:30719: write: connection reset by peer
Then, when trying to get some logs:
➜ kubectl logs -n kube-system konnectivity-agent-ztbrj
Error from server: Get "https://10.188.233.5:10250/containerLogs/kube-system/konnectivity-agent-ztbrj/konnectivity-agent": net/http: TLS handshake timeout
In the log, 169.y.y.y is the server address and 169.x.x.x is my virtual server's IP address.
More configuration details:
- the agent runs in the container network (the CNI is Calico)
- it has one identifier: --agent-identifiers=ipv4=$(HOST_IP)
- it connects to the master using a DNS name which resolves to 1 IP, but sometimes to a different one
Thank you!
Here is what I have found so far:
- our konnectivity server has the following strategies: destHost,default
- agents have the identifier ipv4=$(HOST_IP), as I posted above
Once I remove this agent identifier, kubectl cp starts working fine, with no issues even with big files. I have a test case which executes several copies, and so far it is successful.
It is strange to me, but it seems the destHost strategy has a bug somewhere - later on I will check that part of the code.
Thank you!
Sorry, just a passing comment right now; I will dig in later. It seems clear that default should return a backend irrespective of what destHost does. However, if the agent identifier is changing that, it sounds like something is wrong. My best guess is that something about having an agent identifier is preventing the backend (agent) from being registered with the default backend manager. Also curious what the IP is for the test-k8s-cp-ds-mzlbb pod. I'm guessing it does not match the agent identifier.
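To spell out the fallback I have in mind, here is a very rough sketch (simplified, hypothetical types - not the actual backend manager code in this repo): every agent should be registered with all configured managers, and the server walks the strategies in order, so default should always be able to hand out a backend even when the destHost lookup misses.

```go
package main

import (
	"errors"
	"fmt"
)

// Backend stands in for a registered agent connection.
type Backend struct{ agentID string }

// DestHostManager only knows backends by their agent identifier (e.g. the node IP).
type DestHostManager struct {
	byHost map[string]*Backend
}

func (m *DestHostManager) Backend(destHost string) (*Backend, error) {
	if b, ok := m.byHost[destHost]; ok {
		return b, nil
	}
	return nil, errors.New("no backend registered for host " + destHost)
}

// DefaultManager hands out any registered backend, regardless of destination.
type DefaultManager struct {
	all []*Backend
}

func (m *DefaultManager) Backend(_ string) (*Backend, error) {
	if len(m.all) == 0 {
		return nil, errors.New("no backends registered")
	}
	return m.all[0], nil
}

// pickBackend walks the configured strategies in order: destHost, then default.
func pickBackend(destHost string, dh *DestHostManager, def *DefaultManager) (*Backend, error) {
	if b, err := dh.Backend(destHost); err == nil {
		return b, nil
	}
	return def.Backend(destHost)
}

func main() {
	agent := &Backend{agentID: "agent-1"}
	dh := &DestHostManager{byHost: map[string]*Backend{"10.10.42.22": agent}}
	def := &DefaultManager{all: []*Backend{agent}} // the same agent should be registered here too

	// Dialing a pod IP misses the destHost map but should still get a backend from default.
	b, err := pickBackend("172.32.2.7", dh, def)
	fmt.Println(b, err)
}
```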
Hi! Yes, the identifier is the node private IP where the agent is running, e.g. 10.10.42.22. The agent runs in the pod network and so does the test pod, so both have an address like 172.32.2.x.
Do you think that for a kubectl cp, the destination host is the target pod IP?
In the case of logs, the destination host is the node IP where the pod runs (as we go to the kubelet).
Thank you!
I'm just guessing; I haven't spent much time playing with kubectl cp. However, having it sent to the pod IP could explain not getting a backend from the destHost manager. You could try looking for the pod IP in the konn server log; I would expect to see it there if that's where it was being sent. Even if true, there is still something clearly wrong, as the default manager should still have handled this. I should have time to reproduce and look later.
@mihivagyok Out of curiosity how many agents do you have?
Hi! In this test, I have two workers, so there are 2 agents. Thank you!
Just tried to repro locally using the following commands.
- ./bin/http-test-server
- ./bin/proxy-agent --ca-cert=certs/agent/issued/ca.crt --agent-cert=certs/agent/issued/proxy-agent.crt --agent-key=certs/agent/private/proxy-agent.key --admin-server-port=9094 --health-server-port=9095 --agent-identifiers=ipv4=127.0.0.1
- ./bin/proxy-agent --ca-cert=certs/agent/issued/ca.crt --agent-cert=certs/agent/issued/proxy-agent.crt --agent-key=certs/agent/private/proxy-agent.key --agent-identifiers=ipv4=172.17.0.1
- ./bin/proxy-server --server-port=0 --uds-name=/tmp/uds-proxy --cluster-ca-cert=certs/agent/issued/ca.crt --cluster-cert=certs/agent/issued/proxy-frontend.crt --cluster-key=certs/agent/private/proxy-frontend.key --proxy-strategies=destHost,default
- ./bin/proxy-test-client --proxy-port=0 --proxy-uds=/tmp/uds-proxy --proxy-host="" --request-host=127.0.0.1
So UDS + GRPC and two clients. No luck so far. Does seem like something about either data volume or the cp command. Plan on trying on a proper K8s cluster tomorrow. What version of server and client are you using?
@cheftako Thank you for trying! From your test, I don't see the "copy" part, but let me share some differences:
- We use HTTP-CONNECT between the Konnectivity server and client (kube API server)
- Yes, I think you need a kubernetes cluster for this problem, we have integrated Konnectivity to Kubernetes 1.21
- The problem is that the copy hangs after some tries or with bigger files. This also tells me that the copy starts (backend selection is fine), but it never ends.
- But once I remove the agent-identifier, I can copy as many and as big files as I want; so far it works without problems. This is very strange, I know - but it also tells me that Kubernetes/kubectl is fine.
Thank you! Adam
Few more breadcrumbs. (Still working on narrowing down the repro). Kubectl -v=9 shows the following as the request.
I0829 23:38:51.775518 184241 round_trippers.go:405] POST https://X.X.X.X/api/v1/namespaces/kube-system/pods/kube-proxy-kubernetes-minion-group-xxxx/exec?command=tar&command=xf&command=-&command=-C&command=%2Ftmp&container=kube-proxy&container=kube-proxy&stderr=true&stdin=true&stdout=true 101 Switching Protocols in 228 milliseconds
I0829 23:38:51.775568 184241 round_trippers.go:411] Response Headers:
I0829 23:38:51.775589 184241 round_trippers.go:414] Connection: Upgrade
I0829 23:38:51.775606 184241 round_trippers.go:414] Upgrade: SPDY/3.1
I0829 23:38:51.775618 184241 round_trippers.go:414] X-Stream-Protocol-Version: v4.channel.k8s.io
I0829 23:38:51.775629 184241 round_trippers.go:414] Date: Mon, 30 Aug 2021 06:38:51 GMT
KAS logs show the following:
I0830 06:39:01.873427 10 client.go:107] "[tracing] recv packet" type="DATA"
I0830 06:39:01.873488 10 client.go:107] "[tracing] recv packet" type="CLOSE_RSP"
I0830 06:39:01.873509 10 clientconn.go:435] [core] Channel Connectivity change to SHUTDOWN
I0830 06:39:01.873520 10 clientconn.go:1114] [core] Subchannel Connectivity change to SHUTDOWN
I0830 06:39:01.873757 10 conn.go:57] "[tracing] send req" type="DATA"
I0830 06:39:01.873781 10 conn.go:115] closing connection
I0830 06:39:01.873789 10 conn.go:125] "[tracing] send req" type="CLOSE_REQ"
I0830 06:39:01.873865 10 httplog.go:105] "HTTP" verb="POST" URI="/api/v1/namespaces/kube-system/pods/kube-proxy-kubernetes-minion-group-xxxx/exec?command=tar&command=xf&command=-&command=-C&command=%2Ftmp&container=kube-proxy&container=kube-proxy&stderr=true&stdin=true&stdout=true" latency="10.168115135s" userAgent="kubectl/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="****" srcIP="X.X.X.X:57902" hijacked=true
I0830 06:39:01.873931 10 controlbuf.go:521] [transport] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
OK, tried with agent-identifiers, destHost,default, grpc, and 3 agents. No problem I could see. Tomorrow I will try switching to http-connect. Maybe we're hitting a size limit trying to support SPDY and http-connect. (Not sure where agent-id falls in that.)
Yes, okay, I will try to investigate it too. We have a bunch of customers who are complaining, so I still believe the problem is real. Thank you!
@cheftako If you are open to a webex session, I could show how we configure Konnectivity and the problem itself. Thank you!
@mihivagyok I would be open to a webex session. Feel free to reach out on Slack to arrange some time.
Worth checking out https://github.com/kubernetes/kubernetes/issues/60140#issuecomment-898466684
Thanks, I read it, although I think our problem is a bit different, because once I change the config, everything works well. I will look for you on slack.
Thank you! Adam
Thanks for meeting with me and showing me more details of how to repro. I have now been able to repro, although it still seems like a less frequent problem on my test system. Having reproduced it on my system, I was able to look at the relevant agent logs. The last thing I see in the agent log for the relevant connection is
2021-09-03T18:18:08.354627866Z stderr F I0903 18:18:08.354328 1 client.go:508] "write to remote" connectionID=4 lastData=4096
the agent log continues after that but I see nothing more for connectionID=4.
Going to keep digging.
Bread crumbs. Managed to get goroutine dumps on the Konnectivity Server and Konnectivity Agent after I repro'd the issue. I then subtracted the goroutine dumps from the same processes when they were 'idle'. Here are the diffs.
1 @ 0x43ae45 0x44c057 0x93291a 0x947dd9 0x9a3462 0x13cbee9 0x13ce11f 0x13dad9b 0x6e9523 0x6e4a4d 0x471d01
# 0x932919 google.golang.org/grpc/internal/transport.(*writeQuota).get+0x79 /go/pkg/mod/google.golang.org/[email protected]/internal/transport/flowcontrol.go:59
# 0x947dd8 google.golang.org/grpc/internal/transport.(*http2Server).Write+0x1f8 /go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:927
# 0x9a3461 google.golang.org/grpc.(*serverStream).SendMsg+0x241 /go/pkg/mod/google.golang.org/[email protected]/stream.go:1421
# 0x13cbee8 sigs.k8s.io/apiserver-network-proxy/proto/agent.(*agentServiceConnectServer).Send+0x48 /go/src/sigs.k8s.io/apiserver-network-proxy/proto/agent/agent.pb.go:147
# 0x13ce11e sigs.k8s.io/apiserver-network-proxy/pkg/server.(*backend).Send+0x7e /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/backend_manager.go:88
# 0x13dad9a sigs.k8s.io/apiserver-network-proxy/pkg/server.(*Tunnel).ServeHTTP+0xb5a /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/tunnel.go:149
# 0x6e9522 net/http.serverHandler.ServeHTTP+0xa2 /usr/local/go/src/net/http/server.go:2887
# 0x6e4a4c net/http.(*conn).serve+0x8cc /usr/local/go/src/net/http/server.go:1952
1/3 @ 0x43ae45 0x43349b 0x46c275 0x4da265 0x4db355 0x4db337 0x5c1e0f 0x5d4271 0x6268c3 0x50c03e 0x626b33 0x623975 0x629d45 0x629d50 0x6de759 0x5110a2 0x13dac65 0x6e9523 0x6e4a4d 0x471d01
# 0x46c274 internal/poll.runtime_pollWait+0x54 /usr/local/go/src/runtime/netpoll.go:222
# 0x4da264 internal/poll.(*pollDesc).wait+0x44 /usr/local/go/src/internal/poll/fd_poll_runtime.go:87
# 0x4db354 internal/poll.(*pollDesc).waitRead+0x1d4 /usr/local/go/src/internal/poll/fd_poll_runtime.go:92
# 0x4db336 internal/poll.(*FD).Read+0x1b6 /usr/local/go/src/internal/poll/fd_unix.go:166
# 0x5c1e0e net.(*netFD).Read+0x4e /usr/local/go/src/net/fd_posix.go:55
# 0x5d4270 net.(*conn).Read+0x90 /usr/local/go/src/net/net.go:183
# 0x6268c2 crypto/tls.(*atLeastReader).Read+0x62 /usr/local/go/src/crypto/tls/conn.go:776
# 0x50c03d bytes.(*Buffer).ReadFrom+0xbd /usr/local/go/src/bytes/buffer.go:204
# 0x626b32 crypto/tls.(*Conn).readFromUntil+0xf2 /usr/local/go/src/crypto/tls/conn.go:798
# 0x623974 crypto/tls.(*Conn).readRecordOrCCS+0x114 /usr/local/go/src/crypto/tls/conn.go:605
# 0x629d44 crypto/tls.(*Conn).readRecord+0x164 /usr/local/go/src/crypto/tls/conn.go:573
# 0x629d4f crypto/tls.(*Conn).Read+0x16f /usr/local/go/src/crypto/tls/conn.go:1276
# 0x6de758 net/http.(*connReader).Read+0x1b8 /usr/local/go/src/net/http/server.go:800
# 0x5110a1 bufio.(*Reader).Read+0x141 /usr/local/go/src/bufio/bufio.go:213
# 0x13dac64 sigs.k8s.io/apiserver-network-proxy/pkg/server.(*Tunnel).ServeHTTP+0xa24 /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/tunnel.go:129
# 0x6e9522 net/http.serverHandler.ServeHTTP+0xa2 /usr/local/go/src/net/http/server.go:2887
# 0x6e4a4c net/http.(*conn).serve+0x8cc /usr/local/go/src/net/http/server.go:1952
4 @ 0x43aec5 0x4068cf 0x40654b 0x934c05 0x4710e1
# 0x934c04 sigs.k8s.io/apiserver-network-proxy/pkg/agent.(*Client).proxyToRemote+0x4c4 /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/agent/client.go:503
4 @ 0x43aec5 0x43351b 0x46b9f5 0x4d2bc5 0x4d3cb5 0x4d3c97 0x5a646f 0x5b88d1 0x9341d8 0x4710e1
# 0x46b9f4 internal/poll.runtime_pollWait+0x54 /usr/local/go/src/runtime/netpoll.go:222
# 0x4d2bc4 internal/poll.(*pollDesc).wait+0x44 /usr/local/go/src/internal/poll/fd_poll_runtime.go:87
# 0x4d3cb4 internal/poll.(*pollDesc).waitRead+0x1d4 /usr/local/go/src/internal/poll/fd_poll_runtime.go:92
# 0x4d3c96 internal/poll.(*FD).Read+0x1b6 /usr/local/go/src/internal/poll/fd_unix.go:166
# 0x5a646e net.(*netFD).Read+0x4e /usr/local/go/src/net/fd_posix.go:55
# 0x5b88d0 net.(*conn).Read+0x90 /usr/local/go/src/net/net.go:183
# 0x9341d7 sigs.k8s.io/apiserver-network-proxy/pkg/agent.(*Client).remoteToProxy+0xd7 /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/agent/client.go:478
The grpc quota writer is known to block writes if the other end has stopped reading (or is reading too slowly). Need to work out if that's what's happening, which end is the other end, and see why.
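To make that concrete, here is a stripped-down sketch of the quota idea (hypothetical code, not the real grpc-go writeQuota): each write consumes quota, quota is only replenished when the receiver reads, and once the window is exhausted the sending goroutine parks exactly like the stack trace above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// writeQuota is a simplified stand-in for gRPC's per-stream flow control window.
type writeQuota struct {
	mu    sync.Mutex
	cond  *sync.Cond
	quota int // bytes the sender is still allowed to write
}

func newWriteQuota(n int) *writeQuota {
	w := &writeQuota{quota: n}
	w.cond = sync.NewCond(&w.mu)
	return w
}

// get blocks until at least n bytes of quota are available.
func (w *writeQuota) get(n int) {
	w.mu.Lock()
	defer w.mu.Unlock()
	for w.quota < n {
		w.cond.Wait() // this is where the stuck goroutine sits
	}
	w.quota -= n
}

// replenish is called when the receiver reports it has consumed n bytes.
func (w *writeQuota) replenish(n int) {
	w.mu.Lock()
	w.quota += n
	w.mu.Unlock()
	w.cond.Broadcast()
}

func main() {
	q := newWriteQuota(64 * 1024) // 64 KiB window, like the HTTP/2 default

	done := make(chan struct{})
	go func() {
		for i := 0; i < 32; i++ {
			q.get(4096) // each DATA packet consumes quota
			fmt.Println("sent packet", i)
		}
		close(done)
	}()

	// Simulate a receiver that never reads: replenish() is never called,
	// so after 16 packets the sender blocks inside get(), just like SendMsg.
	time.Sleep(100 * time.Millisecond)
	select {
	case <-done:
		fmt.Println("all packets sent")
	case <-time.After(500 * time.Millisecond):
		fmt.Println("sender is blocked waiting for write quota")
	}
}
```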
Hi @cheftako, could you be more specific about the grpc quota writer? What's the impact on ANP? I suspect #268 is similar to this one.
@cheftako did you happen to find anything on this? We are still seeing problems.
This PR aims to help start fixing the problem (by randomizing across a pool of multiple redundant instances that can serve the same IPs/hosts): https://github.com/kubernetes-sigs/apiserver-network-proxy/pull/272/files
Let me share our use-case and how we use the agent-identifier.
So the idea is the following:
- when the Kubernetes API server asks for logs, it goes to the node's kubelet, e.g. https://10.0.2.12:10250/logs...
- the IP 10.0.2.12 is the node private IP, which is registered as the agent identifier ipv4=10.0.2.12
- this also means there is always exactly one agent which handles a specific host - in this case random selection does not help
- why do we want this? It is always good to route directly to the right agent if we know the destination. For a multizone cluster, it is also beneficial not to send traffic between the zones.
I'm also trying to figure out what is happening here, and I have some interesting findings that I cannot yet explain. The main thing we have found is that we set too low a CPU limit for the agents (50m), based on this: https://github.com/kubernetes/kubernetes/blob/v1.21.0/cluster/gce/addons/konnectivity-agent/konnectivity-agent-ds.yaml#L49
But that limit is way too low. If I remove the limit, I basically cannot repro the issue anymore. My test is the following: copy files of different sizes to a pod:
for i in 1 3 5 10 20 30 40 50
do
log "Copy to pod: ${pod_to_copy}"
head -c "${i}m" </dev/urandom > "${test_file}-${i}M"
kubectl cp "${test_file}-${i}M" "${pod_to_copy}:/tmp/${test_file}-${i}M"
kubectl exec "${pod_to_copy}" -- ls /tmp | grep -q "${test_file}-${i}M"
done
When the CPU limit is:
- 50m - it always fails
- 200m - it fails with the bigger files
- 1 core - it works well so far
- no limit - it works well so far
I'm also checking the metrics (kubectl top), but it does not show big usage - around ~200m.
Agent settings:
- --agent-identifiers=ipv4=$(HOST_IP)
Server settings:
- --mode=http-connect
- --proxy-strategies=destHost,default
Version is 0.0.24.
What I still don't understand: when the limit is low (50m) and the destHost proxy strategy is not used, it works even with big files - of course, it is quite slow, but it works. I'm still trying to figure out how that is possible. Or do you have any idea about that?
Thank you! Adam
I have new test results, which show the following:
- using --proxy-strategies=destHost,default, with no CPU limit on the Konnectivity agents, the kubectl cp test fails after 4 runs
- using --proxy-strategies=default, with no CPU limit on the Konnectivity agents, the kubectl cp test is successful even after 50 runs
I will execute the same test when the cluster has only one node - to see whether destHost has any effect in that case.
Thank you! Adam
Okay, I see the following:
- in the konnectivity-server, the last transfer starts fine:
I1021 13:51:28.314586 1 tunnel.go:42] "Received request for host" method="CONNECT" host="10.211.145.183:10250" userAgent=""
I1021 13:51:28.314643 1 tunnel.go:44] "TLS" commonName="konnectivity-client"
I1021 13:51:28.314907 1 tunnel.go:78] "Set pending(rand=900783470661715446) to &{0xc000572320 0xc000210100 {} 0x5253a0 true false false false 0 {0 0} <nil> {0xc0001dac40 map[] true true} map[] false 0 -1 200 false false [] 0 [84 104 117 44 32 50 49 32 79 99 116 32 50 48 50 49 32 49 51 58 53 49 58 50 56 32 71 77 84] [0 0 0 0 0 0 0 0 0 0] [50 48 48] 0xc000550d20 0}"
I1021 13:51:28.314968 1 desthost_backend_manager.go:49] "Get the backend through the DestHostBackendManager" destHost="10.211.145.183"
I1021 13:51:28.315076 1 tunnel.go:103] "DIAL_REQ sent to backend"
I1021 13:51:28.351718 1 server.go:712] "Received DIAL_RSP" random=900783470661715446 agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=448
I1021 13:51:28.351798 1 server.go:269] "Register frontend for agent" frontend=&{Mode:http-connect Grpc:<nil> HTTP:0xc00001ce00 CloseHTTP:0x143b6a0 connected:0xc00052c3c0 connectID:448 agentID:7e0f66a9-efde-44ec-8d13-1f9fd8345e31 start:{wall:13858836059699697597 ext:12587356314015 loc:0x223cee0} backend:0xc00053a198} agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=448
I1021 13:51:28.351835 1 tunnel.go:137] "Starting proxy to host" host="10.211.145.183:10250"
I1021 13:51:28.351943 1 tunnel.go:170] "Forwarding data on tunnel to agent" bytes=254 totalBytes=254 agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=448
The transfer starts, and this is the last log regarding connectionID=448:
I1021 13:51:28.901338 1 tunnel.go:170] "Forwarding data on tunnel to agent" bytes=4096 totalBytes=3628308 agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=448
I1021 13:51:28.901367 1 tunnel.go:170] "Forwarding data on tunnel to agent" bytes=4096 totalBytes=3632404 agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=448
I1021 13:51:28.901414 1 tunnel.go:170] "Forwarding data on tunnel to agent" bytes=4096 totalBytes=3636500 agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=448
Then there is nothing about the connection closing. From the agent's point of view:
I1021 13:51:28.333167 1 client.go:375] "received DIAL_REQ"
I1021 13:51:28.369890 1 client.go:438] "received DATA" connectionID=448
I1021 13:51:28.369991 1 client.go:527] "write to remote" connectionID=448 lastData=254
I1021 13:51:28.415230 1 client.go:438] "received DATA" connectionID=448
I1021 13:51:28.415279 1 client.go:438] "received DATA" connectionID=448
I1021 13:51:28.415725 1 client.go:527] "write to remote" connectionID=448 lastData=2787
I1021 13:51:28.416603 1 client.go:527] "write to remote" connectionID=448 lastData=551
I1021 13:51:28.464363 1 client.go:438] "received DATA" connectionID=448
I1021 13:51:28.464414 1 client.go:438] "received DATA" connectionID=448
I1021 13:51:28.464431 1 client.go:438] "received DATA" connectionID=448
...
I1021 13:51:28.921415 1 client.go:527] "write to remote" connectionID=448 lastData=26
I1021 13:51:28.921416 1 client.go:438] "received DATA" connectionID=353
I1021 13:51:28.921473 1 client.go:527] "write to remote" connectionID=448 lastData=26
I1021 13:51:28.921491 1 client.go:527] "write to remote" connectionID=353 lastData=39
I1021 13:51:28.921500 1 client.go:527] "write to remote" connectionID=448 lastData=4096
I1021 13:51:28.921535 1 client.go:527] "write to remote" connectionID=448 lastData=4096
I1021 13:51:28.921594 1 client.go:527] "write to remote" connectionID=448 lastData=4096
I1021 13:51:28.921628 1 client.go:527] "write to remote" connectionID=448 lastData=4096
That's the end; I don't see closing logs either.
Then the next kubelet-related request comes, and I see only these logs:
I1021 14:02:46.886012 1 tunnel.go:42] "Received request for host" method="CONNECT" host="10.211.145.183:10250" userAgent=""
I1021 14:02:46.886053 1 tunnel.go:44] "TLS" commonName="konnectivity-client"
I1021 14:02:46.886218 1 tunnel.go:78] "Set pending(rand=6648738534997005833) to &{0xc0005723c0 0xc0000be800 {} 0x5253a0 true false false false 0 {0 0} <nil> {0xc0001da7e0 map[] true true} map[] false 0 -1 200 false false [] 0 [84 104 117 44 32 50 49 32 79 99 116 32 50 48 50 49 32 49 52 58 48 50 58 52 54 32 71 77 84] [0 0 0 0 0 0 0 0 0 0] [50 48 48] 0xc000526fc0 0}"
I1021 14:02:46.886292 1 desthost_backend_manager.go:49] "Get the backend through the DestHostBackendManager" destHost="10.211.145.183"
Here I think it is stuck in the grpc SendMsg call: https://github.com/kubernetes-sigs/apiserver-network-proxy/blob/master/pkg/server/tunnel.go#L99
I have added a log message after this line, but it is never printed.
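As an alternative check, something like the following watchdog could confirm that the call really is parked inside SendMsg (a debugging sketch with a hypothetical helper, not part of ANP):

```go
package main

import (
	"log"
	"time"
)

// sendWithLog wraps a Send-style call and logs if it stays blocked for too long,
// which is one way to confirm the goroutine is parked inside grpc's SendMsg.
func sendWithLog(name string, send func() error) error {
	done := make(chan error, 1)
	go func() { done <- send() }()
	select {
	case err := <-done:
		return err
	case <-time.After(10 * time.Second):
		log.Printf("%s blocked for >10s (likely waiting for write quota in SendMsg)", name)
		return <-done // keep waiting for the real result
	}
}

func main() {
	// A stand-in send that stays blocked longer than the threshold.
	err := sendWithLog("backend.Send", func() error {
		time.Sleep(15 * time.Second)
		return nil
	})
	log.Println("send finally returned:", err)
}
```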
I also see that other service traffic works:
I1021 13:56:34.135441 1 tunnel.go:42] "Received request for host" method="CONNECT" host="172.21.144.88:443" userAgent=""
I1021 13:56:34.135486 1 tunnel.go:44] "TLS" commonName="konnectivity-client"
I1021 13:56:34.135657 1 tunnel.go:78] "Set pending(rand=212649966676544707) to &{0xc0005721e0 0xc0000be400 {} 0x5253a0 true false false false 0 {0 0} <nil> {0xc0003421c0 map[] true true} map[] false 0 -1 200 false false [] 0 [84 104 117 44 32 50 49 32 79 99 116 32 50 48 50 49 32 49 51 58 53 54 58 51 52 32 71 77 84] [0 0 0 0 0 0 0 0 0 0] [50 48 48] 0xc0004f19d0 0}"
I1021 13:56:34.135685 1 desthost_backend_manager.go:49] "Get the backend through the DestHostBackendManager" destHost="172.21.144.88"
I1021 13:56:34.135758 1 tunnel.go:103] "DIAL_REQ sent to backend"
I1021 13:56:34.152607 1 tunnel.go:170] "Forwarding data on tunnel to agent" bytes=39 totalBytes=3953 agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=449
I1021 13:56:34.172690 1 server.go:712] "Received DIAL_RSP" random=212649966676544707 agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=453
I1021 13:56:34.172793 1 server.go:269] "Register frontend for agent" frontend=&{Mode:http-connect Grpc:<nil> HTTP:0xc00001d880 CloseHTTP:0x143b6a0 connected:0xc0001ec540 connectID:453 agentID:7e0f66a9-efde-44ec-8d13-1f9fd8345e31 start:{wall:13858836388085405434 ext:12893177022994 loc:0x223cee0} backend:0xc00053a1b0} agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=453
I1021 13:56:34.172834 1 tunnel.go:137] "Starting proxy to host" host="172.21.144.88:443"
I1021 13:56:34.172906 1 tunnel.go:170] "Forwarding data on tunnel to agent" bytes=296 totalBytes=296 agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=453
...
I1021 14:02:04.612720 1 tunnel.go:148] "EOF from host" host="172.21.144.88:443"
I1021 14:02:04.612747 1 tunnel.go:177] "Stopping transfer to host" host="172.21.144.88:443" agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=453
I1021 14:02:04.649387 1 server.go:756] "Received CLOSE_RSP" serverID="0ecd9419-3bc4-4dce-8b9b-d87e23a61d96" agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=453
I1021 14:02:04.649512 1 server.go:766] "CLOSE_RSP sent to frontend" connectionID=453
I1021 14:02:04.649602 1 server.go:290] "Remove frontend for agent" frontend=&{Mode:http-connect Grpc:<nil> HTTP:0xc00001d880 CloseHTTP:0x143b6a0 connected:0xc0001ec540 connectID:453 agentID:7e0f66a9-efde-44ec-8d13-1f9fd8345e31 start:{wall:13858836388085405434 ext:12893177022994 loc:0x223cee0} backend:0xc00053a1b0} agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=453
I1021 14:02:04.649629 1 server.go:769] "Close streaming" agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=453
The agent also finishes fine:
I1021 14:02:04.631057 1 client.go:438] "received DATA" connectionID=453
I1021 14:02:04.631090 1 client.go:449] "received CLOSE_REQ" connectionID=453
I1021 14:02:04.631092 1 client.go:395] "close connection" connectionID=453
What do you think? @caesarxuchao @cheftako @jkh52
Thank you very much! Adam
I have also checked for the log "Receive channel on Proxy is full", but it does not appear (yes, I have enabled it in the parameters).
Thanks! Adam
A similar finding that GRPC itself might get stuck, like #268, but I think they are triggered in different places.
Sorry I've been quiet on this one. I've added a few optimizations on the channels, but while they seem to let more data through on the kubectl cp, they are not fixing the problem. At this point it seems clear the issue has to do with the following stuck goroutine.
goroutine 1551 [select, 4 minutes]:
google.golang.org/grpc/internal/transport.(*writeQuota).get(0xc0003985a0, 0x23)
/go/pkg/mod/google.golang.org/[email protected]/internal/transport/flowcontrol.go:59 +0x74
google.golang.org/grpc/internal/transport.(*http2Server).Write(0xc000450000, 0xc0002a9680, {0xc0002cd520, 0x5, 0x5}, {0xc00030e7c0, 0x1e, 0x1e}, 0x0)
/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:1037 +0x23b
google.golang.org/grpc.(*serverStream).SendMsg(0xc0000fb5c0, {0x1422700, 0xc0002add40})
/go/pkg/mod/google.golang.org/[email protected]/stream.go:1510 +0x1cc
sigs.k8s.io/apiserver-network-proxy/proto/agent.(*agentServiceConnectServer).Send(0x506225, 0xc0003d0800)
/go/src/sigs.k8s.io/apiserver-network-proxy/proto/agent/agent.pb.go:198 +0x2b
sigs.k8s.io/apiserver-network-proxy/pkg/server.(*backend).Send(0xc00031f3e0, 0x6)
/go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/backend_manager.go:88 +0x94
sigs.k8s.io/apiserver-network-proxy/pkg/server.(*Tunnel).ServeHTTP(0xc0003079a8, {0x16ad3a8, 0xc0004c07e0}, 0xc0000e0f00)
/go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/tunnel.go:166 +0x1a9a
net/http.serverHandler.ServeHTTP({0xc0002dc4e0}, {0x16ad3a8, 0xc0004c07e0}, 0xc0000e0f00)
/usr/local/go/src/net/http/server.go:2878 +0x43b
net/http.(*conn).serve(0xc000231720, {0x16b0e98, 0xc00038ee10})
/usr/local/go/src/net/http/server.go:1929 +0xb08
created by net/http.(*Server).Serve
/usr/local/go/src/net/http/server.go:3033 +0x4e8
That would suggest https://github.com/grpc/grpc-go/issues/2427. However, that essentially says the other end (the agent) is not reading. In this reproduction of the issue I have exactly 1 agent, and if I look at its goroutines I find the following.
goroutine 23 [select]:
google.golang.org/grpc/internal/transport.(*recvBufferReader).readClient(0xc00006b310, {0xc00000cfb8, 0x5, 0x5})
/go/pkg/mod/google.golang.org/[email protected]/internal/transport/transport.go:190 +0xaa
google.golang.org/grpc/internal/transport.(*recvBufferReader).Read(0xc00006b310, {0xc00000cfb8, 0xc000028378, 0xc0002715b8})
/go/pkg/mod/google.golang.org/[email protected]/internal/transport/transport.go:170 +0x147
google.golang.org/grpc/internal/transport.(*transportReader).Read(0xc0001db110, {0xc00000cfb8, 0xc000271630, 0x855847})
/go/pkg/mod/google.golang.org/[email protected]/internal/transport/transport.go:484 +0x32
io.ReadAtLeast({0xa98980, 0xc0001db110}, {0xc00000cfb8, 0x5, 0x5}, 0x5)
/usr/local/go/src/io/io.go:328 +0x9a
io.ReadFull(...)
/usr/local/go/src/io/io.go:347
google.golang.org/grpc/internal/transport.(*Stream).Read(0xc00010eea0, {0xc00000cfb8, 0x5, 0x5})
/go/pkg/mod/google.golang.org/[email protected]/internal/transport/transport.go:468 +0xa5
google.golang.org/grpc.(*parser).recvMsg(0xc00000cfa8, 0x400000)
/go/pkg/mod/google.golang.org/[email protected]/rpc_util.go:559 +0x47
google.golang.org/grpc.recvAndDecompress(0x4c1f05, 0xc00010eea0, {0x0, 0x0}, 0x400000, 0x0, {0x0, 0x0})
/go/pkg/mod/google.golang.org/[email protected]/rpc_util.go:690 +0x66
google.golang.org/grpc.recv(0x2, {0x7f74a39086d8, 0xeb0338}, 0xc00010d080, {0x0, 0x0}, {0x9b8720, 0xc0000da900}, 0xc000271858, 0x0, ...)
/go/pkg/mod/google.golang.org/[email protected]/rpc_util.go:758 +0x6e
google.golang.org/grpc.(*csAttempt).recvMsg(0xc00011ef20, {0x9b8720, 0xc0000da900}, 0xc000120180)
/go/pkg/mod/google.golang.org/[email protected]/stream.go:975 +0x2b0
google.golang.org/grpc.(*clientStream).RecvMsg.func1(0x203000)
/go/pkg/mod/google.golang.org/[email protected]/stream.go:826 +0x25
google.golang.org/grpc.(*clientStream).withRetry(0xc000150600, 0xc000271a20, 0xc0002719f0)
/go/pkg/mod/google.golang.org/[email protected]/stream.go:680 +0x2f6
google.golang.org/grpc.(*clientStream).RecvMsg(0xc000150600, {0x9b8720, 0xc0000da900})
/go/pkg/mod/google.golang.org/[email protected]/stream.go:825 +0x11f
sigs.k8s.io/apiserver-network-proxy/proto/agent.(*agentServiceConnectClient).Recv(0xc0002a0100)
/go/src/sigs.k8s.io/apiserver-network-proxy/proto/agent/agent.pb.go:159 +0x4c
sigs.k8s.io/apiserver-network-proxy/pkg/agent.(*Client).Recv(0xc0000b7940)
/go/src/sigs.k8s.io/apiserver-network-proxy/pkg/agent/client.go:285 +0xa5
sigs.k8s.io/apiserver-network-proxy/pkg/agent.(*Client).Serve(0xc0001680c0)
/go/src/sigs.k8s.io/apiserver-network-proxy/pkg/agent/client.go:367 +0x393
created by sigs.k8s.io/apiserver-network-proxy/pkg/agent.(*ClientSet).syncOnce
/go/src/sigs.k8s.io/apiserver-network-proxy/pkg/agent/clientset.go:197 +0x405
So it would appear it is calling RecvMsg. Seems like it's time to engage the GRPC folks.
Hi All!
I have run my tests with the latest v0.0.26 (compiled with golang 1.16.10), and I still see the issue (although it has newer GRPC).
Thank you! Adam
@cheftako @andrewsykim @jkh52
I commented on #324 , but let me repeat it:
I think we have some idea why kubectl cp eventually fails: the issue comes when we have multiple proxy strategies. The backend (agent) connections are guarded by a mutex. But if there are multiple strategies, the same backend with the same connection will be registered to multiple backend managers. To illustrate this:
BackendManager - DestHost
- Backend1 (agent identifier, conn, mutex)
BackendManager - Default
- Backend1* (conn, mutex*)
We have two backend instances (one for the DestHost backend manager, one for the Default backend manager), but with the same conn. In this case it is possible that different goroutines write to the same connection at the same time, as there is no protection between the instances (each mutex only protects its own instance), so errors can happen. I could submit an issue about this and discuss the theory there.
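A rough sketch of the hazard I mean (hypothetical types, not the real backend_manager.go code): each wrapper's mutex only serializes its own callers, so two wrappers around the same stream can still write concurrently.

```go
package main

import (
	"fmt"
	"sync"
)

// stream stands in for the shared agent gRPC stream. A real stream's SendMsg
// is not safe for concurrent use unless all callers are serialized.
type stream struct{}

func (s *stream) send(pkt string) {
	fmt.Println("send:", pkt)
}

// backend mirrors the per-manager wrapper: its mutex only protects this wrapper.
type backend struct {
	mu   sync.Mutex
	conn *stream
}

func (b *backend) Send(pkt string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.conn.send(pkt)
}

func main() {
	shared := &stream{}

	// The same conn registered with two managers -> two wrappers, two mutexes.
	destHostBackend := &backend{conn: shared}
	defaultBackend := &backend{conn: shared}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); destHostBackend.Send("DATA connectionID=448") }()
	go func() { defer wg.Done(); defaultBackend.Send("DATA connectionID=453") }()
	wg.Wait()
	// Both Sends can reach shared.send at the same time: neither wrapper's
	// mutex excludes the other, which is exactly the theory above.
}
```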
What do you think about this theory?
Thanks!