
Very frequent network timeouts (especially on Windows)

Open · ItalyPaleAle opened this issue 2 years ago · 17 comments

Description

We (Dapr project) are seeing very weird issues with our GitHub Actions runners, both Windows and Linux, where the network seems very unstable.

For example, all these failures happened within the last 2 days:

  • Lots of timeouts while trying to connect to external endpoints. Over the last 2 days, these Actions failed while trying to push a Docker image to an Azure Container Registry (all from Windows agents):
    • https://github.com/dapr/dapr/runs/7117849420?check_suite_focus=true
    • https://github.com/dapr/dapr/runs/7103356688?check_suite_focus=true
    • https://github.com/dapr/dapr/runs/7123975875?check_suite_focus=true
  • More timeouts trying to connect to external HTTP(S) endpoints (all from Linux agents; note that even when the tests below say "Windows", the agents are Linux-based):
    • https://github.com/dapr/dapr/runs/7135574988?check_suite_focus=true
    • https://github.com/dapr/dapr/runs/7138967236?check_suite_focus=true
    • https://github.com/dapr/dapr/runs/7121442720?check_suite_focus=true
    • https://github.com/dapr/dapr/runs/7108051120?check_suite_focus=true
    • https://github.com/dapr/dapr/runs/7106417880?check_suite_focus=true

This is not the first time we've had issues with network request timeouts from our Actions runners. A few months ago we made significant improvements when we realized that we were being impacted by SNAT port exhaustion, so we modified our tests to better reuse TCP connections. We are at the point where we can't do much more.

The network instability is a significant challenge for us right now, as it's causing a lot of flakiness in our tests.

Platforms affected

  • [ ] Azure DevOps
  • [X] GitHub Actions

Virtual environments affected

  • [ ] Ubuntu 18.04
  • [X] Ubuntu 20.04
  • [ ] Ubuntu 22.04
  • [ ] macOS 10.15
  • [ ] macOS 11
  • [ ] macOS 12
  • [X] Windows Server 2019
  • [ ] Windows Server 2022

Image version and build link

20220626.1

Examples in the description

Is it regression?

no?

Expected behavior

Network should be more stable

Actual behavior

Network is very unstable

Repro steps

Tests are flaky

ItalyPaleAle avatar Jun 30 '22 23:06 ItalyPaleAle

Hello @ItalyPaleAle. We will take a look at it.

al-cheb avatar Jul 01 '22 07:07 al-cheb

@ItalyPaleAle hello and sorry for delays!

I have not found anything suspicious while playing with ACR and a few more services. Have you had problems since the report was opened? Could you also please add the following snippets to your pipeline?

  - run: |
      sudo tcpdump -nn -i any -w sntp.cap &
      sleep 1

  <your problematic steps here>

and then


  - name: Upload capture
    if: always()
    run: |
      sleep 1
      sudo kill -2 $(pgrep tcpdump)
      sleep 1
      sudo curl -F "file=@sntp.cap" https://file.io/?expires=1w

and attach your dump somewhere

(We did something similar in https://github.com/actions/virtual-environments/issues/5615.) That way we could retrieve a potentially problematic dump.

mikhailkoliada avatar Jul 12 '22 10:07 mikhailkoliada

Thanks @mikhailkoliada I will add that to our pipelines.

What's the equivalent command for Windows?
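
Perhaps something along these lines with the built-in netsh trace tooling would be the rough Windows counterpart, though that's just a guess on my part (the trace file path is arbitrary, and the resulting .etl file would still need converting to pcapng with a tool such as etl2pcapng):

    # start capturing before the problematic steps
    netsh trace start capture=yes tracefile=C:\temp\capture.etl

    # <your problematic steps here>

    # stop the capture; convert the .etl afterwards (e.g. with etl2pcapng)
    netsh trace stop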

ItalyPaleAle avatar Jul 12 '22 16:07 ItalyPaleAle

@mikhailkoliada please see these tests that failed with network timeouts. TCP dumps have been collected and you can find them as artifacts (see the linux_amd64_tcpdump or windows_amd64_tcpdump).

Note that all tests run on Linux agents, even those that are called "windows":

  • https://github.com/dapr/dapr/actions/runs/2678522535 -> Windows failed
  • https://github.com/dapr/dapr/actions/runs/2677333835 -> Windows failed

ItalyPaleAle avatar Jul 15 '22 22:07 ItalyPaleAle

@ItalyPaleAle Hello and sorry for the delay! It took me a while to go through the large logs. I have not identified any network problems on the runner side, so I think this may be an Azure problem, and since Azure is a big SDN network we will not be able to catch such problems on our end.

cc @chkimes here. Chad, I would be glad if you could take a second look; I might be missing something.

mikhailkoliada avatar Jul 26 '22 10:07 mikhailkoliada

Some of the timeouts in the logs appear to be HTTP related, e.g.:

Client.Timeout exceeded while awaiting headers

However, others do look to be TCP timeouts. From one of the packet captures, it looks pretty clearly like SNAT exhaustion. Here's a breakdown of connections to remote endpoints:

      1 185.125.188.54:443
      1 20.106.111.189:3000
      1 20.106.113.129:3000
      1 20.106.115.189:3000
      1 20.125.110.152:3000
      1 20.125.111.125:3000
      1 20.125.111.25:3000
      1 20.125.85.254:3000
      1 20.125.86.10:3000
      2 172.217.14.81:443
      2 20.106.80.164:3000
      2 20.118.154.106:3000
      2 20.125.111.18:3000
      2 20.125.111.70:3000
      2 20.150.154.231:3000
      2 20.25.175.246:3000
      3 20.106.104.195:3000
      3 20.106.116.122:3000
      3 20.106.117.27:3000
      3 20.106.86.93:3000
      3 20.118.152.42:3000
      3 20.125.110.141:3000
      3 20.125.110.189:3000
      3 20.125.111.203:3000
      3 20.150.153.111:3000
      3 20.150.225.130:443
      4 13.33.21.70:443
      4 185.199.111.153:443
      4 20.118.152.143:3000
      4 20.125.110.159:3000
      4 20.125.111.213:3000
      4 20.125.66.60:3000
      5 20.25.175.233:3000
     13 20.125.110.151:3000
     13 20.125.110.194:3000
     23 20.106.115.124:3000
     34 13.107.42.16:443
     38 169.254.169.254:80
     74 20.40.25.138:443
    541 168.63.129.16:80
   1171 168.63.129.16:32526

1k connections to the same endpoint in a short window will absolutely trigger exhaustion. This run was ~50 minutes long, though, and I haven't looked at how the connection attempts are concentrated over time, but just given the volume I believe that to be the most likely culprit. The insidious thing about SNAT exhaustion is that a closed connection still reserves a port for a total of 4 minutes to ensure TCP state is fully cleared along the path, so even keeping the number of concurrent connections low isn't a solution if connections are being closed and re-established at a high rate.
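
For reference, a per-endpoint breakdown like the one above can be approximated from a capture with a tshark one-liner along these lines (the capture file name here is just a placeholder):

    # count new outbound connections (bare SYNs) per destination IP and port
    tshark -r capture.cap -Y 'tcp.flags.syn == 1 && tcp.flags.ack == 0' \
      -T fields -e ip.dst -e tcp.dstport | sort | uniq -c | sort -n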

We are taking on some work that should result in higher SNAT limits for our VMs, but the timeline for delivery is not short (my estimate is 6-9 months, but perhaps sooner).

chkimes avatar Jul 26 '22 19:07 chkimes

Thanks for looking into the captures and for confirming.

Our tests do indeed make a lot of requests to the same endpoint. The runtime of the actual tests is around 30-35 mins (excluding the initial setup), and new connections do tend to come in bursts.

We are well aware of SNAT exhaustion being an issue on the Actions agents. Three months ago, right before we learnt that it was even a thing, we attempted a fix that actually made the issue worse by not reusing any TCP sockets, and that's how we discovered the issue.

At this point, we've done essentially all we can do in our application, short of moving the test runner directly into the K8s cluster (which we have considered, but it would be a fairly sizable effort).

Do you have any suggestions we can look into to try to reduce the number of ports we use?

ItalyPaleAle avatar Jul 26 '22 20:07 ItalyPaleAle

How many new processes are you creating during your testing? Connections will get reused within the scope of an existing process (and HTTP Client or Transport instance), but those connections can't be shared across processes.

Also, what is your concurrency level when running tests? In HTTP/1.0 and 1.1, connections are reused only if they are not currently in use. In HTTP/2 it depends on your particular client.
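
As a rough illustration of the per-process reuse point (with a hypothetical host, not your actual test endpoints): a single curl invocation given several URLs to the same host reuses one TCP connection, while invoking curl once per request opens and tears down a separate connection, consuming a fresh SNAT port, each time.

    # One process: the three requests share a single connection
    # (-v logs "Re-using existing connection" for the later requests)
    curl -sv https://example.com/a https://example.com/b https://example.com/c > /dev/null

    # One process per request: three separate connections, three SNAT ports
    for path in a b c; do curl -s "https://example.com/$path" > /dev/null; done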

chkimes avatar Jul 28 '22 19:07 chkimes

A very crude and likely undesirable solution is to put a delay of a few minutes between steps that are heavy on connection usage, or to organize the test order such that steps with heavy connection usage are spread out across the total execution time. Closed connections will reserve a port for 4 minutes, but after that the port should be free for reuse.

chkimes avatar Jul 28 '22 19:07 chkimes

@chkimes thanks for the feedback. I will discuss this with my team and we'll see what we can do.

Making tests slower would be a last-resort option. Even if we didn't have to pause for a full 4 minutes (perhaps 1 min between tests would already be helpful), we have invested a lot into making our tests faster, and this would be a move in the other direction.

An alternative approach I am considering is to create a tunnel to the K8s cluster. This way the Actions runner would have only one (persistent) connection to the cluster and wouldn't make thousands of calls. However it is to be seen how reliable that tunnel would be.

ItalyPaleAle avatar Jul 31 '22 17:07 ItalyPaleAle

> An alternative approach I am considering is to create a tunnel to the K8s cluster. This way the Actions runner would have only one (persistent) connection to the cluster and wouldn't make thousands of calls. However it is to be seen how reliable that tunnel would be.

Using a tunnel is clever, and it seems to me like it should work well for reducing outbound connections. The downside is pushing all your test traffic through a single endpoint, but if you're not bandwidth constrained then I bet it would be pretty effective. There are Azure-native solutions that you could use, or you could easily spin up a VM with WireGuard and plop it onto the same VNet as your cluster.
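
For what it's worth, a minimal sketch of the WireGuard-on-a-VM variant could look something like this (addresses, keys, and ranges are placeholders rather than a tested recipe):

    # generate a key pair on the VNet VM and on the runner
    wg genkey | tee privatekey | wg pubkey > publickey

    # runner-side /etc/wireguard/wg0.conf (placeholder values):
    #   [Interface]
    #   PrivateKey = <runner-private-key>
    #   Address    = 10.10.0.2/24
    #
    #   [Peer]
    #   PublicKey           = <vm-public-key>
    #   Endpoint            = <vm-public-ip>:51820
    #   AllowedIPs          = 10.0.0.0/16    # cluster/VNet address range
    #   PersistentKeepalive = 25

    # bring the tunnel up; traffic to AllowedIPs then rides a single UDP flow
    # (one SNAT port) instead of many short-lived TCP connections
    sudo wg-quick up wg0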

chkimes avatar Aug 01 '22 21:08 chkimes

I will let you know if we can do that and how it works. We shouldn't be constrained in terms of bandwidth: we make lots of requests but the payloads are very small (usually a few bytes at most).

I was thinking something like https://github.com/ivanmorenoj/k8s-wireguard which doesn't even require deploying a VM in the VNet.

ItalyPaleAle avatar Aug 01 '22 21:08 ItalyPaleAle

Interestingly, we started to see similar issues in the last few days. Sample: https://github.com/hetznercloud/terraform-provider-hcloud/pull/560/checks. It makes using GitHub Actions effectively impossible for us, so is there maybe a solution? In this case, we can no longer reach our API that generates an API token for further tests. We have this issue on multiple projects and can confirm that it is not related to the API itself.

LKaemmerling avatar Aug 23 '22 19:08 LKaemmerling

@LKaemmerling I tried to look at those checks, but they seemed to pass. Was there a specific check suite you had in mind? I see this one which had a failure:

https://github.com/hetznercloud/terraform-provider-hcloud/runs/7966084022?check_suite_focus=true

That's a TCP timeout that appears to have happened before any significant work had been done on the VM, which is very unusual unless there's a more widespread network outage (which we don't have any evidence of). What is the tt-service.hetzner.cloud endpoint, and how convinced are you of its availability?

    curl: (28) Failed to connect to tt-service.hetzner.cloud port 443: Connection timed out

chkimes avatar Aug 24 '22 21:08 chkimes

I think I might be experiencing this issue as well. Tcpdump can be found here: https://github.com/mac-chaffee/APSVIZ-Supervisor/actions/runs/3010474179

But I make fewer connections than the Dapr folks (fewer than 100 total new connections, counted by searching for SYNs in the packet trace). Do multiple GitHub Actions runners sometimes share the same NAT gateway, meaning that a "noisy neighbor" could use up all the SNAT ports? It sounds like that is possible from my uninformed reading of the docs here: https://docs.microsoft.com/en-us/azure/virtual-network/nat-gateway/troubleshoot-nat-connectivity#outbound-connectivity-not-scaled-out-enough

mac-chaffee avatar Sep 08 '22 18:09 mac-chaffee

> Do multiple GitHub Actions runners sometimes share the same NAT gateway, meaning that a "noisy neighbor" could use up all the SNAT ports?

They currently do not, though we don't make any guarantees about future designs for SNAT port allocation. I think in any case we'd want to make sure each job has some minimum number of SNAT ports available. Today that number should be 1024 (depending on Azure's current implementation, which is also not guaranteed).

I see 57 new connection attempts in that packet capture, so I don't think this is consistent with SNAT allocation failures. That behavior is typically very consistent, and when networking issues deviate from it we have often seen red herrings caused by target endpoint configuration rather than by Actions networking. Do you control this endpoint, and do you know if there are any firewalls or DDoS protection running?

chkimes avatar Sep 13 '22 16:09 chkimes

Hmm very interesting, thanks for clarifying!

Yes I control the destination network, no firewalls or DDoS protection. Essentially just a router and a switch straight to a physical machine. A packet trace on that machine shows the packets never arrive.

I'll do more searching, thanks!

mac-chaffee avatar Sep 13 '22 21:09 mac-chaffee

Hello! We have not observed the problem for a while; please report again if you think there is something more we could look into!

mikhailkoliada avatar Jun 14 '23 10:06 mikhailkoliada