
UDP networking unstable on macOS and Linux environments

Open cpq opened this issue 2 years ago • 33 comments

Description

Repo: https://github.com/cesanta/mongoose

The GitHub Actions (GA) workflow runs a unit test which makes a request to the time.google.com SNTP server, via UDP, to synchronise time. On the macOS environment, that request fails the majority of the time; even if the unit test is patched to repeat the request, all subsequent UDP requests seem to be dropped. Only a restart may fix that, and sometimes only after several attempts.

The flakiest environment is macOS, and recently Linux has started to show the same behavior.

Note: TCP requests to external hosts seem to be fine. Only UDP has problems.

Example of a failed test (in this case, the macOS environment failed): https://github.com/cesanta/mongoose/actions/runs/2388366976

Virtual environments affected

  • [ ] Ubuntu 18.04
  • [X] Ubuntu 20.04
  • [ ] Ubuntu 22.04
  • [ ] macOS 10.15
  • [X] macOS 11
  • [ ] macOS 12
  • [ ] Windows Server 2019
  • [ ] Windows Server 2022

Image version and build link

https://github.com/cesanta/mongoose/actions/runs/2388366976

macOS:

Current runner version: '2.291.1'
Operating System
Virtual Environment
  Environment: macos-11
  Version: 20220517.4
  Included Software: https://github.com/actions/virtual-environments/blob/macOS-11/20220517.4/images/macos/macos-11-Readme.md

Linux:

Current runner version: '2.291.1'
Operating System
  Ubuntu 20.04.4 LTS
Virtual Environment
  Environment: ubuntu-20.04
  Version: 20220515.1
  Included Software: https://github.com/actions/virtual-environments/blob/ubuntu20/20220515.1/images/linux/Ubuntu2004-Readme.md
  Image Release: https://github.com/actions/virtual-environments/releases/tag/ubuntu20%2F20220515.1

Is it regression?

No response

Expected behavior

UDP requests to time.google.com must go through.

Note: the same test, same requests, same everything, work flawlessly on the Windows environment - and most of the time on the Linux environment. They almost always fail on the macOS environment, and sometimes on the Linux environment.

Actual behavior

No response

Repro steps

Send UDP packets to time.google.com
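
For reference, a minimal standalone sketch of such a repro, assuming plain POSIX sockets (illustrative only, not the mongoose test itself):

#include <netdb.h>
#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
  // Resolve time.google.com and create a UDP socket
  struct addrinfo hints = {0}, *ai = NULL;
  hints.ai_family = AF_INET;
  hints.ai_socktype = SOCK_DGRAM;
  if (getaddrinfo("time.google.com", "123", &hints, &ai) != 0) return 1;
  int fd = socket(AF_INET, SOCK_DGRAM, 0);

  // 48-byte SNTP request: LI=0, VN=4, Mode=3 (client), rest zeroed.
  // 0x23 matches the first byte of the requests in the hex dumps below.
  unsigned char pkt[48] = {0x23};
  sendto(fd, pkt, sizeof(pkt), 0, ai->ai_addr, ai->ai_addrlen);

  // Wait up to 2 seconds for the server reply
  struct pollfd pfd = {fd, POLLIN, 0};
  if (poll(&pfd, 1, 2000) > 0 && (pfd.revents & POLLIN)) {
    unsigned char buf[48];
    printf("got %zd-byte SNTP reply\n", recv(fd, buf, sizeof(buf), 0));
  } else {
    printf("no SNTP reply (timeout)\n");
  }
  freeaddrinfo(ai);
  close(fd);
  return 0;
}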

cpq avatar May 26 '22 10:05 cpq

Hello @cpq, thank you for reporting this one. We will take a look.

AlexanderGitin avatar May 26 '22 12:05 AlexanderGitin

Thank you.

Here is an example of the Linux environment failing to talk to time.google.com via UDP:

https://github.com/cesanta/mongoose/actions/runs/2390068186

Multiple tests failed. Exactly the same tests succeeded on the Windows and macOS environments in this run.

cpq avatar May 26 '22 12:05 cpq

Is this a problem that is happening with all UDP traffic, or is it isolated to time.google.com? For example, what happens if you use time.windows.com instead?

chkimes avatar May 26 '22 14:05 chkimes

@chkimes, thank you for looking into this.

DNS works just fine, which is UDP over port 53. SNTP fails intermittently, which is UDP over port 123.

I have added one extra test to query udp://time.windows.com:123 before querying Google. Here is the run: https://github.com/cesanta/mongoose/actions/runs/2391532806

It fails on time.windows.com too. The same test executes fine from a local workstation, and within the same run on the Windows environment.

cpq avatar May 26 '22 15:05 cpq

Thanks - that's very interesting. Can you run a local VM packet capture while the run is executing? You can add these steps to the workflow:

- name: Start tcpdump
  run: |
    sudo tcpdump -nn -i any port 123 -w sntp.cap &
    sleep 1

# your steps here

- name: Upload capture
  if: always()
  run: |
    sleep 1
    sudo kill -2 $(pgrep tcpdump)
    sleep 1
    sudo curl -F "file=@sntp.cap" https://file.io/?expires=1w

The workflow logs will output a file.io link you can use to download the capture. Alternatively, you can upload it as an artifact of the job.

chkimes avatar May 26 '22 16:05 chkimes

Done. Run: https://github.com/cesanta/mongoose/runs/6613139507?check_suite_focus=true Capture link: https://www.file.io/gScD/download/j3Z58xRmiLcz - which is an empty capture; no packets were captured.

In order to double-check that tcpdump works, I've done a second run with port 53 included: https://github.com/cesanta/mongoose/runs/6613208125?check_suite_focus=true The saved file shows the port 53 capture, but nothing on port 123. So yeah, tcpdump works.

Does that mean that the traffic does not hit the interface? It looks like it gets dropped somewhere along the way. From the app's point of view, the socket send() is successful and does not return any error. 34 bytes of SNTP request were sent:

4fa90 3 sock.c:166:mg_send              2 0x3 0:0 34 err 2 (No such file or directory)   // 34 is the send() return code

cpq avatar May 26 '22 17:05 cpq

I think file.io is deleting the files once you've downloaded them, so I'm unable to view them. However, if there are no packets in one and packets in another, then that is definitely suspicious of client-side behavior (unrelated to the hosting environment). We should at least see outbound packets.

macOS hosting and networking is quite different; however, Windows and Linux are extremely similar. That there are differences between the latter two also indicates something client-side. However, I'm not immediately sure what.

Is the application picking a specific source port for the UDP traffic? Or is it using any OS-assigned high port?

chkimes avatar May 26 '22 20:05 chkimes

No specific port is chosen, so it's whatever ephemeral port the OS assigns. I haven't seen such an issue with DNS resolution, which also picks an ephemeral port. In all those tests, time.google.com gets resolved first, then immediately a UDP packet to it gets sent (using a different UDP socket). DNS gets through, SNTP does not. So I'd be surprised if the issue is about the source port.
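
(If it helps to rule this out independently, the OS-assigned source port can be logged right after the first send. A small sketch, assuming an IPv4 UDP socket fd:)

// Ask the kernel which ephemeral source port it picked for this socket
struct sockaddr_in local;
socklen_t slen = sizeof(local);
if (getsockname(fd, (struct sockaddr *) &local, &slen) == 0) {
  printf("source port: %d\n", (int) ntohs(local.sin_port));
}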

cpq avatar May 26 '22 20:05 cpq

The source port was a long shot, but easy enough to check. One of the older SNTP specs says to use port 123 for both src and dst:

https://datatracker.ietf.org/doc/html/rfc2030#section-4

   Both NTP and SNTP are clients of the User Datagram Protocol (UDP)
   [POS80], which itself is a client of the Internet Protocol (IP)
   [DAR81]. The structure of the IP and UDP headers is described in the
   cited specification documents and will not be detailed further here.
   The UDP port number assigned to NTP is 123, which should be used in
   both the Source Port and Destination Port fields in the UDP header.

If an NTP implementation was already listening on port 123, the kernel would block picking that as the source port. That also would not get captured by tcpdump, since the packet never hits an interface. However, if you're using ephemeral high ports then that's not a concern.
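
For illustration, a sketch of what a literal implementation of that spec would do, and why the failure would be invisible to tcpdump (assumes an IPv4 UDP socket fd; binding a port below 1024 also requires root):

struct sockaddr_in src = {0};
src.sin_family = AF_INET;
src.sin_port = htons(123);               // fixed source port per RFC 2030
src.sin_addr.s_addr = htonl(INADDR_ANY);
if (bind(fd, (struct sockaddr *) &src, sizeof(src)) < 0) {
  // Fails with EADDRINUSE if an NTP daemon already owns port 123;
  // no packet is ever sent, so nothing appears on the interface.
  perror("bind");
}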

Using the SNTP tool, I can see that communications to port 123 are behaving as I might expect on Linux:

https://github.com/chkimes/test-sntp/runs/6627294882?check_suite_focus=true

This result, coupled with the differing behavior on Linux and Windows (where, from a virtualization standpoint, the networking layer is identical), reinforces to me the idea that the hosted networking is behaving as expected. As for why this doesn't work now but used to work - I don't have an immediate answer. It is very likely related to a new image version, and the most likely candidates in my mind are:

  1. A new kernel with different behavior for some specific edge case.
  2. Newly installed or updated software that is somehow blocking your specific behavior.
  3. Newly updated dependency libraries that no longer work with your specific behavior.

chkimes avatar May 27 '22 14:05 chkimes

@chkimes thank you!

Actually, SNTP version 4, which was used in the first place, allows the source port to be anything.

I did some more debugging and I am getting somewhere. Apparently, the select() and poll() syscalls ignore the timeout value. In the log below, a DNS request is sent, then the app falls into a loop calling poll(). On each iteration, a timestamp in ms is printed. The timeout value is 50 ms. Notice that in the log, the timestamp does not change:

1810710e59e 2 sock.c:129:iolog          
-- 2 0.0.0.0:60086 -> 8.8.8.8:53  33
0000   00 01 01 00 00 01 00 00 00 00 00 00 04 74 69 6d   .............tim
0010   65 06 67 6f 6f 67 6c 65 03 63 6f 6d 00 00 01 00   e.google.com....
0020   01                                                .               
1810710e59e 3 sock.c:498:mg_iotest      n=2 ms=50   <-- just before  poll(), with timeout of 50 ms
1810710e59e 4 sock.c:580:mg_mgr_poll    2 -- tchrc
1810710e59e 4 sock.c:580:mg_mgr_poll    1 -- tchRc
1810710e59e 3 sock.c:498:mg_iotest      n=2 ms=50     <--- next poll() iteration - but the timestamp is the same
1810710e59e 4 sock.c:580:mg_mgr_poll    2 -- tchrc
1810710e59e 4 sock.c:580:mg_mgr_poll    1 -- tchR
1810710e59e 3 sock.c:498:mg_iotest      n=2 ms=50    <---- still the same!
1810710e59e 4 sock.c:580:mg_mgr_poll    2 -- tchrc
1810710e59e 4 sock.c:580:mg_mgr_poll    1 -- tchRc
1810710e59e 3 sock.c:498:mg_iotest      n=2 ms=50    <--- still the same! so instead of waiting for response, we busy-loop
...

The app has a limit on the number of iterations, so after 50 iterations it stops. Obviously, an sntp test program does not have such a count limit, and it succeeds, because it waits for a timeout based on a timestamp, not on an iteration limit.
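
(A sketch of that timestamp-based approach - illustrative, not the actual sntp tool source:)

#include <poll.h>
#include <time.h>

static long long now_ms(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (long long) ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

// Wait up to total_ms for fd to become readable, tolerating early returns
static int wait_readable(int fd, int total_ms) {
  long long deadline = now_ms() + total_ms;
  for (;;) {
    long long left = deadline - now_ms();
    if (left <= 0) return 0;                        // timed out
    struct pollfd pfd = {fd, POLLIN, 0};
    int n = poll(&pfd, 1, (int) left);
    if (n < 0) return -1;                           // poll() error
    if (n > 0 && (pfd.revents & POLLIN)) return 1;  // readable
    // Early or spurious wakeup: recompute the remaining time and retry
  }
}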

When I run the test on a workstation, poll() behaves correctly and actually waits for the requested 50 ms. Notice the timestamp change in the log, and a correct DNS resolution coming through:

-- 2 0.0.0.0:57101 -> 8.8.8.8:53  33
0000   00 01 01 00 00 01 00 00 00 00 00 00 04 74 69 6d   .............tim
0010   65 06 67 6f 6f 67 6c 65 03 63 6f 6d 00 00 01 00   e.google.com....
0020   01                                                .               
18107163925 3 sock.c:498:mg_iotest      n=2 ms=50     <--- before poll()
18107163942 4 sock.c:580:mg_mgr_poll    2 r- tchrc      <--- poll() returned, and see: the timestamp is different, poll() was waiting
18107163942 3 sock.c:291:read_conn      2 0x3 0:0 97 err 2 (No such file or directory)
18107163942 2 sock.c:129:iolog          
-- 2 0.0.0.0:57101 <- 8.8.8.8:53  97
0000   00 01 81 80 00 01 00 04 00 00 00 00 04 74 69 6d   .............tim
0010   65 06 67 6f 6f 67 6c 65 03 63 6f 6d 00 00 01 00   e.google.com....
0020   01 c0 0c 00 01 00 01 00 00 25 81 00 04 d8 ef 23   .........%.....#
0030   0c c0 0c 00 01 00 01 00 00 25 81 00 04 d8 ef 23   .........%.....#
0040   08 c0 0c 00 01 00 01 00 00 25 81 00 04 d8 ef 23   .........%.....#
0050   00 c0 0c 00 01 00 01 00 00 25 81 00 04 d8 ef 23   .........%.....#

So, apparently, the GA images ignore the timeout value for the select() and poll() syscalls and return immediately. Is there any reason for such behavior?

cpq avatar May 27 '22 19:05 cpq

Thanks - that's very interesting! After a brief scan of your code, it looks like you're checking revents for POLLIN and POLLOUT, but the manpage says that POLLERR, POLLHUP, or POLLNVAL could be returned before the timeout. Could you dump the value of revents in these scenarios?
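
Something along these lines - a sketch, assuming the fds array and its count nfds from your poll loop:

// Log every revents flag, not just POLLIN/POLLOUT, to catch error states
for (int i = 0; i < nfds; i++) {
  short re = fds[i].revents;
  if (re == 0) continue;
  printf("fd=%d revents=0x%x%s%s%s%s%s\n", fds[i].fd, (unsigned) re,
         (re & POLLIN) ? " IN" : "", (re & POLLOUT) ? " OUT" : "",
         (re & POLLERR) ? " ERR" : "", (re & POLLHUP) ? " HUP" : "",
         (re & POLLNVAL) ? " NVAL" : "");
}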

chkimes avatar May 27 '22 20:05 chkimes

@chkimes, I have to apologise.

The issue we saw was due to a bug in the app, a regression after a switch from select() to poll(). The root cause was in the way the list of pollfds was prepared:

  for (struct mg_connection *c = mgr->conns; c != NULL; c = c->next, i++) {
    if (c->is_closing || c->is_resolving || FD(c) == INVALID_SOCKET) {
      // Socket not valid, ignore                        <<<<<----- SOME FDs in the array are ignored
    } else {
      fds[i].fd = FD(c);

That loop populated the fds[] array, but it skipped some fds, leaving entries like fds[0].fd == 0, which poll() reported as ready, thus falling into a busy loop. The fix was to avoid gaps in the fds[] array and keep only those fds that must be tested, as sketched below.
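
A sketch of the shape of the fix (names like slot_to_conn are illustrative, not the actual mongoose patch):

  nfds_t n = 0;
  for (struct mg_connection *c = mgr->conns; c != NULL; c = c->next) {
    if (c->is_closing || c->is_resolving || FD(c) == INVALID_SOCKET) {
      continue;  // skip invalid sockets entirely instead of leaving a hole
    }
    fds[n].fd = FD(c);
    fds[n].events = POLLIN;  // add POLLOUT when there is pending output
    fds[n].revents = 0;
    slot_to_conn[n] = c;     // remember which connection each slot serves
    n++;
  }
  poll(fds, n, timeout_ms);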

Now I hope the only issue left is the intermittent macOS failures, but I guess we can live with that.

Apologies again, and thanks for your support and attention.

cpq avatar May 27 '22 21:05 cpq

Glad we were able to get to a resolution on Linux!

Is the behavior similar in the macOS failures? What kind of intermittent failure rate are you seeing there?

chkimes avatar May 27 '22 21:05 chkimes

It's about 50/50. Now I am throwing printf()s in there to see more precisely what happens - I'll report soon.

cpq avatar May 27 '22 21:05 cpq

Interesting.

https://github.com/cesanta/mongoose/runs/6632199560?check_suite_focus=true

time.windows.com DNS-resolved and SNTP-ed successfully. Then time.google.com DNS-resolved successfully, but failed to SNTP. In this case, the poll() loop shows a correct time increment. Can we do the tcpdump trick on macOS too?

1810774bec4 3 net.c:183:mg_connect      1 -1 udp://time.google.com:123
1810774bec4 3 net.c:183:mg_connect      2 -1 udp://8.8.8.8:53
1810774bec4 3 sock.c:167:mg_send        2 0x3 0:0 33 err 2 (No such file or directory)
1810774bec4 2 sock.c:131:iolog          
-- 2 0.0.0.0:55529 -> 8.8.8.8:53  33
0000   00 01 01 00 00 01 00 00 00 00 00 00 04 74 69 6d   .............tim
0010   65 06 67 6f 6f 67 6c 65 03 63 6f 6d 00 00 01 00   e.google.com....
0020   01                                                .               
1810774bec7 3 sock.c:501:mg_iotest      n=1, ms=50, errno=2
1810774bec7 3 sock.c:292:read_conn      2 0x3 0:0 97 err 2 (No such file or directory)
1810774bec7 2 sock.c:131:iolog          
-- 2 0.0.0.0:55529 <- 8.8.8.8:53  97
0000   00 01 81 80 00 01 00 04 00 00 00 00 04 74 69 6d   .............tim
0010   65 06 67 6f 6f 67 6c 65 03 63 6f 6d 00 00 01 00   e.google.com....
0020   01 c0 0c 00 01 00 01 00 00 38 40 00 04 d8 ef 23   .........8@....#
0030   08 c0 0c 00 01 00 01 00 00 38 40 00 04 d8 ef 23   .........8@....#
0040   04 c0 0c 00 01 00 01 00 00 38 40 00 04 d8 ef 23   .........8@....#
0050   0c c0 0c 00 01 00 01 00 00 38 40 00 04 d8 ef 23   .........8@....#
0060   00                                                .               
1810774bec7 3 dns.c:167:dns_cb          1 time.google.com is 216.239.35.8
1810774bec7 3 sock.c:167:mg_send        1 0x5 0:0 48 err 2 (No such file or directory)
1810774bec7 2 sock.c:131:iolog          
-- 1 0.0.0.0:56734 -> 216.239.35.8:123  48
0000   23 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   #...............
0010   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
0020   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
1810774bf65 3 sock.c:501:mg_iotest      n=2, ms=50, errno=2
1810774bffb 3 sock.c:501:mg_iotest      n=2, ms=50, errno=2
1810774c095 3 sock.c:501:mg_iotest      n=2, ms=50, errno=2
1810774c0fb 3 sock.c:501:mg_iotest      n=2, ms=50, errno=2
1810774c14b 3 sock.c:501:mg_iotest      n=2, ms=50, errno=2
1810774c1d1 3 sock.c:501:mg_iotest      n=2, ms=50, errno=2
1810774c276 3 sock.c:501:mg_iotest      n=2, ms=50, errno=2
1810774c2b0 3 sock.c:501:mg_iotest      n=2, ms=50, errno=2
1810774c33f 3 sock.c:501:mg_iotest      n=2, ms=50, errno=2
1810774c403 3 sock.c:501:mg_iotest      n=2, ms=50, errno=2
1810774c451 3 sock.c:501:mg_iotest      n=2, ms=50, errno=2
1810774c4b6 3 sock.c:501:mg_iotest      n=2, ms=50, errno=2
1810774c512 3 sock.c:501:mg_iotest      n=2, ms=50, errno=2

cpq avatar May 27 '22 21:05 cpq

Yep! I believe the same tcpdump command should work without changes on macOS.

chkimes avatar May 27 '22 21:05 chkimes

Got the capture. https://github.com/cesanta/mongoose/runs/6632447540?check_suite_focus=true

The MS site worked, Google did not. The request was sent, but no response came back:

(screenshot: packet capture showing the SNTP request to time.google.com with no reply)

cpq avatar May 27 '22 22:05 cpq

Thanks for that! That packet shows in the trace at least, so we can see the client is trying to send it. Understanding how far the packet got or whether a response packet was sent will, unfortunately, be a pretty difficult task.

I will note that macOS is quite different from Linux/Windows from a virtualization and networking standpoint; however, it's not different in a way that I would expect to behave differently for different endpoints/ports.

Since I don't have an immediate suspect, I think our best path forward is to generate some data and see if we can find patterns. I would ask you to run tcpdump in your tests on macOS for a while, and we can use that data to correlate it with the failures. I can combine that with some of our backend telemetry to get things like what region a job ran in or what kind of hardware it was on.

In order to avoid managing a bunch of file uploads, we can instead just spit the output into the logs:

sudo tcpdump -i any -nn -vvv port 123 &

chkimes avatar May 27 '22 22:05 chkimes

Thank you @chkimes, much appreciated. I'll keep observing. Shall I close this bug until I get more data, or keep it open?

cpq avatar May 27 '22 22:05 cpq

Let's keep it open, since the macOS issue is still ongoing.

chkimes avatar May 27 '22 22:05 chkimes

So far, here are the stats. The lines with "ok" succeeded; those without it failed. I don't see any pattern yet. Maybe the Google server itself is failing some queries? The failure rate is very high, though.

-- 1 0.0.0.0:52173 -> 216.239.35.0:123 48 |  
-- 1 0.0.0.0:50867 -> 216.239.35.12:123 48 |  
-- 1 0.0.0.0:52752 -> 216.239.35.0:123 48 | ok
-- 1 0.0.0.0:54102 -> 216.239.35.4:123 48 |  
-- 1 0.0.0.0:59659 -> 216.239.35.0:123 48 | ok
-- 1 0.0.0.0:50732 -> 216.239.35.12:123 48 |  
-- 1 0.0.0.0:62883 -> 216.239.35.0:123 48 |  
-- 1 0.0.0.0:49551 -> 216.239.35.8:123 48 | ok


cpq avatar May 30 '22 18:05 cpq

Thanks for generating the data @cpq! I investigated the runs from backend telemetry and found a pretty clear correlation with the datacenters the jobs ran in. I'm going to spend some time investigating what the differences are between them.

chkimes avatar May 31 '22 16:05 chkimes

@chkimes thank you! I guess a command-line sntp tool might be used for testing. Interesting that DNS works fine, and SNTP to the Microsoft site works too. It's just SNTP to time.google.com that is flaky.

cpq avatar May 31 '22 17:05 cpq

An update: it actually seems that only time.windows.com works from this particular DC. Other servers like time.nist.gov or clock.isc.org work from my home computer but not from macOS in that DC. Linux/Windows work fine, though. Also strange is that curl time.google.com works as you'd expect from macOS.

I haven't yet identified the root cause, but it seems like SNTP specifically is being blocked when the destination is outside of the MS network. I'm waiting on followup from some network engineers to learn more.

chkimes avatar May 31 '22 20:05 chkimes

Update: I'm still waiting on investigation from the network team.

chkimes avatar Jun 03 '22 17:06 chkimes

Thank you for the update @chkimes !

cpq avatar Jun 03 '22 17:06 cpq

I got confirmation that there is a network ACL blocking port 123 that is unexpectedly being applied to our traffic. I'm still working with the network team on a path forward (so that the rule applies to what is intended, but excludes Mac VM traffic), but we at least understand the root cause.

chkimes avatar Jun 06 '22 14:06 chkimes

@chkimes thank you for your effort, much appreciated!

cpq avatar Jun 07 '22 02:06 cpq

FYI - recently I have been seeing a much higher rate of failures, now including Linux environments. E.g. this invocation required 2 re-runs to clear the failures: https://github.com/cesanta/mongoose/actions/runs/2459995427/attempts/1

cpq avatar Jun 08 '22 12:06 cpq

The network team will be working on a change to isolate Mac traffic from this network rule. Not clear on rollout date, but I will update when I hear about it.

Linux environments should not be impacted by this particular rule, so we may have some additional investigation to do since the root cause is likely not the same. FYI I will be out on leave starting tomorrow until Tuesday.

chkimes avatar Jun 08 '22 17:06 chkimes