balena-raspberrypi
RPi4 experiencing network stalls
(creating this on behalf of @hedss):
**Note**: this stall was also reproduced on an RPi4 running Raspbian, so it may be related to the kernel rather than the OS.
Networking: SCP of a 1GB file from a remote server.
Pi4:
file.txt 100% 1024MB 4.8MB/s 03:32
real 3m33.283s
user 0m55.190s
sys 0m44.288s
Pi3:
file.txt 100% 1024MB 7.0MB/s 02:26
real 2m27.182s
user 0m50.461s
sys 0m37.936s
These were run several times, and every run on both devices confirmed that these are representative download times.
Whilst actually downloading, the speed is shown as around 7.0MB/s. However, there are noticeable long ‘pauses’ where download progress just stalls, before picking up again, which the Pi3 does not exhibit. tcpdump shows the same, that data flow just ‘stops’ for several seconds every once in a while.
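As a rough cross-check of the numbers above (a sketch, not part of the original measurements): the average rate scp reports is just file size over wall-clock time, and if we assume the Pi4 streams at a steady 7.0MB/s whenever it is not stalled, the leftover wall-clock time estimates how long the transfer sat idle. The helper names `avg_rate` and `stalled_secs` are made up for illustration.

```shell
# avg_rate SIZE_MB SECONDS
# Average throughput in MB/s over the whole transfer.
avg_rate() {
    awk -v size="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", size / secs }'
}

# stalled_secs SIZE_MB SECONDS RATE_MB_S
# Wall-clock seconds left over if the data itself moved at RATE_MB_S.
# Assumption: the transfer runs at a constant rate whenever it is not stalled.
stalled_secs() {
    awk -v size="$1" -v secs="$2" -v rate="$3" \
        'BEGIN { printf "%.1f\n", secs - size / rate }'
}

avg_rate 1024 213.283          # Pi4, prints 4.8
avg_rate 1024 147.182          # Pi3, prints 7.0
stalled_secs 1024 213.283 7.0  # Pi4, prints 67.0
```

On these figures the Pi4 would have spent roughly a minute of the 3m33s transfer making no progress, which matches the impression of several multi-second pauses.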
SCPing the file locally, with a hop of one (from a local machine on the same network), shows a speed increase as expected, but also shows the same pauses on the Pi4 and not on the Pi3. The Pi4 averages 9.6MB/s, but whilst transferring it shows a regular speed of around 19MB/s; only the stalls bring the average down.
The Pi3 averages 9.0MB/s consistently, but then it has slower networking, so this is not a surprise.
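One way to put numbers on the pauses without reading tcpdump output is to sample the size of the destination file once a second while the transfer runs, then look for stretches where it did not grow. The following is a minimal sketch under that assumption; `find_stalls` and the sample format (`<epoch seconds> <bytes>` per line) are hypothetical and were not used in the tests above.

```shell
# find_stalls MIN_SECS
# Reads "<epoch_seconds> <bytes_received>" samples on stdin and prints
# "<start_time> <duration>" for every stretch of at least MIN_SECS seconds
# during which the byte count did not increase.
find_stalls() {
    awk -v min="$1" '
        NR == 1 { grow_t = $1 + 0; bytes = $2 + 0; next }
        { now = $1 + 0 }
        $2 + 0 > bytes {
            # Progress resumed: report the gap since the last growth.
            if (now - grow_t >= min + 0)
                printf "%d %d\n", grow_t, now - grow_t
            grow_t = now
            bytes = $2 + 0
        }
        END {
            # A stall that is still in progress when sampling stops.
            if (NR > 1 && now - grow_t >= min + 0)
                printf "%d %d\n", grow_t, now - grow_t
        }
    '
}

# Synthetic samples: the file stops growing between t=2 and t=6.
printf '%s\n' '0 0' '1 100' '2 200' '3 200' '4 200' '5 200' '6 300' '7 400' |
    find_stalls 3   # prints: 2 4
```

Samples could be collected on the Pi with something like `while sleep 1; do echo "$(date +%s) $(stat -c %s file.txt)"; done > samples.log` (assuming a `stat` that supports `-c`, as GNU coreutils does) and then fed in with `find_stalls 3 < samples.log`.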
Media I/O: To see if this was media I/O bound (i.e. stalling on writing to the card), I ran dd to create a 1GB file on both Pis, which are using the same make and model of SD card (Sandisk Extreme Ultra). These are class 10 cards, with a minimum sustained write speed of 10MB/s.
These tests were run several times, and the average output shown below is consistent.
Pi4:
dd if=/dev/zero of=file.txt count=1024 bs=1048576
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 70.2829 s, 15.3 MB/s
Pi3:
dd if=/dev/zero of=file.txt count=1024 bs=1048576
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 116.628 s, 9.2 MB/s
As you can see, the Pi3 is pretty close to the minimum sustained, whilst the Pi4 exceeds it significantly. This proves, given the SCP speeds, that it’s most definitely not media I/O bound and the issue does seem to be networking, either driver or kernel.
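One caveat with the dd runs: without a sync, dd can report completion while part of the data is still sitting in the page cache, which flatters the write speed. A hedged variant that flushes to the card before dd prints its timing is sketched below; `write_test` is a hypothetical wrapper, and `conv=fsync` assumes a dd that supports it (GNU coreutils does).

```shell
# write_test FILE COUNT_MB
# Writes COUNT_MB blocks of 1 MiB of zeros to FILE. conv=fsync makes dd
# flush the file to the medium before it exits, so the reported rate
# includes the time needed to actually reach the SD card.
write_test() {
    dd if=/dev/zero of="$1" bs=1048576 count="$2" conv=fsync
}
```

To match the test above this would be run as `write_test file.txt 1024` on each Pi; if the Pi4 still comfortably exceeds its SCP average with the flush included, the conclusion about media I/O holds.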
Both units are on the same switch. The Pi3 was running 2.38, the Pi4 2.41.
During the course of the I/O tests, I saw one instance where balenaEngine on the Pi4 aborted due to a 6-minute watchdog timeout (although the containers were functional throughout). It did not restart on its own and had to be restarted manually.
[xginn8] This issue has attached support thread https://jel.ly.fish/#/support-thread~463c3599-4177-4e47-8a4b-e87f4d0dbd6d
@xginn8 That's wonderful, thank you ever so much for raising this for me, I was a bit swamped! This issue is correct, and as exactly the same stalling is seen under Raspbian, we can assume this is a kernel issue.
Hi! Any progress on debugging this issue? It blocks us from using a Pi4, and we're eager to get started using it. :)
@rjhuijsman not yet, we have been working on other things, but we'll probably look at this in about a week's time, so stay tuned.
Thank you! Looking forward to that!
@rjhuijsman I'd recommend also to report this to Raspbian as we have been able to reproduce it on the foundation's official distribution as well.
@agherzan oh wow. That's a good idea, and it sounds like you have a much better grip on what's going on than I do - would you like to do the honors? :)
Probably @floion would be the man to do it, mainly because I'm not directly engaged in this project anymore. But even from a user perspective you can try to reproduce the tests in the description of this issue and summarize them in a report on the forums or the rpi linux kernel repo.
Hi Balena folks! Any progress on this issue?
Alternatively, any workarounds on offer? Are other people not running into this because they have 64-bit containers only?
Hi. Not yet. We'll look at this in the next few days.
Hey @floion and friends! Any discoveries on this recently?
Hi @rjhuijsman! I did some investigation with the Pi4 connected to my PC over 1000Mb/s ethernet (Cat 5e cable, auto-negotiation enabled and 1000Mb/s advertised on both ends) and saw speeds of approx. 50 MB/s, dropping to 30 MB/s. I did see some stalls and checked them with valgrind. The transfer appeared to be getting stuck in some crypto code, so I passed the cipher option (-c) to scp to override the default, and didn't see any stalls afterward. The scp version (openssh/openbsd) in /etc/ssh/sshd_config seems to be v1.102 on both boards mentioned by xginn8 (Pi3 running 2.38, Pi4 running 2.41), so the only difference might be that the RPi4 is using some 64-bit libs that the v1.102 version of scp has issues with when it comes to encryption.
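For anyone who wants to repeat the cipher experiment, a minimal sketch is below. The exact cipher used in the test above is not recorded here, so the cipher name is left as a parameter; `scp_with_cipher` is a hypothetical helper, and `ssh -Q cipher` assumes an OpenSSH recent enough to support `-Q`.

```shell
# scp_with_cipher CIPHER SRC DEST
# Copies SRC to DEST with an explicitly chosen cipher instead of the
# negotiated default. Refuses to run if the local ssh does not list CIPHER.
scp_with_cipher() {
    cipher="$1"
    src="$2"
    dest="$3"
    # ssh -Q cipher prints one supported cipher name per line.
    if ! ssh -Q cipher | grep -qxF -- "$cipher"; then
        echo "cipher not supported locally: $cipher" >&2
        return 1
    fi
    scp -c "$cipher" "$src" "$dest"
}
```

For example, `scp_with_cipher aes128-ctr file.txt user@pi4:` (cipher, user, and host are placeholders), repeated for each cipher listed by `ssh -Q cipher`, would show whether the stalls follow a particular algorithm.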
We will look into this further and update this issue.
Kind regards
Hi @vicgal - thanks for the investigation! Any further progress in the past few weeks?
Interesting to see the crypto code as suspect! We've seen network stalls in `apt-get install` and `pip install` too; any thoughts on whether those might be through the same crypto-culprit, or whether there's a deeper underlying issue at play?
@vicgal Thanks for the detailed investigation! This originally came up because we were seeing massively delayed stalls on application downloads, so if this is crypto, then I assume the code used for these fetches is also using it. But this means it's definitely wider than just SCP, and may indeed be SSL related in general, so we do need to trace that down. Cheers!
Hi @rjhuijsman is the download speed slowness observed on wifi or on wired?
We observed the long network stalls on both wired and wireless connections.
For ease of reference, here's the original report of the issue in the support conversation that led to this thread:
It seems the RPi4 is having difficulty with its network connections.
Potential red flag: we're using a 32 bit base image, since we're dependent on a proprietary library that's not yet available in 64 bit. Specifically:
FROM balenalib/raspberrypi3-debian:stretch-build
[...]
Here's why I suspect network issues:
The speedtests (using iperf3) look normal. However, under heavy network use I consistently see downloads failing:
* `npm install` fails, throwing an error that the internet says is because of a broken network connection.
* `yarn add` reports "you seem to be having network issues"
* While the above happens I lose the `balena push <ip>` connection with an error I reported separately in a different support thread.
* When not in local mode, any Balena image downloads (mine are large, at ~1.5GB) make it to ~10% and then restart from 0%.
Surprisingly, this seems to happen both on WiFi and while using a wired connection.
I've tried this on three separate (reliable) internet connections, so that's not the issue either. Also, the RPi3 was successful on those same connections.
Eyeballing with `top` from the host OS, the CPU usage looks normal: high while compiling packages, but low during downloads while these issues happen... Both Pi3 and Pi4 get hot during this process, but not absurdly so - the Pi4 even has heatsinks and a fan.
Thanks @rjhuijsman. I have merged https://github.com/balena-os/balena-raspberrypi/pull/435, which in my tests of scp'ing a 1 GB file over wifi to the RPi4 doubled the speed. It will be available in the next release. I will notify you when that is out so that you can test again on wifi, and then we can concentrate on what is left.