Qubes loses network access on restart of `sys-net`
Qubes OS release
R4.2.1
Brief summary
Whenever sys-net is restarted, network access via WiFi or ethernet cable is no longer possible.
Steps to reproduce
Restart sys-net via Qube Manager or CLI and try to establish a network connection.
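For the CLI variant, the reproduction can be sketched as a small dom0 script. This is a minimal sketch, not an official test: it assumes `qvm-shutdown --wait`, `qvm-start`, and `qvm-run --pass-io` are available in dom0, and the ping target `1.1.1.1` is an arbitrary placeholder that is not part of the original report.

```python
import subprocess


def restart_commands(qube="sys-net"):
    """Return the dom0 commands for a shutdown followed by a fresh start."""
    return [
        ["qvm-shutdown", "--wait", qube],
        ["qvm-start", qube],
    ]


def connectivity_check_command(qube="sys-net", target="1.1.1.1"):
    """Return a dom0 command that pings `target` from inside the qube.

    The target address is an assumption; any reachable external host works.
    """
    return ["qvm-run", "--pass-io", qube, f"ping -c 3 {target}"]


def reproduce(qube="sys-net", run=subprocess.run):
    """Restart the qube, then report whether the ping succeeded."""
    for cmd in restart_commands(qube):
        run(cmd, check=True)
    result = run(connectivity_check_command(qube), check=False)
    return result.returncode == 0


if __name__ == "__main__":
    print("network OK" if reproduce() else "network lost after restart")
```

The `run` parameter exists only so the command sequence can be inspected without a Qubes system. On an affected machine, `reproduce()` would be expected to return `False`.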
Expected behavior
The previously enabled network devices should work again after sys-net has completed its startup.
Actual behavior
After restarting sys-net, for WiFi, a message is shown that the connection is established, but no access to other networks is possible. For Ethernet cable, the icon in the panel shows a red circulating dot. Neither type of interface can be started manually, and any attempt to reach an external node, such as a ping from a sys-net terminal, fails, possibly with a message like Network not available.
Rebooting Qubes restores network access.
This happens for sys-net based on Fedora-39, Fedora-40, and Debian-12. So it seems that this is not a bug in sys-net itself, but possibly in dom0 or the communication with dom0.
journalctl does not show any errors in sys-net or dom0.
This effect is specific to Qubes R4.2.1. Under Qubes R4.1.2, sys-net can be restarted without problems on the same machine. Both are running kernel 6.9.4.
On Qubes R4.2.1, Fedora-40-xfce, and kernel 6.9.2, with wifi only, I can't reproduce this.
sys-net often loses connectivity after suspend, but restarting the qube restores it.
@marmarek The above would justify the additional OpenQA unit test I suggested here in last paragraph.
For me, after suspending and restarting Qubes, the network is still running. So this seems to be something different.
This seems to be a hardware-dependent issue, because the effect occurs on an HP Elitebook, but not on other hardware - but why does it occur in Qubes R4.2.1 and not in R4.1.2?
Do you have the same PCI attach options in Qubes OS 4.1 and 4.2?
Maybe you have no-strict-reset and/or permissive option/s set only in one system?
Did the network work in sys-net after a restart in Qubes OS 4.2 before on this machine? Or is this the first time that you've installed Qubes OS 4.2 on this machine (or tried restarting sys-net)? You can also try to use the same kernel in Qubes OS 4.2 that you have in Qubes OS 4.1. Maybe it's an issue with the latest kernel.
The PCI attach options in both R4.1 and R4.2 are identical. On top of that, setting or unsetting the no-strict-reset option causes no change in the behavior in both versions.
I have been running the test installation of R4.2 for some time, but, unfortunately, I have never tried to restart sys-net previously.
Both systems are running with the same kernel 6.9.4, but going back to an earlier kernel, e.g. 6.6.25, does not change the situation.
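The comparison of PCI attach options between the two installations can be sketched as follows. This is an illustrative helper only: the device ID `dom0:00_1f.6` is a hypothetical placeholder (the real ID is shown by `qvm-pci` in dom0), and the option dictionaries have to be filled in by hand from each system.

```python
def attach_command(qube, device, options=None):
    """Build a persistent `qvm-pci attach` command with the given options."""
    cmd = ["qvm-pci", "attach", "--persistent"]
    for name, value in sorted((options or {}).items()):
        cmd += ["--option", f"{name}={value}"]
    return cmd + [qube, device]


def option_differences(first, second):
    """Return options whose values differ, as {name: (first, second)}."""
    names = sorted(set(first) | set(second))
    return {
        name: (first.get(name), second.get(name))
        for name in names
        if first.get(name) != second.get(name)
    }


if __name__ == "__main__":
    # Hypothetical values, transcribed manually from each installation.
    r41 = {"no-strict-reset": "true"}
    r42 = {"no-strict-reset": "true"}
    print(option_differences(r41, r42))
    print(" ".join(attach_command("sys-net", "dom0:00_1f.6", r42)))
```

An empty result from `option_differences` corresponds to the situation reported here, where both releases use identical options.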
Did you check /var/log/xen/console/guest-sys-net-dm.log in dom0? Maybe there will be some info.
@alimirjamali:
The above would justify the additional OpenQA unit test I suggested here in last paragraph.
I suggest opening a separate issue for adding such tests.
I did not find anything suspicious in /var/log/xen/console/guest-sys-net-dm.log. The logs for both the working and the faulty start look the same, without any serious error.
The problems seem to be caused by a PCI device, "Intel Ethernet Connection (4) I219-V". If this device is attached to sys-net, all network access is blocked on restart of the VM, even if other network devices are attached. Detaching the Intel Ethernet controller from sys-net allows the VM to be restarted and preserves the network connection via the other network devices.
Since all network devices, including the Intel controller, work after booting Qubes, I suppose that a reset of the Intel controller somehow ruins its functionality and blocks the other network devices. Configuring no-strict-reset, however, has no effect on this behavior.
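The detach workaround described above can be sketched as a command sequence for dom0. This is a sketch under assumptions: the device ID passed in is a placeholder that must be replaced with the controller's real ID from `qvm-pci`, and the qube is shut down first on the assumption that a PCI assignment should only be changed while the qube is halted.

```python
def workaround_commands(qube, device):
    """Return dom0 commands that detach `device` and bring `qube` back up."""
    return [
        ["qvm-shutdown", "--wait", qube],
        ["qvm-pci", "detach", qube, device],
        ["qvm-start", qube],
    ]


if __name__ == "__main__":
    # "dom0:00_1f.6" is a hypothetical ID for the Intel Ethernet controller.
    for cmd in workaround_commands("sys-net", "dom0:00_1f.6"):
        print(" ".join(cmd))
```

The script only prints the commands so they can be reviewed before being run by hand.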
I am closing this issue since it seems to be very hardware-specific and so without general importance.
Intel has this annoying approach of needing the Wi-Fi/audio-controller firmware loaded onto the device from the OS during each hardware initialization.
Maybe...try changing the OS template used by sys-net? Or maybe update the Intel firmware package in the template used by sys-net.
I tested with Fedora-39, Fedora-40, and Debian-12 as templates for sys-net, and they all behaved the same. On the other hand, the error does not occur in Qubes R4.1.2 with Fedora-39 as a template for sys-net. So perhaps it might be more of a problem with dom0, which has moved from Fedora-32 to Fedora-37. But I kept dom0 fully updated. So what???
I am re-opening this issue because further tests, inspired by investigations of D. Martin - thank you for this support! - have shown that we have a version-dependent incompatibility between the current kernel and dom0, which may be a symptom of a deeper problem.
The error occurs whenever sys-net is restarted under Qubes R4.2.x, if Qubes was booted with a current kernel (6.9.7 in my tests). The kernel used in sys-net itself is of no importance: the error occurs whether sys-net is based on Fedora-39, Fedora-40, or Debian-12 as template, even when using an old kernel like 6.1.96 in these templates and in sys-net itself.
If Qubes R4.2.x is booted with an old kernel (6.1.96 in my tests), sys-net can be restarted without error even if it is based on a current kernel like 6.9.7.
Qubes R4.1.2 does not show this error and can be used with the current kernel 6.9.7 in dom0 and/or sys-net.
So there seems to be some problem that Qubes R4.2.x has with newer kernels and which did not exist in R4.1.2. As this is a regression, it should, in my opinion, be investigated.
Some more tests showed an even more complicated and unreliable behavior: If sys-net is shut down and then started again, the behavior is as described above. Performing a restart of sys-net from the Qube Manager, however, leads to the error under Qubes R4.2.x, even if both dom0 and sys-net are based on the old kernel. When the error has occurred once, all is lost, and network access can only be re-established by booting Qubes.
Similar issue reported here: https://forum.qubes-os.org/t/networkmanager-issues-with-wired-connection-since-upgrade-to-qubes-4-2/27635
This seems to be caused by the kernel used by sys-net, independent of the kernel used in dom0. Although my R4.1.2 installation uses kernel 6.9.7 in dom0, the kernel version of sys-net is 6.1.75, and this combination is (mainly) working in Qubes R4.2.x.
So I wonder if this issue is caused by the kernel version used in sys-net and is independent of the kernel version in dom0. According to the investigations of D. Martin, something seems to have changed between kernel versions 6.6.21 and 6.6.29, which is where the behavior of sys-net goes from working to faulty, and, unfortunately, R4.2 comes with the newer, faulty VM kernels.
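If the 6.6.21 to 6.6.29 range is taken as the boundary between working and faulty VM kernels, the remaining candidates could be narrowed down by bisection. The helper below is a hypothetical sketch: it assumes 6.6.21 is known good and 6.6.29 is known bad, and that the intermediate patch releases can actually be installed as VM kernels, which has not been verified here.

```python
def parse_kernel(version):
    """Turn '6.6.29' or '6.6.29-1.qubes.fc37' into (6, 6, 29)."""
    base = version.split("-", 1)[0]
    return tuple(int(part) for part in base.split("."))


def next_candidate(good, bad):
    """Return the patch release halfway between `good` and `bad`.

    Returns None when no untested release lies between them.
    """
    g, b = parse_kernel(good), parse_kernel(bad)
    if g[:2] != b[:2]:
        raise ValueError("both versions must belong to the same kernel series")
    if b[2] - g[2] <= 1:
        return None
    middle = (g[2] + b[2]) // 2
    return f"{g[0]}.{g[1]}.{middle}"


if __name__ == "__main__":
    print(next_candidate("6.6.21", "6.6.29"))
```

After each test, the candidate replaces either the good or the bad end of the range, depending on whether sys-net survived a restart with that kernel.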
Should this now be regarded as an upstream problem and thus be closed, especially as it is hardware-dependent?
I don't know, but I can at least add the waiting for upstream label.
As yeriill found out, removing the ethernet patch cable and re-attaching it after a few seconds restores the functionality of sys-net. One more hint that this is a hardware issue of this particular controller.
Closing this issue as the last update for dom0 fixed this bug: sys-net can now be restarted without losing network access, even if the Intel ethernet controller stays connected. The responsible change might be the upgrade of Linux firmware to the version of August 11, 2024.
I am reopening the issue because it has reappeared in a new installation of Qubes R4.2.3-rc1, while the installation of R4.2 upgraded to R4.2.3-rc1 is still working correctly.
Both systems use the Linux firmware dated August 11.
The only differences I see are that the network icon in the taskbar is grey in the working system, but red in the new, defective installation, and that the correctly working system has sys-net based on fedora-40 whereas the new, faulty version is based on fedora-40-xfce.
Switching sys-net in the new, faulty installation from fedora-40-xfce to fedora-40 does not fix the error, and the icon stays red.
The problem is fixed in the installer of Qubes R4.2.3. A fully updated installation of this version now works correctly. So I am closing the issue again. :+1: