artiq
Cannot access coredevice across subnets after DHCP feature merge
Bug Report
One-Line Summary
For gateware/firmware built against an ARTIQ-7 revision at or after c60de48 (smoltcp update and DHCP feature), the coredevice cannot be accessed across subnets.
This is a regression compared to gateware built against 06ad76b.
Issue Details
The coredevice is configured with a static IP XX.YY.0.137. With gateware/firmware built against 06ad76b, pings from XX.YY.0.5 (frames 1 to 4) as well as from XX.YY.2.5 (frames 5 to 8) are successful:
No. Time Source Destination Protocol Length Info
1 0.000000 XX.YY.0.5 XX.YY.0.137 ICMP 98 Echo (ping) request id=0x9f12, seq=0/0, ttl=64 (reply in 2)
2 0.000260 XX.YY.0.137 XX.YY.0.5 ICMP 98 Echo (ping) reply id=0x9f12, seq=0/0, ttl=64 (request in 1)
3 1.014239 XX.YY.0.5 XX.YY.0.137 ICMP 98 Echo (ping) request id=0x9f12, seq=1/256, ttl=64 (reply in 4)
4 1.014473 XX.YY.0.137 XX.YY.0.5 ICMP 98 Echo (ping) reply id=0x9f12, seq=1/256, ttl=64 (request in 3)
5 5.509846 XX.YY.2.5 XX.YY.0.137 ICMP 98 Echo (ping) request id=0xe719, seq=0/0, ttl=64 (reply in 6)
6 5.510131 XX.YY.0.137 XX.YY.2.5 ICMP 98 Echo (ping) reply id=0xe719, seq=0/0, ttl=64 (request in 5)
7 6.522022 XX.YY.2.5 XX.YY.0.137 ICMP 98 Echo (ping) request id=0xe719, seq=1/256, ttl=64 (reply in 8)
8 6.522258 XX.YY.0.137 XX.YY.2.5 ICMP 98 Echo (ping) reply id=0xe719, seq=1/256, ttl=64 (request in 7)
With gateware/firmware built against d17675e (to this date, any revision after the DHCP feature), pings from the same subnet (frames 4 to 9) still succeed, with a small hiccup in the beginning (frames 1 and 2), but pings from another subnet (frames 10 to 13) do not find their way back to the ping source:
No. Time Source Destination Protocol Length Info
1 0.000000 XX.YY.0.5 XX.YY.0.137 ICMP 98 Echo (ping) request id=0x336a, seq=0/0, ttl=64 (no response found!)
2 0.000216 Microchi_aa:bb:cc Broadcast ARP 60 Who has XX.YY.0.5? Tell XX.YY.0.137
Frame 2: 60 bytes on wire (480 bits), 60 bytes captured (480 bits)
Ethernet II, Src: Microchi_aa:bb:cc (80:1f:12:aa:bb:cc), Dst: Broadcast (ff:ff:ff:ff:ff:ff)
Address Resolution Protocol (request)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1)
    Sender MAC address: Microchi_aa:bb:cc (80:1f:12:aa:bb:cc)
    Sender IP address: XX.YY.0.137
    Target MAC address: Broadcast (ff:ff:ff:ff:ff:ff)
    Target IP address: XX.YY.0.5
3 0.000236 42 <Ignored>
Frame 3: 42 bytes on wire (336 bits), 42 bytes captured (336 bits)
This frame is marked as ignored
4 1.000802 XX.YY.0.5 XX.YY.0.137 ICMP 98 Echo (ping) request id=0x336a, seq=1/256, ttl=64 (reply in 5)
5 1.001035 XX.YY.0.137 XX.YY.0.5 ICMP 98 Echo (ping) reply id=0x336a, seq=1/256, ttl=64 (request in 4)
6 4.704100 XX.YY.0.5 XX.YY.0.137 ICMP 98 Echo (ping) request id=0x0d73, seq=0/0, ttl=64 (reply in 7)
7 4.704337 XX.YY.0.137 XX.YY.0.5 ICMP 98 Echo (ping) reply id=0x0d73, seq=0/0, ttl=64 (request in 6)
8 5.706916 XX.YY.0.5 XX.YY.0.137 ICMP 98 Echo (ping) request id=0x0d73, seq=1/256, ttl=64 (reply in 9)
9 5.707163 XX.YY.0.137 XX.YY.0.5 ICMP 98 Echo (ping) reply id=0x0d73, seq=1/256, ttl=64 (request in 8)
10 15.979134 XX.YY.2.5 XX.YY.0.137 ICMP 98 Echo (ping) request id=0x9613, seq=0/0, ttl=64 (no response found!)
11 15.979358 Microchi_aa:bb:cc Broadcast ARP 60 Who has XX.YY.2.5? Tell XX.YY.0.137
Frame 11: 60 bytes on wire (480 bits), 60 bytes captured (480 bits)
Ethernet II, Src: Microchi_aa:bb:cc (80:1f:12:aa:bb:cc), Dst: Broadcast (ff:ff:ff:ff:ff:ff)
Address Resolution Protocol (request)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1)
    Sender MAC address: Microchi_aa:bb:cc (80:1f:12:aa:bb:cc)
    Sender IP address: XX.YY.0.137
    Target MAC address: Broadcast (ff:ff:ff:ff:ff:ff)
    Target IP address: XX.YY.2.5
12 16.995054 XX.YY.2.5 XX.YY.0.137 ICMP 98 Echo (ping) request id=0x9613, seq=1/256, ttl=64 (no response found!)
13 16.995268 Microchi_aa:bb:cc Broadcast ARP 60 Who has XX.YY.2.5? Tell XX.YY.0.137
Frame 13: 60 bytes on wire (480 bits), 60 bytes captured (480 bits)
Ethernet II, Src: Microchi_aa:bb:cc (80:1f:12:aa:bb:cc), Dst: Broadcast (ff:ff:ff:ff:ff:ff)
Address Resolution Protocol (request)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1)
    Sender MAC address: Microchi_aa:bb:cc (80:1f:12:aa:bb:cc)
    Sender IP address: XX.YY.0.137
    Target MAC address: Broadcast (ff:ff:ff:ff:ff:ff)
    Target IP address: XX.YY.2.5
The firewall configuration is unchanged between the two settings above. Only the coredevice gateware/firmware was updated. The faulty behavior persists when doing TCP requests instead of ICMP ones.
Given the captured ARP requests, it seems that the gateway is not configured properly on the coredevice. This faulty behavior is unchanged when using ip=use_dhcp.
If this is the issue, is there a way to set the gateway in the static IP case? For the DHCP case, I would expect the gateway to be broadcast by the DHCP server and set accordingly.
Your System (omit irrelevant parts)
- Operating System: n/a
- ARTIQ version: n/a
- Version of the gateware and runtime loaded in the core device: 7.0.06ad76b.beta and 7.0.d17675e.beta
- Hardware involved: Kasli v1.1
@mbirtwell
I'll try and take a look at this today or tomorrow.
Reverted for release-7
So it seems like this wasn't really intended to be supported by smoltcp. smoltcp used to have a feature where it would fill the neighbour cache from any packet that it saw to try and avoid unnecessary ARPs. But that was removed because it caused problems if there were certain buggy devices also on the network. See commit and PR.
The artiq firmware configures the IP address with 0 prefix bits, effectively claiming that we're on the same subnet as the entire internet. Coupled with the above smoltcp feature, this meant that every packet received would add an entry to the neighbour cache, even if the sender wasn't strictly speaking a neighbour. So a packet that had been routed onto a subnet from another subnet would result in a neighbour cache entry mapping the origin's IP to the router's MAC address. Again, not strictly correct, but good enough to make this work in your case.
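To make the mechanism concrete, here is a toy model of a neighbour cache that learns from every received frame. It is only a sketch of the idea, not smoltcp's code: the function names `learn_from_frame` and `resolve`, the MAC address `02:00:00:00:00:01` and the address `10.0.2.5` (standing in for the redacted XX.YY.2.5) are all made up for illustration.

```python
def learn_from_frame(cache, src_mac, src_ip):
    """Old behaviour (simplified): remember the sender of every received frame."""
    cache[src_ip] = src_mac


def resolve(cache, dst_ip):
    """Return the cached MAC for dst_ip, or None if an ARP request is needed."""
    return cache.get(dst_ip)


# Hypothetical MAC of the router's interface on the coredevice's segment.
ROUTER_MAC = "02:00:00:00:00:01"

# A ping from another subnet arrives via the router: the IP source is the
# remote host, but the Ethernet source is the router.
cache = {}
learn_from_frame(cache, src_mac=ROUTER_MAC, src_ip="10.0.2.5")

# The reply is sent to the router's MAC, which happens to route it correctly.
print(resolve(cache, "10.0.2.5"))

# Without automatic population the cache is empty, so the reply needs an ARP
# request for 10.0.2.5, which no host on the local segment will answer.
print(resolve({}, "10.0.2.5"))
```

The second case corresponds to the unanswered "Who has XX.YY.2.5?" requests in the capture above.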
So options are:
- Ask smoltcp if we can have the automatic neighbour cache population back again. This might be possible with some extra filtering of the candidates, like requiring a unicast destination address or only doing it for packets that are addressed to us. I'll raise an issue on smoltcp.
- Adding default route support. This'll still break for people on upgrade if they don't set a default route, but at least they can then do that to fix it. It should be easy to have the default route set from DHCP if you're using that.
- Stay stuck on an old smoltcp version
- Write this off as not supported. It seems like a bit of an accident that it worked in the first place to me.
That's roughly in order of my personal preference and there's a big gap between options 2 and 3.
> The artiq firmware configures the IP address with 0 prefix bits.
This is the issue. You should configure the smoltcp device with the correct prefix length: XX.YY.0.137/24. Then, when it wants to send packets to XX.YY.2.5, it'll see it's an out-of-subnet IP and send it to the default route instead (say, XX.YY.0.1), which knows how to route it.
If the default route is not in the ARP cache it'll find it with a Who has XX.YY.0.1? Tell XX.YY.0.137 ARP request. It should never do an ARP request for an out-of-subnet IP like the Who has XX.YY.2.5? Tell XX.YY.0.137 you're seeing now.
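The next-hop decision described here can be sketched with Python's standard `ipaddress` module. This is a minimal illustration under assumptions, not smoltcp's implementation: the function name `arp_target` is hypothetical, `10.0.0.137` and `10.0.2.5` stand in for the redacted XX.YY addresses, and `10.0.0.1` is an assumed gateway.

```python
import ipaddress


def arp_target(interface, destination, gateway=None):
    """Return the IP address whose MAC must be resolved to reach `destination`.

    `interface` is the local address with prefix length, e.g. "10.0.0.137/24".
    """
    iface = ipaddress.ip_interface(interface)
    dst = ipaddress.ip_address(destination)
    if dst in iface.network:
        # On-link: resolve the destination itself.
        return dst
    if gateway is None:
        raise ValueError(f"{dst} is off-link and no default route is set")
    # Off-link: resolve the gateway instead and let it route the packet.
    return ipaddress.ip_address(gateway)


# Prefix length 0: every address looks on-link, so the device sends an ARP
# request for the remote host directly.
print(arp_target("10.0.0.137/0", "10.0.2.5"))

# Prefix length 24 plus a default route: the device resolves the gateway.
print(arp_target("10.0.0.137/24", "10.0.2.5", gateway="10.0.0.1"))
```

With a /24 prefix but no gateway, the sketch raises an error for `10.0.2.5`, which is why a correct prefix length alone is not enough without a default route.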
Except that there is no default route, nor currently any support for setting one.
Then you should add it! :)
Thanks for the investigation! This looks promising.
It is quite a fortunate accident that cross-subnet access used to work, as it is now central to our workflow. @mbirtwell do you have capacity to work on a resolution?
@airwoodix You can use release 7 in the meantime.
The behavior was not at all an accident and smoltcp was explicitly written to support this. It just turned out to be too fragile. @airwoodix and @mbirtwell what are your plans here?
I think doing the work to add the default gateway configuration is the best way forwards. I don't mind doing that, but it's not likely to be soon.
I also don't really have capacity to work on this at the moment. We track release-7 for now to work around it.
I've managed to make a start on this: https://github.com/m-labs/artiq/commit/d0fe2c5cc863f50c542e165f3c9538a2e8669db7, but it's not tested yet.