
Cannot access coredevice across subnets after DHCP feature merge

Open airwoodix opened this issue 3 years ago • 13 comments

Bug Report

One-Line Summary

For gateware/firmware built against an ARTIQ-7 revision at or after c60de48 (smoltcp update and DHCP feature), the coredevice cannot be accessed across subnets.

This is a regression compared to gateware built against 06ad76b.

Issue Details

The coredevice is configured with a static IP XX.YY.0.137. With gateware/firmware built against 06ad76b, pings from XX.YY.0.5 (frames 1 to 4) as well as from XX.YY.2.5 (frames 5 to 8) are successful:

No.     Time           Source                Destination           Protocol Length Info
      1 0.000000       XX.YY.0.5             XX.YY.0.137          ICMP     98     Echo (ping) request  id=0x9f12, seq=0/0, ttl=64 (reply in 2)

No.     Time           Source                Destination           Protocol Length Info
      2 0.000260       XX.YY.0.137           XX.YY.0.5            ICMP     98     Echo (ping) reply    id=0x9f12, seq=0/0, ttl=64 (request in 1)

No.     Time           Source                Destination           Protocol Length Info
      3 1.014239       XX.YY.0.5             XX.YY.0.137          ICMP     98     Echo (ping) request  id=0x9f12, seq=1/256, ttl=64 (reply in 4)

No.     Time           Source                Destination           Protocol Length Info
      4 1.014473       XX.YY.0.137           XX.YY.0.5            ICMP     98     Echo (ping) reply    id=0x9f12, seq=1/256, ttl=64 (request in 3)

No.     Time           Source                Destination           Protocol Length Info
      5 5.509846       XX.YY.2.5             XX.YY.0.137          ICMP     98     Echo (ping) request  id=0xe719, seq=0/0, ttl=64 (reply in 6)

No.     Time           Source                Destination           Protocol Length Info
      6 5.510131       XX.YY.0.137           XX.YY.2.5            ICMP     98     Echo (ping) reply    id=0xe719, seq=0/0, ttl=64 (request in 5)

No.     Time           Source                Destination           Protocol Length Info
      7 6.522022       XX.YY.2.5             XX.YY.0.137          ICMP     98     Echo (ping) request  id=0xe719, seq=1/256, ttl=64 (reply in 8)

No.     Time           Source                Destination           Protocol Length Info
      8 6.522258       XX.YY.0.137           XX.YY.2.5            ICMP     98     Echo (ping) reply    id=0xe719, seq=1/256, ttl=64 (request in 7)

With gateware/firmware built against d17675e (to date, any revision after the DHCP feature), pings from the same subnet (frames 4 to 9) still succeed, with a small hiccup at the beginning (frames 1 and 2), but replies to pings from another subnet (frames 10 to 13) do not find their way back to the ping source:

No.     Time           Source                Destination           Protocol Length Info
      1 0.000000       XX.YY.0.5             XX.YY.0.137          ICMP     98     Echo (ping) request  id=0x336a, seq=0/0, ttl=64 (no response found!)

No.     Time           Source                Destination           Protocol Length Info
      2 0.000216       Microchi_aa:bb:cc     Broadcast             ARP      60     Who has XX.YY.0.5? Tell XX.YY.0.137

Frame 2: 60 bytes on wire (480 bits), 60 bytes captured (480 bits)
Ethernet II, Src: Microchi_aa:bb:cc (80:1f:12:aa:bb:cc), Dst: Broadcast (ff:ff:ff:ff:ff:ff)
Address Resolution Protocol (request)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1)
    Sender MAC address: Microchi_aa:bb:cc (80:1f:12:aa:bb:cc)
    Sender IP address:  XX.YY.0.137
    Target MAC address: Broadcast (ff:ff:ff:ff:ff:ff)
    Target IP address:  XX.YY.0.5

No.     Time           Source                Destination           Protocol Length Info
      3 0.000236                                                            42     <Ignored>

Frame 3: 42 bytes on wire (336 bits), 42 bytes captured (336 bits)
This frame is marked as ignored

No.     Time           Source                Destination           Protocol Length Info
      4 1.000802       XX.YY.0.5             XX.YY.0.137          ICMP     98     Echo (ping) request  id=0x336a, seq=1/256, ttl=64 (reply in 5)

No.     Time           Source                Destination           Protocol Length Info
      5 1.001035       XX.YY.0.137           XX.YY.0.5            ICMP     98     Echo (ping) reply    id=0x336a, seq=1/256, ttl=64 (request in 4)

No.     Time           Source                Destination           Protocol Length Info
      6 4.704100       XX.YY.0.5             XX.YY.0.137          ICMP     98     Echo (ping) request  id=0x0d73, seq=0/0, ttl=64 (reply in 7)

No.     Time           Source                Destination           Protocol Length Info
      7 4.704337       XX.YY.0.137           XX.YY.0.5            ICMP     98     Echo (ping) reply    id=0x0d73, seq=0/0, ttl=64 (request in 6)

No.     Time           Source                Destination           Protocol Length Info
      8 5.706916       XX.YY.0.5             XX.YY.0.137          ICMP     98     Echo (ping) request  id=0x0d73, seq=1/256, ttl=64 (reply in 9)

No.     Time           Source                Destination           Protocol Length Info
      9 5.707163       XX.YY.0.137           XX.YY.0.5            ICMP     98     Echo (ping) reply    id=0x0d73, seq=1/256, ttl=64 (request in 8)

No.     Time           Source                Destination           Protocol Length Info
     10 15.979134      XX.YY.2.5             XX.YY.0.137          ICMP     98     Echo (ping) request  id=0x9613, seq=0/0, ttl=64 (no response found!)

No.     Time           Source                Destination           Protocol Length Info
     11 15.979358      Microchi_aa:bb:cc     Broadcast             ARP      60     Who has  XX.YY.2.5? Tell  XX.YY.0.137

Frame 11: 60 bytes on wire (480 bits), 60 bytes captured (480 bits)
Ethernet II, Src: Microchi_aa:bb:cc (80:1f:12:aa:bb:cc), Dst: Broadcast (ff:ff:ff:ff:ff:ff)
Address Resolution Protocol (request)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1)
    Sender MAC address: Microchi_aa:bb:cc (80:1f:12:aa:bb:cc)
    Sender IP address:  XX.YY.0.137
    Target MAC address: Broadcast (ff:ff:ff:ff:ff:ff)
    Target IP address:  XX.YY.2.5

No.     Time           Source                Destination           Protocol Length Info
     12 16.995054      XX.YY.2.5             XX.YY.0.137          ICMP     98     Echo (ping) request  id=0x9613, seq=1/256, ttl=64 (no response found!)

No.     Time           Source                Destination           Protocol Length Info
     13 16.995268      Microchi_aa:bb:cc     Broadcast             ARP      60     Who has XX.YY.2.5? Tell XX.YY.0.137

Frame 13: 60 bytes on wire (480 bits), 60 bytes captured (480 bits)
Ethernet II, Src: Microchi_aa:bb:cc (80:1f:12:aa:bb:cc), Dst: Broadcast (ff:ff:ff:ff:ff:ff)
Address Resolution Protocol (request)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1)
    Sender MAC address: Microchi_aa:bb:cc (80:1f:12:aa:bb:cc)
    Sender IP address: XX.YY.0.137
    Target MAC address: Broadcast (ff:ff:ff:ff:ff:ff)
    Target IP address: XX.YY.2.5

The firewall configuration is unchanged between the two settings above. Only the coredevice gateware/firmware was updated. The faulty behavior persists when doing TCP requests instead of ICMP ones.

Given the captured ARP requests, it seems that the gateway is not configured properly on the coredevice. This faulty behavior is unchanged when using ip=use_dhcp.

If this is the issue, is there a way to set the gateway in the static IP case? For the DHCP case, I would expect the gateway to be broadcast by the DHCP server and set accordingly.

Your System (omit irrelevant parts)

  • Operating System: n/a
  • ARTIQ version: n/a
  • Version of the gateware and runtime loaded in the core device: 7.0.06ad76b.beta and 7.0.d17675e.beta
  • Hardware involved: Kasli v1.1

airwoodix avatar Jul 06 '22 11:07 airwoodix

@mbirtwell

sbourdeauducq avatar Jul 06 '22 12:07 sbourdeauducq

I'll try and take a look at this today or tomorrow.

mbirtwell avatar Jul 07 '22 09:07 mbirtwell

Reverted for release-7

sbourdeauducq avatar Jul 08 '22 09:07 sbourdeauducq

So it seems like this wasn't really intended to be supported by smoltcp. smoltcp used to have a feature where it would fill the neighbour cache from any packet that it saw to try and avoid unnecessary ARPs. But that was removed because it caused problems if there were certain buggy devices also on the network. See commit and PR.

The artiq firmware configures the IP address with 0 prefix bits, effectively claiming that we're on the same subnet as the entire internet. Coupled with the above smoltcp feature, this meant that every packet received would add an entry to the neighbour cache, even if the sender wasn't strictly speaking a neighbour. So a packet that had been routed onto a subnet from another subnet would result in a neighbour cache entry mapping the origin's IP to the router's MAC address. Again, not strictly correct, but good enough to make this work in your case.
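To make the mechanism concrete, here is a toy model in Python. It is not smoltcp code: the `ToyStack` class, the `10.20.x.x` addresses (standing in for `XX.YY.x.x`) and the router MAC are all made up for illustration, and the exact filtering smoltcp applied when learning neighbours is an assumption. It only shows why a /0 prefix combined with learning neighbours from received frames let replies reach another subnet, and why removing that learning produces the ARP requests seen in the capture.

```python
import ipaddress

# Hypothetical MAC address of the router's port on the coredevice's subnet.
ROUTER_MAC = "02:00:00:00:00:01"


class ToyStack:
    """Toy model of neighbour lookup; not the smoltcp implementation."""

    def __init__(self, local_cidr, fill_cache_on_receive):
        self.iface = ipaddress.ip_interface(local_cidr)
        self.fill_cache_on_receive = fill_cache_on_receive
        self.neighbours = {}  # IP address -> MAC address

    def receive(self, src_ip, src_mac):
        src = ipaddress.ip_address(src_ip)
        # Old behaviour (assumed): learn the sender's MAC from any received
        # frame whose source IP looks on-link. With /0, everything does.
        if self.fill_cache_on_receive and src in self.iface.network:
            self.neighbours[src] = src_mac

    def reply_action(self, dst_ip):
        dst = ipaddress.ip_address(dst_ip)
        if dst in self.neighbours:
            return f"send to {self.neighbours[dst]}"
        if dst in self.iface.network:
            return f"ARP who-has {dst}"
        return "no route"


# A routed ping from the other subnet arrives with the router's MAC as source.
old = ToyStack("10.20.0.137/0", fill_cache_on_receive=True)
old.receive("10.20.2.5", ROUTER_MAC)
print(old.reply_action("10.20.2.5"))

new = ToyStack("10.20.0.137/0", fill_cache_on_receive=False)
new.receive("10.20.2.5", ROUTER_MAC)
print(new.reply_action("10.20.2.5"))
```

In this model the first stack sends its reply straight to the router's MAC, while the second falls back to an ARP request for an address that is not on the local segment, which nobody answers.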

So options are:

  • Ask smoltcp if we can have the automatic neighbour cache population back again. This might be possible with some extra filtering of the candidates, like requiring a unicast destination address, or only doing it for packets that are addressed to us. I'll raise an issue on smoltcp.
  • Adding default route support. This'll still break for people on upgrade if they don't set a default route, but at least they can then do that to fix it. It should be easy to have the default route set from DHCP if you're using that.
  • Stay stuck on an old smoltcp version
  • Write this off as not supported. It seems like a bit of an accident that it worked in the first place to me.

That's roughly in order of my personal preference and there's a big gap between options 2 and 3.

mbirtwell avatar Jul 08 '22 12:07 mbirtwell

> The artiq firmware configures the IP address with 0 prefix bits.

This is the issue. You should configure the smoltcp device with the correct prefix length: XX.YY.0.137/24. Then, when it wants to send packets to XX.YY.2.5, it'll see it's an out-of-subnet IP and send it to the default route instead (say, XX.YY.0.1), which knows how to route it.

If the default route is not in the ARP cache it'll find it with a Who has XX.YY.0.1? Tell XX.YY.0.137 ARP request. It should never do an ARP request for an out-of-subnet IP like the Who has XX.YY.2.5? Tell XX.YY.0.137 you're seeing now.
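The next-hop decision described here can be sketched in a few lines of Python. This is an illustration of the general rule, not smoltcp's routing code: the function name `arp_target` is hypothetical, and `10.20.x.x` stands in for `XX.YY.x.x`, with `10.20.0.1` as an assumed gateway address.

```python
import ipaddress
from typing import Optional


def arp_target(local_cidr: str, dest: str, gateway: Optional[str] = None) -> Optional[str]:
    """Return the IP address the host should resolve via ARP to reach `dest`.

    On-link destinations are resolved directly; everything else goes through
    the default gateway. None means there is no route at all.
    """
    iface = ipaddress.ip_interface(local_cidr)
    dest_ip = ipaddress.ip_address(dest)
    if dest_ip in iface.network:
        return str(dest_ip)
    return gateway


# Correct prefix and a default route: ARP for the gateway, not the remote host.
print(arp_target("10.20.0.137/24", "10.20.2.5", gateway="10.20.0.1"))

# Same-subnet destination: ARP for the host itself.
print(arp_target("10.20.0.137/24", "10.20.0.5", gateway="10.20.0.1"))

# Prefix length 0: the remote host looks on-link, so it is ARPed directly.
print(arp_target("10.20.0.137/0", "10.20.2.5"))
```

The last call reproduces the situation in the capture: with 0 prefix bits the gateway is never consulted, even if one were configured.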

Dirbaio avatar Jul 08 '22 14:07 Dirbaio

Except that there is no default route, nor currently any support for setting one.

mbirtwell avatar Jul 08 '22 15:07 mbirtwell

Then you should add it! :)

Dirbaio avatar Jul 08 '22 16:07 Dirbaio

Thanks for the investigation! This looks promising.

It is quite a fortunate accident that cross-subnet access used to work; it is now quite central to our workflow. @mbirtwell do you have capacity to work on a resolution?

airwoodix avatar Jul 11 '22 08:07 airwoodix

@airwoodix You can use release 7 in the meantime.

sbourdeauducq avatar Jul 11 '22 08:07 sbourdeauducq

The behavior was not at all an accident and smoltcp was explicitly written to support this. It just turned out to be too fragile. @airwoodix and @mbirtwell what are your plans here?

jordens avatar Jul 26 '22 13:07 jordens

I think doing the work to add the default gateway configuration is the best way forwards. I don't mind doing that, but it's not likely to be soon.

mbirtwell avatar Jul 26 '22 14:07 mbirtwell

I also don't really have capacity to work on this at the moment. We track release-7 for now to work around it.

airwoodix avatar Jul 29 '22 09:07 airwoodix

I've managed to make a start on this: https://github.com/m-labs/artiq/commit/d0fe2c5cc863f50c542e165f3c9538a2e8669db7, but it's not tested yet.

mbirtwell avatar Jul 31 '22 15:07 mbirtwell