Add a next-hop to withdraw UPDATEs

Open russellkelly opened this issue 8 years ago • 20 comments

Hi,

This new issue popped up when I installed the latest exabgp 4.0. I remove a route and add a route, but for some reason there is an error and the peer is getting reset. This worked fine on a 4.0 install from 8 days or so ago...


Sat, 04 Feb 2017 23:32:12 | INFO     | 24     | configuration | . route            | '6.6.6.6' 'label' '[' '800000' ']'
Sat, 04 Feb 2017 23:32:12 | INFO     | 24     | reactor       | route removed from neighbor 172.24.74.46 local-ip 192.168.1.2 local-as 64512 peer-as 64512 router-id 192.168.1.2 family-allowed in-open : 6.6.6.6/32 label [ 800000 ] next-hop 0.0.0.0
Sat, 04 Feb 2017 23:32:12 | INFO     | 24     | reactor       | responding to http-api : done
Sat, 04 Feb 2017 23:32:12 | INFO     | 24     | reactor       | callback | removing
Sat, 04 Feb 2017 23:32:13 | INFO     | 24     | reactor       | performing dynamic route update
192.168.1.3 - - [04/Feb/2017 23:32:13] "POST / HTTP/1.1" 200 -
Sat, 04 Feb 2017 23:32:13 | INFO     | 24     | reactor       | updated peers dynamic routes successfully
Sat, 04 Feb 2017 23:32:13 | DEBUG    | 24     | timers        | peer 172.24.74.46 ASN 64512   Receive Timer 127 second(s) left
Sat, 04 Feb 2017 23:32:13 | DEBUG    | 24     | timers        | peer 172.24.74.46 ASN 64512   Send Timer 7 second(s) left
Sat, 04 Feb 2017 23:32:13 | DEBUG    | 24     | wire          | session 1 outgoing 192.168.1.2 / 172.24.74.46       SENDING  (  51) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 0033 0200 0000 1C40 0101 0040 0200 4005 0400 0000 6480 0F0B 0001 0438 C350 0106 0606 06
Sat, 04 Feb 2017 23:32:13 | INFO     | 24     | message       | Peer    172.24.74.46 ASN 64512   >> 1 UPDATE(s)
Sat, 04 Feb 2017 23:32:13 | INFO     | 24     | processes     | Command from process http-api : announce route 6.6.6.6 next-hop 2.2.2.2 label [116512 116384] 
Sat, 04 Feb 2017 23:32:13 | WARNING  | 24     | reactor       | callback | handling 'announce route 6.6.6.6 next-hop 2.2.2.2 label [ 116512 116384 ]' with announce_route
Sat, 04 Feb 2017 23:32:13 | INFO     | 24     | reactor       | callback | installing announce_route
Sat, 04 Feb 2017 23:32:13 | INFO     | 24     | reactor       | callback | running
Sat, 04 Feb 2017 23:32:13 | INFO     | 24     | configuration | . route            | '6.6.6.6' 'next-hop' '2.2.2.2' 'label' '[' '116512' '116384' ']'
Sat, 04 Feb 2017 23:32:13 | INFO     | 24     | reactor       | route added to neighbor 172.24.74.46 local-ip 192.168.1.2 local-as 64512 peer-as 64512 router-id 192.168.1.2 family-allowed in-open : 6.6.6.6/32 label [ 116512 116384 ] next-hop 2.2.2.2
Sat, 04 Feb 2017 23:32:13 | INFO     | 24     | reactor       | responding to http-api : done
Sat, 04 Feb 2017 23:32:13 | INFO     | 24     | reactor       | callback | removing
Sat, 04 Feb 2017 23:32:14 | INFO     | 24     | reactor       | performing dynamic route update
Sat, 04 Feb 2017 23:32:14 | INFO     | 24     | reactor       | updated peers dynamic routes successfully
Sat, 04 Feb 2017 23:32:14 | DEBUG    | 24     | wire          | session 1 outgoing 192.168.1.2 / 172.24.74.46       RECEIVED  (  19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 0016 03
Sat, 04 Feb 2017 23:32:14 | DEBUG    | 24     | wire          | session 1 outgoing 192.168.1.2 / 172.24.74.46       RECEIVED  (   3) 0303 03
Sat, 04 Feb 2017 23:32:14 | INFO     | 24     | message       | Peer    172.24.74.46 ASN 64512   << NOTIFICATION
**Sat, 04 Feb 2017 23:32:14 | INFO     | 24     | network       | Peer    172.24.74.46 ASN 64512   out loop, peer reset, message [notification received (3,3)] error[UPDATE message error / Missing Well-known Attribute / 0x03]
Sat, 04 Feb 2017 23:32:14 | DEBUG    | 24     | wire          | session 1 outgoing, closing connection from 192.168.1.2 to 172.24.74.46**
Sat, 04 Feb 2017 23:32:15 | DEBUG    | 24     | network       | out loop, skipping, not time yet
Sat, 04 Feb 2017 23:32:16 | DEBUG    | 24     | network       | out loop, intialising

What is causing this error? This peer is configured for mpls-nlri and unicast ipv4. It seems like the error is tied to the two wire messages at 23:32:14.

Compare with this working example from a 4.0 version installed a week or so ago


Sun, 05 Feb 2017 06:20:10 | DEBUG    | 9      | timers        | peer 172.24.74.46 ASN 64512   Receive Timer 158 second(s) left
Sun, 05 Feb 2017 06:20:10 | DEBUG    | 9      | timers        | peer 172.24.74.46 ASN 64512   Send Timer 39 second(s) left
Sun, 05 Feb 2017 06:20:10 | INFO     | 9      | processes     | Command from process http-api : withdraw route 6.6.6.6  label [800000] 
Sun, 05 Feb 2017 06:20:10 | WARNING  | 9      | reactor       | callback | handling 'withdraw route 6.6.6.6 label [ 800000 ]' with withdraw_route
Sun, 05 Feb 2017 06:20:10 | INFO     | 9      | processes     | Command from process http-api :  
Sun, 05 Feb 2017 06:20:10 | WARNING  | 9      | reactor       | Command from process not understood : 
Sun, 05 Feb 2017 06:20:10 | INFO     | 9      | reactor       | callback | installing withdraw_route
Sun, 05 Feb 2017 06:20:10 | INFO     | 9      | reactor       | callback | running
Sun, 05 Feb 2017 06:20:10 | INFO     | 9      | configuration | . route            | '6.6.6.6' 'label' '[' '800000' ']'
Sun, 05 Feb 2017 06:20:10 | INFO     | 9      | reactor       | route removed from neighbor 172.24.74.46 local-ip 192.168.0.2 local-as 64512 peer-as 64512 router-id 192.168.0.2 family-allowed in-open : 6.6.6.6/32 label [ 800000 ] next-hop 0.0.0.0
Sun, 05 Feb 2017 06:20:10 | INFO     | 9      | reactor       | responding to http-api : done
Sun, 05 Feb 2017 06:20:10 | INFO     | 9      | reactor       | callback | removing
10.95.5.21 - - [05/Feb/2017 06:20:10] "POST / HTTP/1.1" 200 -
Sun, 05 Feb 2017 06:20:10 | INFO     | 9      | reactor       | performing dynamic route update
Sun, 05 Feb 2017 06:20:10 | INFO     | 9      | reactor       | updated peers dynamic routes successfully
Sun, 05 Feb 2017 06:20:10 | DEBUG    | 9      | timers        | peer 172.24.74.46 ASN 64512   Receive Timer 158 second(s) left
Sun, 05 Feb 2017 06:20:10 | DEBUG    | 9      | timers        | peer 172.24.74.46 ASN 64512   Send Timer 39 second(s) left
Sun, 05 Feb 2017 06:20:10 | DEBUG    | 9      | wire          | session 1 outgoing 192.168.0.2 / 172.24.74.46       SENDING  (  37) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 0025 0200 0000 0E80 0F0B 0001 0438 C350 0106 0606 06
Sun, 05 Feb 2017 06:20:10 | INFO     | 9      | message       | Peer    172.24.74.46 ASN 64512   >> 1 UPDATE(s)
Sun, 05 Feb 2017 06:20:10 | INFO     | 9      | processes     | Command from process http-api : announce route 6.6.6.6 next-hop 3.3.3.3 label [116633 116386] 
Sun, 05 Feb 2017 06:20:10 | WARNING  | 9      | reactor       | callback | handling 'announce route 6.6.6.6 next-hop 3.3.3.3 label [ 116633 116386 ]' with announce_route
Sun, 05 Feb 2017 06:20:10 | INFO     | 9      | reactor       | callback | installing announce_route
Sun, 05 Feb 2017 06:20:10 | INFO     | 9      | reactor       | callback | running
Sun, 05 Feb 2017 06:20:10 | INFO     | 9      | configuration | . route            | '6.6.6.6' 'next-hop' '3.3.3.3' 'label' '[' '116633' '116386' ']'
Sun, 05 Feb 2017 06:20:10 | INFO     | 9      | reactor       | route added to neighbor 172.24.74.46 local-ip 192.168.0.2 local-as 64512 peer-as 64512 router-id 192.168.0.2 family-allowed in-open : 6.6.6.6/32 label [ 116633 116386 ] next-hop 3.3.3.3
Sun, 05 Feb 2017 06:20:10 | INFO     | 9      | reactor       | responding to http-api : done
Sun, 05 Feb 2017 06:20:10 | INFO     | 9      | reactor       | callback | removing
Sun, 05 Feb 2017 06:20:11 | INFO     | 9      | reactor       | performing dynamic route update
Sun, 05 Feb 2017 06:20:11 | INFO     | 9      | reactor       | updated peers dynamic routes successfully
Sun, 05 Feb 2017 06:20:11 | DEBUG    | 9      | timers        | peer 172.24.74.46 ASN 64512   Receive Timer 157 second(s) left
Sun, 05 Feb 2017 06:20:11 | DEBUG    | 9      | timers        | peer 172.24.74.46 ASN 64512   Send Timer 38 second(s) left
Sun, 05 Feb 2017 06:20:11 | DEBUG    | 9      | wire          | session 1 outgoing 192.168.0.2 / 172.24.74.46       SENDING  (  67) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 0043 0200 0000 2C80 0E14 0001 0404 0303 0303 0050 1C79 901C 6A21 0606 0606 4001 0100 4002 0040 0304 0303 0303 4005 0400 0000 64
Sun, 05 Feb 2017 06:20:11 | INFO     | 9      | message       | Peer    172.24.74.46 ASN 64512   >> 1 UPDATE(s)
Sun, 05 Feb 2017 06:20:12 | DEBUG    | 9      | timers        | peer 172.24.74.46 ASN 64512   Receive Timer 156 second(s) left
Sun, 05 Feb 2017 06:20:12 | DEBUG    | 9      | timers        | peer 172.24.74.46 ASN 64512   Send Timer 37 second(s) left
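
The two withdraw UPDATEs in the logs above can be compared by listing the path attributes each one carries. The following is a minimal sketch, not ExaBGP code: the function name `update_attributes` and the constants are made up for illustration, and the hex strings are copied from the two `SENDING` wire lines (the 51-byte message from the failing run and the 37-byte message from the working run).

```python
import struct

# Attribute type codes from RFC 4271 and RFC 4760.
ATTRIBUTE_NAMES = {
    1: "ORIGIN", 2: "AS_PATH", 3: "NEXT_HOP", 4: "MULTI_EXIT_DISC",
    5: "LOCAL_PREF", 14: "MP_REACH_NLRI", 15: "MP_UNREACH_NLRI",
}


def update_attributes(hex_dump):
    """Return the names of the path attributes in one BGP UPDATE (hypothetical helper)."""
    raw = bytes.fromhex("".join(hex_dump.split()))
    # 16-byte marker, then 2-byte length and 1-byte message type.
    length, msg_type = struct.unpack("!HB", raw[16:19])
    if msg_type != 2 or length != len(raw):
        raise ValueError("not a complete BGP UPDATE")
    # Skip the (IPv4 unicast) withdrawn routes field.
    (withdrawn_len,) = struct.unpack("!H", raw[19:21])
    pos = 21 + withdrawn_len
    (attr_len,) = struct.unpack("!H", raw[pos:pos + 2])
    pos += 2
    end = pos + attr_len
    names = []
    while pos < end:
        flags, code = raw[pos], raw[pos + 1]
        if flags & 0x10:  # extended-length bit: 2-byte attribute length
            (value_len,) = struct.unpack("!H", raw[pos + 2:pos + 4])
            pos += 4
        else:
            value_len = raw[pos + 2]
            pos += 3
        names.append(ATTRIBUTE_NAMES.get(code, f"type {code}"))
        pos += value_len
    return names


MARKER = "FF" * 16
# Withdraw sent at 23:32:13 in the failing run.
FAILING_WITHDRAW = MARKER + (
    "0033 0200 0000 1C40 0101 0040 0200 4005 0400 0000 6480 0F0B "
    "0001 0438 C350 0106 0606 06"
)
# Withdraw sent at 06:20:10 in the working run.
WORKING_WITHDRAW = MARKER + "0025 0200 0000 0E80 0F0B 0001 0438 C350 0106 0606 06"

print(update_attributes(FAILING_WITHDRAW))
print(update_attributes(WORKING_WITHDRAW))
```

Decoded this way, the failing withdraw carries ORIGIN, AS_PATH and LOCAL_PREF alongside MP_UNREACH_NLRI but no NEXT_HOP, while the working withdraw carries MP_UNREACH_NLRI alone.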

russellkelly avatar Feb 04 '17 23:02 russellkelly

@brijohn could it be related to your changes? I cannot think of anything else which could have caused it.

thomas-mangin avatar Feb 06 '17 10:02 thomas-mangin

Thanks - btw, the exabgp build I am using is from late Nov last year, not 8 days ago (that's when the container was added to, but the exabgp was already installed), so these API and dict changes would only be in the new setup.

russellkelly avatar Feb 06 '17 12:02 russellkelly

@russellkelly I am confused... the issue is in master and the bug was introduced at some point between November and now? Is that what you are saying?

thomas-mangin avatar Feb 06 '17 13:02 thomas-mangin

Hi Thomas - yes the issue is in master, and all I was saying is that the issue is not less than 8 days old, so the changes you referenced are only in my new setup, and seem to be related to the issue I'm seeing. Sorry for the confusion.

russellkelly avatar Feb 06 '17 13:02 russellkelly

So I don't think my change to the way update messages are generated should have caused this. By the time that code gets called, we have already been passed the set of attributes we are going to send, and we always include those in any Update message we generate.

I have not seen this happen during any of our testing either. The version of 4.0 we use is also from around mid-November, with the addition of the various patches that I have committed upstream.

brijohn avatar Feb 06 '17 14:02 brijohn

looked - could not find the root cause immediately, will investigate more later.

thomas-mangin avatar Feb 06 '17 19:02 thomas-mangin

https://github.com/Exa-Networks/exabgp/commit/6e4ab40e8fcf0e331071e5c415bb68db6a954fc5

As these are label stacks I am removing and adding a route for the same FEC (legitimately) - wondering if the above change stopped this working somehow...?
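
To make the label stacks in the wire dumps easier to read, here is a small sketch that decodes a labelled-unicast IPv4 NLRI (prefix length in bits, then 3-byte label entries whose lowest bit marks the bottom of the stack, then the prefix). The function name `decode_labelled_nlri` is hypothetical, and the two byte strings are copied from the MP_UNREACH_NLRI and MP_REACH_NLRI attributes in the logs of the first post.

```python
import ipaddress


def decode_labelled_nlri(data):
    """Decode one labelled IPv4 NLRI into (labels, prefix); illustrative only."""
    bit_len = data[0]  # covers the label entries plus the prefix
    pos = 1
    labels = []
    while True:
        entry = int.from_bytes(data[pos:pos + 3], "big")
        pos += 3
        labels.append(entry >> 4)  # top 20 bits are the label value
        if entry & 1:              # bottom-of-stack bit
            break
    prefix_bits = bit_len - 24 * len(labels)
    prefix_bytes = (prefix_bits + 7) // 8
    address = data[pos:pos + prefix_bytes].ljust(4, b"\x00")
    return labels, f"{ipaddress.IPv4Address(address)}/{prefix_bits}"


# NLRI from the withdraw: 38 C35001 06060606
withdrawn = bytes.fromhex("38C3500106060606")
# NLRI from the working announce: 50 1C7990 1C6A21 06060606
announced = bytes.fromhex("501C79901C6A2106060606")

print(decode_labelled_nlri(withdrawn))
print(decode_labelled_nlri(announced))
```

Both NLRIs decode to the same prefix, 6.6.6.6/32, with label 800000 in the withdraw and the stack 116633, 116386 in the announce, matching the API commands shown in the logs.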

russellkelly avatar Feb 07 '17 04:02 russellkelly

Hi Thomas,

Looked into this further from my side, and it seemed I was sending a withdraw without a NH - so it was this causing the issue. It didn't bring down the session in the past - but now it does. Seems like Exa is working correctly IMO, as a NH should be specified.... I'll close this issue.

russellkelly avatar Feb 09 '17 23:02 russellkelly

Thank you @russellkelly

thomas-mangin avatar Feb 10 '17 12:02 thomas-mangin

Always adding a next-hop if not present in the withdraw https://github.com/Exa-Networks/exabgp/commit/b36d7910a4354791980d17a2a77b7df12df7cbc5

thomas-mangin avatar Mar 04 '17 18:03 thomas-mangin

I am a bit confused. Why would you send path attributes in a withdraw-only UPDATE?

jkldgoefgkljefogeg avatar Mar 01 '19 05:03 jkldgoefgkljefogeg

It should not be required, but clearly from the bug report some implementations are expecting it ... "be liberal in what you accept and conservative in what you send" ... So if sending it (which is harmless) does prevent issues, I see no harm in doing so (it wastes a few CPU cycles).

thomas-mangin avatar Mar 01 '19 13:03 thomas-mangin

I think I am seeing the exact opposite on Arista where sending withdraw with nexthop results in a BGP notification.

jkldgoefgkljefogeg avatar Mar 01 '19 20:03 jkldgoefgkljefogeg

sigh ! Time to re-read RFCs ...

thomas-mangin avatar Mar 01 '19 21:03 thomas-mangin

We just upgraded from 3.4.26 to 4.1.0 and our sessions break when sending an update withdrawing a route. Broken meaning our Nokia routers reset the session with Missing Well-known Attribute. In our case it appears that exabgp is no longer sending the routes to be withdrawn?
3.4.26.pdf 4.1.0.pdf

adudek16 avatar Feb 13 '20 19:02 adudek16

https://tools.ietf.org/html/rfc4760

   An UPDATE message that contains the MP_UNREACH_NLRI is not required
   to carry any other path attributes.

next-hop is an attribute.

thomas-mangin avatar Feb 13 '20 21:02 thomas-mangin

If you edit the file with the reverse of the patch above and remove the "and part", you can revert the behaviour to what Nokia seems to want.

thomas-mangin avatar Feb 13 '20 21:02 thomas-mangin

In text 3.4.26

        Update Message (2), length: 46
          Multi-Protocol Unreach NLRI (15), length: 20, Flags [O]:
            AFI: IPv6 (2), SAFI: Unicast (1)
              xxxx:xxxx:xxxx:xxxx::xxxx/128

vs 4.1.0

	Update Message (2), length: 30
	  Multi-Protocol Unreach NLRI (15), length: 3, Flags [OE]:
	    AFI: IPv6 (2), SAFI: Unicast (1)
	      End-of-Rib Marker (empty NLRI)

adudek16 avatar Feb 13 '20 22:02 adudek16

If I should open a new issue then I can. My issue is more about the missing "Withdrawn Routes" which I am pretty sure is required, otherwise why would you send an Update with Unreach NLRI?

adudek16 avatar Feb 13 '20 22:02 adudek16

I ran into the exact same issue today while testing ExaBGP 4.2.6 against an older version of FRR running on Cumulus Linux 3.7.12. And at first, I also thought that this was an ExaBGP issue since I was not facing any problems with ExaBGP 3.x in the past.

This is what I did on the ExaBGP 4.2.6 side (2001:db8::200):

exabgp-cli neighbor 2001:db8::100 announce route 2001:db8:dead:beef::1/128 next-hop 2001:db8:cafe::1
exabgp-cli neighbor 2001:db8::100 withdraw route 2001:db8:dead:beef::1/128 next-hop 2001:db8:cafe::1

And this is what happened on the Cumulus Linux 3.7.12 side (2001:db8::100):

cumulus-test bgpd[6734]: 2001:db8::200 Missing well-known attribute NEXT_HOP.
cumulus-test bgpd[6734]: %NOTIFICATION: sent to neighbor 2001:db8::200 3/3 (UPDATE Message Error/Missing Well-known Attribute) 1 bytes 03
cumulus-test bgpd[6734]: bgp_process_packet: BGP UPDATE receipt failed for peer: 2001:db8::200
cumulus-test bgpd[6734]: %ADJCHANGE: neighbor 2001:db8::200(Unknown) in vrf public Down BGP Notification send
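
The `1 bytes 03` at the end of that NOTIFICATION is its Data field. Per RFC 4271, for UPDATE Message Error (code 3) with subcode 3 (Missing Well-known Attribute) the Data field holds the type code of the attribute the receiver considers missing. The sketch below decodes it; the helper name `missing_attribute` is made up, and the sample bytes are the header (`0016 03`) and body (`0303 03`) from the `RECEIVED` wire lines in the first post of this thread.

```python
import struct

UPDATE_MESSAGE_ERROR = 3
MISSING_WELL_KNOWN_ATTRIBUTE = 3
# Well-known attribute type codes from RFC 4271.
ATTRIBUTE_NAMES = {1: "ORIGIN", 2: "AS_PATH", 3: "NEXT_HOP", 5: "LOCAL_PREF"}


def missing_attribute(hex_dump):
    """Name the attribute reported missing by a (3,3) NOTIFICATION, else None."""
    raw = bytes.fromhex("".join(hex_dump.split()))
    length, msg_type = struct.unpack("!HB", raw[16:19])
    if msg_type != 3 or length != len(raw):
        raise ValueError("not a complete BGP NOTIFICATION")
    code, subcode, data = raw[19], raw[20], raw[21:]
    if (code, subcode) != (UPDATE_MESSAGE_ERROR, MISSING_WELL_KNOWN_ATTRIBUTE):
        return None
    if len(data) != 1:
        return None
    return ATTRIBUTE_NAMES.get(data[0], f"type {data[0]}")


NOTIFICATION = "FF" * 16 + "0016 03" + "0303 03"
print(missing_attribute(NOTIFICATION))
```

For both the 2017 capture and the Cumulus log line above, the data byte is `03`, the NEXT_HOP type code, which is consistent with bgpd's own "Missing well-known attribute NEXT_HOP" message.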

Using tcpdump, you can see the difference between the announce and the withdraw update.

announce (there is a nexthop attribute):

21:12:30.686192 fa:16:3e:22:0a:8b > fa:16:3e:4f:6f:0b, ethertype IPv6 (0x86dd), length 171: (flowlabel 0x491a8, hlim 64, next-header TCP (6) payload length: 117) 2001:db8::200.41291 > 2001:db8::100.179: Flags [P.], cksum 0x0694 (correct), seq 1884233867:1884233944, ack 2547636196, win 507, options [nop,nop,md5 valid], length 77: BGP
	Update Message (2), length: 77
	  Origin (1), length: 1, Flags [T]: IGP
	    0x0000:  00
	  AS Path (2), length: 6, Flags [T]: 65200 
	    0x0000:  0201 0000 feb0
	  Multi-Protocol Reach NLRI (14), length: 38, Flags [O]: 
	    AFI: IPv6 (2), SAFI: Unicast (1)
	    nexthop: 2001:db8:cafe::1, nh-length: 16, no SNPA
	      2001:db8:dead:beef::1/128
	    0x0000:  0002 0110 2001 0db8 cafe 0000 0000 0000
	    0x0010:  0000 0001 0080 2001 0db8 dead beef 0000
	    0x0020:  0000 0000 0001

withdraw (there is no nexthop attribute):

21:12:32.103664 fa:16:3e:22:0a:8b > fa:16:3e:4f:6f:0b, ethertype IPv6 (0x86dd), length 153: (flowlabel 0x491a8, hlim 64, next-header TCP (6) payload length: 99) 2001:db8::200.41291 > 2001:db8::100.179: Flags [P.], cksum 0xc11d (correct), seq 77:136, ack 1, win 507, options [nop,nop,md5 valid], length 59: BGP
	Update Message (2), length: 59
	  Origin (1), length: 1, Flags [T]: IGP
	    0x0000:  00
	  AS Path (2), length: 6, Flags [T]: 65200 
	    0x0000:  0201 0000 feb0
	  Multi-Protocol Unreach NLRI (15), length: 20, Flags [O]: 
	    AFI: IPv6 (2), SAFI: Unicast (1)
	      2001:db8:dead:beef::1/128
	    0x0000:  0002 0180 2001 0db8 dead beef 0000 0000
	    0x0010:  0000 0001

Let me just rephrase quickly what Thomas mentioned earlier because it turns out that ExaBGP 4.2.6 is behaving correctly here:

According to RFC4760, "an UPDATE message that contains the MP_UNREACH_NLRI (see the tcpdump output of the withdraw above as an example) is not required to carry any other path attributes." - and because the nexthop is a path attribute, it is not required to be sent along in the Multi-Protocol Unreach NLRI section.
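
As an illustration of what a withdraw carrying only MP_UNREACH_NLRI looks like on the wire, here is a minimal sketch that builds one for an IPv6 unicast prefix. The function names `mp_unreach_ipv6` and `withdraw_update` are hypothetical and this is not how ExaBGP generates messages; it only assumes the attribute layout from RFC 4760 (AFI, SAFI, then the withdrawn prefixes) and a value shorter than 256 bytes.

```python
import ipaddress
import struct


def mp_unreach_ipv6(prefix):
    """Encode an MP_UNREACH_NLRI attribute for one IPv6 unicast prefix."""
    network = ipaddress.IPv6Network(prefix)
    prefix_bytes = (network.prefixlen + 7) // 8
    nlri = bytes([network.prefixlen]) + network.network_address.packed[:prefix_bytes]
    value = struct.pack("!HB", 2, 1) + nlri  # AFI 2 (IPv6), SAFI 1 (unicast)
    # Flags 0x80 (optional, non-transitive), type 15, 1-byte length.
    return bytes([0x80, 15, len(value)]) + value


def withdraw_update(prefix):
    """Build a full UPDATE whose only path attribute is MP_UNREACH_NLRI."""
    attributes = mp_unreach_ipv6(prefix)
    body = struct.pack("!H", 0)                 # no IPv4 withdrawn routes
    body += struct.pack("!H", len(attributes))  # total path attribute length
    body += attributes
    header = b"\xff" * 16 + struct.pack("!HB", 19 + len(body), 2)
    return header + body


message = withdraw_update("2001:db8:dead:beef::1/128")
print(len(message), mp_unreach_ipv6("2001:db8:dead:beef::1/128")[3:].hex())
```

The attribute value produced here matches the 20 bytes shown in the tcpdump of the withdraw above, and the resulting message is 46 bytes long, the same length as the 3.4.26 withdraw quoted earlier in the thread.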

Now since I will need to work around this issue for the time being (basically until I receive a patch from Cumulus), I would like to understand the following statement in more detail:

Thomas Mangin wrote: if you edit the file with the reverse of the patch above and remove the "and part", you can revert the behaviour to what Nokia seems to want.

@thomas-mangin Hoping that this will help others facing the same issue, can you please shed some light on what exactly needs to be done in order to include the nexthop in all update messages containing a "Multi-Protocol Unreach NLRI"?

PS: I already tested ExaBGP 4.2.6 against FRR 7.3 and it seems that the issue has already been fixed in upstream FRR. So only older versions of FRR seem to be non-compliant with RFC4760.

subsecond avatar Apr 16 '20 22:04 subsecond