frr icon indicating copy to clipboard operation
frr copied to clipboard

ISIS Neighbor Stuck in Expiring

Open beith12 opened this issue 2 years ago • 7 comments

  • [X] Did you check if this is a duplicate issue?
  • [ ] Did you test it on the latest FRRouting/frr master branch?

Problem:

On certain links (those that are active for a period of time but not permanently) we have seen ISIS enter a 'Down/Expiring' state and never recover. We are able to clear this state by restarting the FRR process, but not if we use 'clear isis neighbor'. We have seen the issue on both 'broadcast' and 'point-to-point' network types. We have also seen the issue on version 8.3.1 as well as 8.5.2. We added BFD to try help force the neighbor down but this does not resolve the problem.

Some item values below are masked with dummy values.

#Shell Output (show isis neighbor) 
Area ABC:
  System Id           Interface        L  State        Holdtime   SNPA
  abc                    eno4              2  Down        Expiring     0000.1111.2222

Configuration Snippet:

interface eno4
 ip router isis ABC
 isis bfd
 isis bfd profile STANDARD
 isis circuit-type level-2-only
 isis metric 25
 isis password md5 ***
 isis priority 127
 no isis hello padding
!
router isis ABC
 is-type level-2-only
 net 49.0000.wwwx.xxyy.yzzz.00
 lsp-mtu 1400
 lsp-timers gen-interval 1 refresh-interval 300 max-lifetime 1000
 log-adjacency-changes
 segment-routing on
 segment-routing node-msd 8
 segment-routing prefix w.x.y.z/32 index z
!
bfd
 profile STANDARD
  detect-multiplier 10
  transmit-interval 1000
  receive-interval 1000
  echo receive-interval disabled 

To Reproduce: This is not a consistent issue, but when the ISIS neighbor flaps in quick succession we typically see this stuck in expiring issue.

Expected behavior: ISIS neighbor is cleared after timeout expires.

Versions: FRR version 8.5.2 Linux Info: Ubuntu 18.04.4 LTS; kernel 4.15.0-197-generic

beith12 avatar Oct 23 '23 15:10 beith12

This is a message from the log before we see the stuck in expiring state.

ISIS: [XXX] %ADJCHANGE: Adjacency to abc (eno4) for level-2 changed from Unknown to Initializing, unspecified ISIS: [XXX] %ADJCHANGE: Adjacency to abc (eno4) for level-2 changed from Initializing to Down, holding time expired

beith12 avatar Nov 02 '23 12:11 beith12

This sounds like a valid issue, but it's going to be hard to track down without being able to consistently reproduce it, I think ... Can y'all try to narrow down when this happens more?

riw777 avatar Nov 07 '23 13:11 riw777

Do you know of a specific trace or debug that might help us pinpoint the issue? This seems to happen when the neighbor does not fully establish (intermittent up/down link) which makes it hard to reproduce in a lab.

beith12 avatar Nov 07 '23 13:11 beith12

I suggest to activate ISIS debug message and look to the log to see if you could find more useful info. Please, post them to this issue. Useful debug should be:

debug isis adj-packets
debug isis events
debug isis packet-dump
debug isis snp-packets

odd22 avatar Nov 07 '23 16:11 odd22

@odd22 Debugs with timestamps from expiring event

2023/11/20 21:16:46 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 65 . 2023/11/20 21:16:47 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH on eno4, cirType L2, cirID 5 2023/11/20 21:16:47 ISIS: [XXXX] ISIS-Adj (ABC): Adjacency state change 0->1: unspecified 2023/11/20 21:16:47 ISIS: [XXXX] %ADJCHANGE: Adjacency to abc (eno4) for level-2 changed from Unknown to Initializing, unspecified 2023/11/20 21:16:47 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH from 0000.1111.2222 on eno4, cirType L2, cirID 5, length 61 2023/11/20 21:16:47 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:16:50 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:16:50 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH on eno4, cirType L2, cirID 5 2023/11/20 21:16:50 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH from 0000.1111.2222 on eno4, cirType L2, cirID 5, length 61 2023/11/20 21:16:53 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:16:53 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH on eno4, cirType L2, cirID 5 2023/11/20 21:16:53 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH from 0000.1111.2222 on eno4, cirType L2, cirID 5, length 61 2023/11/20 21:16:56 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:16:56 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH on eno4, cirType L2, cirID 5 2023/11/20 21:16:56 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH from 0000.1111.2222 on eno4, cirType L2, cirID 5, length 61 2023/11/20 21:16:59 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:16:59 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH on eno4, cirType L2, cirID 5 2023/11/20 21:16:59 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH from 0000.1111.2222 on eno4, cirType L2, cirID 5, length 61 2023/11/20 21:17:02 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:17:02 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH on eno4, cirType L2, cirID 5 2023/11/20 21:17:02 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH from 0000.1111.2222 on eno4, cirType L2, cirID 5, length 61 2023/11/20 21:17:04 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:17:05 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH on eno4, cirType L2, cirID 5 2023/11/20 21:17:05 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH from 0000.1111.2222 on eno4, cirType L2, cirID 5, length 61 2023/11/20 21:17:07 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:17:08 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH on eno4, cirType L2, cirID 5 2023/11/20 21:17:08 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH from 0000.1111.2222 on eno4, cirType L2, cirID 5, length 61 2023/11/20 21:17:10 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:17:10 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH on eno4, cirType L2, cirID 5 2023/11/20 21:17:10 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH from 0000.1111.2222 on eno4, cirType L2, cirID 5, length 61 2023/11/20 21:17:13 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:17:13 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH on eno4, cirType L2, cirID 5 2023/11/20 21:17:13 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH from 0000.1111.2222 on eno4, cirType L2, cirID 5, length 61 2023/11/20 21:17:16 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:17:16 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH on eno4, cirType L2, cirID 5 2023/11/20 21:17:16 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH from 0000.1111.2222 on eno4, cirType L2, cirID 5, length 61 2023/11/20 21:17:19 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:17:19 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH on eno4, cirType L2, cirID 5 2023/11/20 21:17:19 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH from 0000.1111.2222 on eno4, cirType L2, cirID 5, length 61 2023/11/20 21:17:21 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:17:22 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH on eno4, cirType L2, cirID 5 2023/11/20 21:17:22 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH from 0000.1111.2222 on eno4, cirType L2, cirID 5, length 61 2023/11/20 21:17:24 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:17:25 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH on eno4, cirType L2, cirID 5 2023/11/20 21:17:25 ISIS: [XXXX] ISIS-Adj (ABC): Rcvd L2 LAN IIH from 0000.1111.2222 on eno4, cirType L2, cirID 5, length 61 2023/11/20 21:17:27 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:17:30 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:17:33 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:17:36 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:17:39 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:17:42 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:17:45 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:17:48 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:17:50 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:17:53 ISIS: [XXXX] ISIS-Adj (ABC): Sending L2 LAN IIH on eno4, length 73 2023/11/20 21:17:55 ISIS: [XXXX] ISIS-Adj (ABC): Adjacency state change 1->3: holding time expired 2023/11/20 21:17:55 ISIS: [XXXX] %ADJCHANGE: Adjacency to abc (eno4) for level-2 changed from Initializing to Down, holding time expired

beith12 avatar Nov 21 '23 15:11 beith12

I've managed to consistently reproduce this problem in a lab:

!LAB Topology R1 <-L2 Switch-> R2

!Reproduction Steps On R2 restart FRR service multiple times in short period

!Outcome

#Up State

R1# show isis ne de Area LAB: R2
Interface: ens4, Level: 2, State: Up, Expires in 29s Adjacency flaps: 1, Last: 3m20s ago Circuit type: L2, Speaks: IPv4 SNPA: 5001.0002.0002, LAN id: 0010.0100.1001.03

!!!! R2 FRR service restarts initiated !!!!

#Initializing State

R1# sh isis ne de Area LAB: R2
Interface: ens4, Level: 2, State: Initializing, Expires in 20s Adjacency flaps: 1, Last: 3m49s ago Circuit type: L2, Speaks: IPv4 SNPA: 5001.0002.0002, LAN id: 0000.0000.0000.00

#Down/Expiring State

R1# sh isis ne de Area LAB: R2
Interface: ens4, Level: 2, State: Down Expiring Adjacency flaps: 1, Last: 4m33s ago Circuit type: L2, Speaks: IPv4 SNPA: 5001.0002.0002, LAN id: 0000.0000.0000.00

beith12 avatar Mar 20 '24 15:03 beith12

@odd22 Looking further into how I reproduced this error - It seems that when I stop the FRR service on one side of an ISIS peer (depending on how the peers are configured) I experience the ISIS Expiring problem:

ISIS P2P with BFD: When FRR service is stopped on Device-A, the active node (Device-B) receives a BFD down message followed by an ISIS hello - this triggers the state of Initializing > Down Expiring

ISIS P2P with no BFD: When FRR service is stopped on Device-A, the active node (Device-B) seems to act as expected but relies on ISIS timers.

ISIS B/Cast with BFD: When FRR service is stopped on Device-A, the active node (Device-B) receives a BFD down message followed by an ISIS hello - this triggers the state of Initializing > Down Expiring

ISIS B/Cast with no BFD: When FRR service is stopped on Device-A, the active node (Device-B) records the following messages in logs - this triggers the state of Initializing > Down Expiring:

2024/05/26 15:02:04 ISIS: [ABC] ISIS-Adj (ABC): Starting fast-reroute on state change 2->1: own SNPA not found in LAN Neighbours TLV 2024/05/26 15:02:04 ISIS: [ABC] ISIS-Adj (ABC): Adjacency state change 2->1: own SNPA not found in LAN Neighbours TLV 2024/05/26 15:02:04 ISIS: [ABC] ISIS-Adj (ABC): Rcvd L2 LAN IIH from 5001.0001.0001 on ens5, cirType L2, cirID 4, length 40 2024/05/26 15:02:04 ISIS: [ABC] ISIS-Evt (ABC) DIS status change 2024/05/26 15:02:04 ISIS: [ABC] ISIS-Adj (ABC): Sending L2 LAN IIH on ens5, length 48 2024/05/26 15:02:07 ISIS: [ABC] ISIS-Adj (ABC): Sending L2 LAN IIH on ens5, length 48 2024/05/26 15:02:10 ISIS: [ABC] ISIS-Adj (ABC): Sending L2 LAN IIH on ens5, length 48 2024/05/26 15:02:13 ISIS: [ABC] ISIS-Adj (ABC): Sending L2 LAN IIH on ens5, length 48 2024/05/26 15:02:14 ISIS: [ABC] ISIS-Adj (ABC): Adjacency state change 1->3: holding time expired

beith12 avatar May 28 '24 16:05 beith12

This issue is stale because it has been open 180 days with no activity. Comment or remove the autoclose label in order to avoid having this issue closed.

github-actions[bot] avatar Feb 05 '25 01:02 github-actions[bot]

This issue will be automatically closed in the specified period unless there is further activity.

frrbot[bot] avatar Feb 05 '25 01:02 frrbot[bot]