pyroute2
pyroute2 copied to clipboard
NDB: unicast FDB entry missing in NDB dump, but present in dump via IPRoute
Ugh it will be a lot of text 😃
Intro
I have function, that compares all the FDB records we got via NDB.fdb.dump()
with corresponding records we got via IPRoute.fdb("dump")
.
Comparison code:
def check_fdb_consistency(self) -> None:
records = list(self.ndb.fdb.dump())
ndb_fdb = {
(
record.ifindex,
record.dst or None,
record.lladdr,
record.vlan or None,
record.master,
)
for record in records
}
fdb_dump = self.ip.fdb("dump")
for fdb in fdb_dump:
item = (
fdb["ifindex"],
fdb.get_attr("NDA_DST"),
fdb.get_attr("NDA_LLADDR"),
fdb.get_attr("NDA_VLAN"),
fdb.get_attr("NDA_MASTER"),
)
if item not in ndb_fdb:
LOG.error(f"NDB FDB: {item} not found in NDB dump!")
Before reproducing:
I've added some logging into pyroute2/ndb/objects/neighbour.py
: https://github.com/C2Devel/pyroute2/commit/d5d5f5cd07290b39e598a65403b1071af4fc971f
FDB state before reproducing:
[centos@vtep1 ~]$ bridge fdb show
33:33:00:00:00:01 dev eth0 self permanent
01:00:5e:00:00:01 dev eth0 self permanent
33:33:ff:04:88:a1 dev eth0 self permanent
33:33:00:00:00:01 dev docker0 self permanent
01:00:5e:00:00:01 dev docker0 self permanent
33:33:ff:9a:f7:f4 dev docker0 self permanent
02:42:dd:9a:f7:f4 dev docker0 master docker0 permanent
02:42:dd:9a:f7:f4 dev docker0 vlan 1 master docker0 permanent
33:33:00:00:00:01 dev br-vxlan self permanent
01:00:5e:00:00:01 dev br-vxlan self permanent
33:33:ff:a3:49:b3 dev br-vxlan self permanent
ae:75:a9:46:b1:44 dev br-vxlan vlan 1 master br-vxlan permanent
33:33:00:00:00:01 dev ovs-system self permanent
ae:75:a9:46:b1:44 dev brbond1 master br-vxlan permanent
33:33:00:00:00:01 dev brbond1 self permanent
Start fdb monitoring:
[centos@vtep1 ~]$ bridge monitor fdb
Catching the missing event
This monitoring output will be empty until my service adds a vxlan interface and appends two FDBs, one multicast and one unicast:
2023.12.27 13:34:29.415 ( hw.py:0398): I: vxln7: creating vxlan interface, vlan_id=34...
2023.12.27 13:34:31.503 ( hw.py:0483): I: vxln7: append FDB 00:00:00:00:00:00 to 172.31.0.6
2023.12.27 13:34:31.509 ( hw.py:0507): I: vxln7: adding FDB d0:fe:00:00:00:07 to 172.31.0.6
NDB shows a lot of log entries (previously added logging):
2023-12-27 13:34:29,501 INFO pyroute2.ndb.140475340737120.schema: HIT AF_BRIDGE: {'family': 7, '__pad': (), 'ifindex': 26, 'state': 128, 'flags': 0, 'ndm_type': 0, 'attrs': [('NDA_LLADDR', '5e:8b:f8:80:f3:37'), ('NDA_MASTER', 4), ('NDA_CACHEINFO', {'ndm_confirmed': 0, 'ndm_used': 0, 'ndm_updated': 0, 'ndm_refcnt': 0}), ('NDA_IFINDEX', 26)], 'header': {'length': 68, 'type': 28, 'flags': 0, 'sequence_number': 0, 'pid': 0, 'error': None, 'target': 'localhost', 'stats': Stats(qsize=5, delta=-1, delay=0)}, 'event': 'RTM_NEWNEIGH'}
2023.12.27 13:34:29.501 (neighbour.py:0026): I: HIT AF_BRIDGE: {'family': 7, '__pad': (), 'ifindex': 26, 'state': 128, 'flags': 0, 'ndm_type': 0, 'attrs': [('NDA_LLADDR', '5e:8b:f8:80:f3:37'), ('NDA_MASTER', 4), ('NDA_CACHEINFO', {'ndm_confirmed': 0, 'ndm_used': 0, 'ndm_updated': 0, 'ndm_refcnt': 0}), ('NDA_IFINDEX', 26)], 'header': {'length': 68, 'type': 28, 'flags': 0, 'sequence_number': 0, 'pid': 0, 'error': None, 'target': 'localhost', 'stats': Stats(qsize=5, delta=-1, delay=0)}, 'event': 'RTM_NEWNEIGH'}
2023-12-27 13:34:29,506 INFO pyroute2.ndb.140475340737120.schema: HIT AF_BRIDGE: {'family': 7, '__pad': (), 'ifindex': 26, 'state': 128, 'flags': 0, 'ndm_type': 0, 'attrs': [('NDA_LLADDR', '5e:8b:f8:80:f3:37'), ('NDA_MASTER', 4), ('NDA_CACHEINFO', {'ndm_confirmed': 0, 'ndm_used': 0, 'ndm_updated': 0, 'ndm_refcnt': 0}), ('NDA_VLAN', 1), ('NDA_IFINDEX', 26)], 'header': {'length': 76, 'type': 28, 'flags': 0, 'sequence_number': 0, 'pid': 0, 'error': None, 'target': 'localhost', 'stats': Stats(qsize=4, delta=-1, delay=0)}, 'event': 'RTM_NEWNEIGH'}
...
And at that moment bridge monitor fdb
shows the following events:
5e:8b:f8:80:f3:37 dev vxln7 master br-vxlan permanent
5e:8b:f8:80:f3:37 dev vxln7 vlan 1 master br-vxlan permanent
Deleted ae:75:a9:46:b1:44 dev br-vxlan vlan 1 master br-vxlan permanent
5e:8b:f8:80:f3:37 dev vxln7 vlan 34 master br-vxlan permanent
ae:75:a9:46:b1:44 dev brbond1 vlan 34 master br-vxlan permanent
Deleted ae:75:a9:46:b1:44 dev brbond1 vlan 34 master br-vxlan permanent
ae:75:a9:46:b1:44 dev brbond1 vlan 34 master br-vxlan permanent
00:00:00:00:00:00 dev vxln7 dst 172.31.0.6 self offload permanent
00:00:00:00:00:00 dev vxln7 self permanent
d0:fe:00:00:00:07 dev vxln7 dst 172.31.0.6 self offload
d0:fe:00:00:00:07 dev vxln7 vlan 34 self permanent
Then I run the consistency checking function, and it complains that one FDB entry from IPRoute.fdb("dump")
is not found in dump it got via NDB:
2023.12.27 13:34:35.421 ( hw.py:0670): E: NDB FDB: (4, None, '5e:8b:f8:80:f3:37', 1, 4) not found in NDB dump!
That tuple contains: (ifindex, dst, lladdr, vlan, master),
I double-checked manually and it's really absent if you try to dump via preexisting NDB object. Also there was no log entry about it too.
Using new NDB object in the same python process to check if NDB missed the event
But if you run dump via newly created object - there will be this FDB record, as well as log entry:
In [12]: NDB(log="debug").fdb.dump()
2023-12-27 13:50:25,409 DEBUG pyroute2.ndb.140474915607408.sources.localhost: init
2023.12.27 13:50:25.409 ( events.py:0084): D: init
2023-12-27 13:50:25,410 DEBUG pyroute2.ndb.140474915607408.sources.localhost: starting the source
2023.12.27 13:50:25.410 ( source.py:0448): D: starting the source
2023-12-27 13:50:25,411 DEBUG pyroute2.ndb.140474915607408.sources.localhost/nsmanager: init
2023.12.27 13:50:25.411 ( events.py:0084): D: init
2023-12-27 13:50:25,412 DEBUG pyroute2.ndb.140474915607408.sources.localhost/nsmanager: starting the source
2023.12.27 13:50:25.412 ( source.py:0448): D: starting the source
2023-12-27 13:50:25,413 DEBUG pyroute2.ndb.140474915607408.sources.localhost: connecting
2023.12.27 13:50:25.413 ( events.py:0084): D: connecting
2023-12-27 13:50:25,413 DEBUG pyroute2.ndb.140474915607408.sources.localhost: loading
2023.12.27 13:50:25.413 ( events.py:0084): D: loading
2023-12-27 13:50:25,413 DEBUG pyroute2.ndb.140474915607408.sources.localhost/nsmanager: connecting
2023.12.27 13:50:25.413 ( events.py:0084): D: connecting
2023-12-27 13:50:25,415 DEBUG pyroute2.ndb.140474915607408.sources.localhost/nsmanager: loading
2023.12.27 13:50:25.415 ( events.py:0084): D: loading
2023-12-27 13:50:25,418 DEBUG pyroute2.ndb.140474915607408.sources.localhost: running
2023.12.27 13:50:25.418 ( events.py:0084): D: running
2023-12-27 13:50:25,428 DEBUG pyroute2.ndb.140474915607408.sources.localhost/nsmanager: running
2023.12.27 13:50:25.428 ( events.py:0084): D: running
...
2023.12.27 13:50:25.439 (neighbour.py:0026): I: HIT AF_BRIDGE: {'family': 7, '__pad': (), 'ifindex': 4, 'state': 128, 'flags': 0, 'ndm_type': 0, 'attrs': [('NDA_LLADDR', '5e:8b:f8:80:f3:37'), ('NDA_MASTER', 4), ('NDA_CACHEINFO', {'ndm_confirmed': 0, 'ndm_used': 95598, 'ndm_updated': 95598, 'ndm_refcnt': 0}), ('NDA_VLAN', 1), ('NDA_IFINDEX', 4)], 'header': {'length': 76, 'type': 28, 'flags': 2, 'sequence_number': 257, 'pid': 4294957806, 'error': None, 'target': 'localhost', 'stats': Stats(qsize=0, delta=0, delay=0)}, 'event': 'RTM_NEWNEIGH'}
...
# a lot of other events logged as well, but it's ok with that records
Out[12]:
('localhost', 0, 7, 2, 128, 2, 0, '', '33:33:00:00:00:01', None, 0, None, None, 2, None, None, None)
('localhost', 0, 7, 2, 128, 2, 0, '', '01:00:5e:00:00:01', None, 0, None, None, 2, None, None, None)
('localhost', 0, 7, 2, 128, 2, 0, '', '33:33:ff:04:88:a1', None, 0, None, None, 2, None, None, None)
('localhost', 0, 7, 3, 128, 2, 0, '', '33:33:00:00:00:01', None, 0, None, None, 3, None, None, None)
('localhost', 0, 7, 3, 128, 2, 0, '', '01:00:5e:00:00:01', None, 0, None, None, 3, None, None, None)
('localhost', 0, 7, 3, 128, 2, 0, '', '33:33:ff:9a:f7:f4', None, 0, None, None, 3, None, None, None)
('localhost', 0, 7, 3, 128, 0, 0, '', '02:42:dd:9a:f7:f4', None, 0, None, None, 3, 3, None, None)
('localhost', 0, 7, 3, 128, 0, 0, '', '02:42:dd:9a:f7:f4', None, 1, None, None, 3, 3, None, None)
('localhost', 0, 7, 4, 128, 2, 0, '', '33:33:00:00:00:01', None, 0, None, None, 4, None, None, None)
('localhost', 0, 7, 4, 128, 2, 0, '', '01:00:5e:00:00:01', None, 0, None, None, 4, None, None, None)
('localhost', 0, 7, 4, 128, 2, 0, '', '33:33:ff:a3:49:b3', None, 0, None, None, 4, None, None, None)
('localhost', 0, 7, 4, 128, 0, 0, '', '5e:8b:f8:80:f3:37', None, 1, None, None, 4, 4, None, None) # <-- HERE IT IS!
('localhost', 0, 7, 5, 128, 2, 0, '', '33:33:00:00:00:01', None, 0, None, None, 5, None, None, None)
('localhost', 0, 7, 6, 128, 0, 0, '', 'ae:75:a9:46:b1:44', None, 34, None, None, 6, 4, None, None)
('localhost', 0, 7, 6, 128, 0, 0, '', 'ae:75:a9:46:b1:44', None, 0, None, None, 6, 4, None, None)
('localhost', 0, 7, 6, 128, 2, 0, '', '33:33:00:00:00:01', None, 0, None, None, 6, None, None, None)
('localhost', 0, 7, 26, 128, 0, 0, '', '5e:8b:f8:80:f3:37', None, 34, None, None, 26, 4, None, None)
('localhost', 0, 7, 26, 128, 0, 0, '', '5e:8b:f8:80:f3:37', None, 0, None, None, 26, 4, None, None)
('localhost', 0, 7, 26, 196, 18, 1, '172.31.0.6', '00:00:00:00:00:00', None, 0, None, None, 26, None, None, None)
('localhost', 0, 7, 26, 2, 82, 1, '172.31.0.6', 'd0:fe:00:00:00:07', None, 0, None, None, 26, None, None, None)
2023.12.27 13:50:25.622 (selector_events:0054): D: Using selector: EpollSelector
If you run dump on old object, that was instantiated at the beginning dump again (or check routine) that FDB record will be still missing:
In [13]: ndb.fdb.dump()
Out[13]:
('localhost', 0, 7, 2, 128, 2, 0, '', '33:33:00:00:00:01', None, 0, None, None, 2, None, None, None)
('localhost', 0, 7, 2, 128, 2, 0, '', '01:00:5e:00:00:01', None, 0, None, None, 2, None, None, None)
('localhost', 0, 7, 2, 128, 2, 0, '', '33:33:ff:04:88:a1', None, 0, None, None, 2, None, None, None)
('localhost', 0, 7, 3, 128, 2, 0, '', '33:33:00:00:00:01', None, 0, None, None, 3, None, None, None)
('localhost', 0, 7, 3, 128, 2, 0, '', '01:00:5e:00:00:01', None, 0, None, None, 3, None, None, None)
('localhost', 0, 7, 3, 128, 2, 0, '', '33:33:ff:9a:f7:f4', None, 0, None, None, 3, None, None, None)
('localhost', 0, 7, 3, 128, 0, 0, '', '02:42:dd:9a:f7:f4', None, 0, None, None, 3, 3, None, None)
('localhost', 0, 7, 3, 128, 0, 0, '', '02:42:dd:9a:f7:f4', None, 1, None, None, 3, 3, None, None)
('localhost', 0, 7, 4, 128, 2, 0, '', '33:33:00:00:00:01', None, 0, None, None, 4, None, None, None)
('localhost', 0, 7, 4, 128, 2, 0, '', '01:00:5e:00:00:01', None, 0, None, None, 4, None, None, None)
('localhost', 0, 7, 4, 128, 2, 0, '', '33:33:ff:a3:49:b3', None, 0, None, None, 4, None, None, None)
('localhost', 0, 7, 5, 128, 2, 0, '', '33:33:00:00:00:01', None, 0, None, None, 5, None, None, None)
('localhost', 0, 7, 6, 128, 0, 0, '', 'ae:75:a9:46:b1:44', None, 0, None, None, 6, 4, None, None)
('localhost', 0, 7, 6, 128, 2, 0, '', '33:33:00:00:00:01', None, 0, None, None, 6, None, None, None)
('localhost', 0, 7, 26, 128, 0, 0, '', '5e:8b:f8:80:f3:37', None, 0, None, None, 26, 4, None, None)
('localhost', 0, 7, 26, 128, 0, 0, '', '5e:8b:f8:80:f3:37', None, 1, None, None, 26, 4, None, None)
('localhost', 0, 7, 26, 128, 0, 0, '', '5e:8b:f8:80:f3:37', None, 34, None, None, 26, 4, None, None)
('localhost', 0, 7, 6, 128, 0, 0, '', 'ae:75:a9:46:b1:44', None, 34, None, None, 6, 4, None, None)
('localhost', 0, 7, 26, 196, 18, 1, '172.31.0.6', '00:00:00:00:00:00', None, 0, None, None, 26, None, None, None)
('localhost', 0, 7, 26, 128, 2, 0, '', '00:00:00:00:00:00', None, 0, None, None, 26, None, None, None)
('localhost', 0, 7, 26, 2, 82, 1, '172.31.0.6', 'd0:fe:00:00:00:07', None, 0, None, None, 26, None, None, None)
('localhost', 0, 7, 26, 128, 2, 0, '', 'd0:fe:00:00:00:07', None, 34, None, None, 26, None, None, None)
In [15]: vtep.check_fdb_consistency()
2023.12.27 13:53:43.169 ( hw.py:0649): D: FDB cache: checking consistency with NDB...
2023.12.27 13:53:43.172 ( hw.py:0670): E: NDB FDB: (4, None, '5e:8b:f8:80:f3:37', 1, 4) not found in NDB dump!
Could you please guide me how to debug this behaviour properly? I can reproduce it on demand.
Thanks for the detailed report, investigating
@svinota, any ideas what can cause it, where to look for a problem? It's the last issue i'm facing that stops from using NDB interface, btw %)
The issue is that I can not reproduce it on the master branch, even using the same values that you provided.
Could you please drop any script that would reproduce the situation? No restrictions on addresses or tools, I'm running a dedicated VM for debugging.
Fixed in #1200