pyroute2 icon indicating copy to clipboard operation
pyroute2 copied to clipboard

NDB: unicast FDB entry missing in NDB dump, but present in dump via IPRoute

Open pkulev opened this issue 1 year ago • 3 comments

Ugh it will be a lot of text 😃

Intro

I have function, that compares all the FDB records we got via NDB.fdb.dump() with corresponding records we got via IPRoute.fdb("dump").

Comparison code:

    def check_fdb_consistency(self) -> None:
        records = list(self.ndb.fdb.dump())

        ndb_fdb = {
            (
                record.ifindex, 
                record.dst or None, 
                record.lladdr, 
                record.vlan or None, 
                record.master,
            )
            for record in records
        }

        fdb_dump = self.ip.fdb("dump")
        for fdb in fdb_dump:
            item = (
                fdb["ifindex"],
                fdb.get_attr("NDA_DST"),
                fdb.get_attr("NDA_LLADDR"),
                fdb.get_attr("NDA_VLAN"),
                fdb.get_attr("NDA_MASTER"),
            )
            if item not in ndb_fdb:
                LOG.error(f"NDB FDB: {item} not found in NDB dump!")

Before reproducing:

I've added some logging into pyroute2/ndb/objects/neighbour.py: https://github.com/C2Devel/pyroute2/commit/d5d5f5cd07290b39e598a65403b1071af4fc971f

FDB state before reproducing:

[centos@vtep1 ~]$ bridge fdb show
33:33:00:00:00:01 dev eth0 self permanent
01:00:5e:00:00:01 dev eth0 self permanent
33:33:ff:04:88:a1 dev eth0 self permanent
33:33:00:00:00:01 dev docker0 self permanent
01:00:5e:00:00:01 dev docker0 self permanent
33:33:ff:9a:f7:f4 dev docker0 self permanent
02:42:dd:9a:f7:f4 dev docker0 master docker0 permanent
02:42:dd:9a:f7:f4 dev docker0 vlan 1 master docker0 permanent
33:33:00:00:00:01 dev br-vxlan self permanent
01:00:5e:00:00:01 dev br-vxlan self permanent
33:33:ff:a3:49:b3 dev br-vxlan self permanent
ae:75:a9:46:b1:44 dev br-vxlan vlan 1 master br-vxlan permanent
33:33:00:00:00:01 dev ovs-system self permanent
ae:75:a9:46:b1:44 dev brbond1 master br-vxlan permanent
33:33:00:00:00:01 dev brbond1 self permanent

Start fdb monitoring:

[centos@vtep1 ~]$ bridge monitor fdb

Catching the missing event

This monitoring output will be empty until my service adds a vxlan interface and appends two FDBs, one multicast and one unicast:

2023.12.27 13:34:29.415 (     hw.py:0398): I: vxln7: creating vxlan interface, vlan_id=34...
2023.12.27 13:34:31.503 (     hw.py:0483): I: vxln7: append FDB 00:00:00:00:00:00 to 172.31.0.6
2023.12.27 13:34:31.509 (     hw.py:0507): I: vxln7: adding FDB d0:fe:00:00:00:07 to 172.31.0.6

NDB shows a lot of log entries (previously added logging):

2023-12-27 13:34:29,501     INFO pyroute2.ndb.140475340737120.schema: HIT AF_BRIDGE: {'family': 7, '__pad': (), 'ifindex': 26, 'state': 128, 'flags': 0, 'ndm_type': 0, 'attrs': [('NDA_LLADDR', '5e:8b:f8:80:f3:37'), ('NDA_MASTER', 4), ('NDA_CACHEINFO', {'ndm_confirmed': 0, 'ndm_used': 0, 'ndm_updated': 0, 'ndm_refcnt': 0}), ('NDA_IFINDEX', 26)], 'header': {'length': 68, 'type': 28, 'flags': 0, 'sequence_number': 0, 'pid': 0, 'error': None, 'target': 'localhost', 'stats': Stats(qsize=5, delta=-1, delay=0)}, 'event': 'RTM_NEWNEIGH'}
2023.12.27 13:34:29.501 (neighbour.py:0026): I: HIT AF_BRIDGE: {'family': 7, '__pad': (), 'ifindex': 26, 'state': 128, 'flags': 0, 'ndm_type': 0, 'attrs': [('NDA_LLADDR', '5e:8b:f8:80:f3:37'), ('NDA_MASTER', 4), ('NDA_CACHEINFO', {'ndm_confirmed': 0, 'ndm_used': 0, 'ndm_updated': 0, 'ndm_refcnt': 0}), ('NDA_IFINDEX', 26)], 'header': {'length': 68, 'type': 28, 'flags': 0, 'sequence_number': 0, 'pid': 0, 'error': None, 'target': 'localhost', 'stats': Stats(qsize=5, delta=-1, delay=0)}, 'event': 'RTM_NEWNEIGH'}
2023-12-27 13:34:29,506     INFO pyroute2.ndb.140475340737120.schema: HIT AF_BRIDGE: {'family': 7, '__pad': (), 'ifindex': 26, 'state': 128, 'flags': 0, 'ndm_type': 0, 'attrs': [('NDA_LLADDR', '5e:8b:f8:80:f3:37'), ('NDA_MASTER', 4), ('NDA_CACHEINFO', {'ndm_confirmed': 0, 'ndm_used': 0, 'ndm_updated': 0, 'ndm_refcnt': 0}), ('NDA_VLAN', 1), ('NDA_IFINDEX', 26)], 'header': {'length': 76, 'type': 28, 'flags': 0, 'sequence_number': 0, 'pid': 0, 'error': None, 'target': 'localhost', 'stats': Stats(qsize=4, delta=-1, delay=0)}, 'event': 'RTM_NEWNEIGH'}
...

And at that moment bridge monitor fdb shows the following events:

5e:8b:f8:80:f3:37 dev vxln7 master br-vxlan permanent
5e:8b:f8:80:f3:37 dev vxln7 vlan 1 master br-vxlan permanent
Deleted ae:75:a9:46:b1:44 dev br-vxlan vlan 1 master br-vxlan permanent
5e:8b:f8:80:f3:37 dev vxln7 vlan 34 master br-vxlan permanent
ae:75:a9:46:b1:44 dev brbond1 vlan 34 master br-vxlan permanent
Deleted ae:75:a9:46:b1:44 dev brbond1 vlan 34 master br-vxlan permanent
ae:75:a9:46:b1:44 dev brbond1 vlan 34 master br-vxlan permanent
00:00:00:00:00:00 dev vxln7 dst 172.31.0.6 self offload permanent
00:00:00:00:00:00 dev vxln7 self permanent
d0:fe:00:00:00:07 dev vxln7 dst 172.31.0.6 self offload
d0:fe:00:00:00:07 dev vxln7 vlan 34 self permanent

Then I run the consistency checking function, and it complains that one FDB entry from IPRoute.fdb("dump") is not found in dump it got via NDB:

2023.12.27 13:34:35.421 (     hw.py:0670): E: NDB FDB: (4, None, '5e:8b:f8:80:f3:37', 1, 4) not found in NDB dump!

That tuple contains: (ifindex, dst, lladdr, vlan, master),

I double-checked manually and it's really absent if you try to dump via preexisting NDB object. Also there was no log entry about it too.

Using new NDB object in the same python process to check if NDB missed the event

But if you run dump via newly created object - there will be this FDB record, as well as log entry:

In [12]: NDB(log="debug").fdb.dump()
2023-12-27 13:50:25,409    DEBUG pyroute2.ndb.140474915607408.sources.localhost: init
2023.12.27 13:50:25.409 ( events.py:0084): D: init
2023-12-27 13:50:25,410    DEBUG pyroute2.ndb.140474915607408.sources.localhost: starting the source
2023.12.27 13:50:25.410 ( source.py:0448): D: starting the source
2023-12-27 13:50:25,411    DEBUG pyroute2.ndb.140474915607408.sources.localhost/nsmanager: init
2023.12.27 13:50:25.411 ( events.py:0084): D: init
2023-12-27 13:50:25,412    DEBUG pyroute2.ndb.140474915607408.sources.localhost/nsmanager: starting the source
2023.12.27 13:50:25.412 ( source.py:0448): D: starting the source
2023-12-27 13:50:25,413    DEBUG pyroute2.ndb.140474915607408.sources.localhost: connecting
2023.12.27 13:50:25.413 ( events.py:0084): D: connecting
2023-12-27 13:50:25,413    DEBUG pyroute2.ndb.140474915607408.sources.localhost: loading
2023.12.27 13:50:25.413 ( events.py:0084): D: loading
2023-12-27 13:50:25,413    DEBUG pyroute2.ndb.140474915607408.sources.localhost/nsmanager: connecting
2023.12.27 13:50:25.413 ( events.py:0084): D: connecting
2023-12-27 13:50:25,415    DEBUG pyroute2.ndb.140474915607408.sources.localhost/nsmanager: loading
2023.12.27 13:50:25.415 ( events.py:0084): D: loading
2023-12-27 13:50:25,418    DEBUG pyroute2.ndb.140474915607408.sources.localhost: running
2023.12.27 13:50:25.418 ( events.py:0084): D: running
2023-12-27 13:50:25,428    DEBUG pyroute2.ndb.140474915607408.sources.localhost/nsmanager: running
2023.12.27 13:50:25.428 ( events.py:0084): D: running
...

2023.12.27 13:50:25.439 (neighbour.py:0026): I: HIT AF_BRIDGE: {'family': 7, '__pad': (), 'ifindex': 4, 'state': 128, 'flags': 0, 'ndm_type': 0, 'attrs': [('NDA_LLADDR', '5e:8b:f8:80:f3:37'), ('NDA_MASTER', 4), ('NDA_CACHEINFO', {'ndm_confirmed': 0, 'ndm_used': 95598, 'ndm_updated': 95598, 'ndm_refcnt': 0}), ('NDA_VLAN', 1), ('NDA_IFINDEX', 4)], 'header': {'length': 76, 'type': 28, 'flags': 2, 'sequence_number': 257, 'pid': 4294957806, 'error': None, 'target': 'localhost', 'stats': Stats(qsize=0, delta=0, delay=0)}, 'event': 'RTM_NEWNEIGH'}
...
# a lot of other events logged as well, but it's ok with that records

Out[12]:

('localhost', 0, 7, 2, 128, 2, 0, '', '33:33:00:00:00:01', None, 0, None, None, 2, None, None, None)
('localhost', 0, 7, 2, 128, 2, 0, '', '01:00:5e:00:00:01', None, 0, None, None, 2, None, None, None)
('localhost', 0, 7, 2, 128, 2, 0, '', '33:33:ff:04:88:a1', None, 0, None, None, 2, None, None, None)
('localhost', 0, 7, 3, 128, 2, 0, '', '33:33:00:00:00:01', None, 0, None, None, 3, None, None, None)
('localhost', 0, 7, 3, 128, 2, 0, '', '01:00:5e:00:00:01', None, 0, None, None, 3, None, None, None)
('localhost', 0, 7, 3, 128, 2, 0, '', '33:33:ff:9a:f7:f4', None, 0, None, None, 3, None, None, None)
('localhost', 0, 7, 3, 128, 0, 0, '', '02:42:dd:9a:f7:f4', None, 0, None, None, 3, 3, None, None)
('localhost', 0, 7, 3, 128, 0, 0, '', '02:42:dd:9a:f7:f4', None, 1, None, None, 3, 3, None, None)
('localhost', 0, 7, 4, 128, 2, 0, '', '33:33:00:00:00:01', None, 0, None, None, 4, None, None, None)
('localhost', 0, 7, 4, 128, 2, 0, '', '01:00:5e:00:00:01', None, 0, None, None, 4, None, None, None)
('localhost', 0, 7, 4, 128, 2, 0, '', '33:33:ff:a3:49:b3', None, 0, None, None, 4, None, None, None)
('localhost', 0, 7, 4, 128, 0, 0, '', '5e:8b:f8:80:f3:37', None, 1, None, None, 4, 4, None, None)  # <-- HERE IT IS!
('localhost', 0, 7, 5, 128, 2, 0, '', '33:33:00:00:00:01', None, 0, None, None, 5, None, None, None)
('localhost', 0, 7, 6, 128, 0, 0, '', 'ae:75:a9:46:b1:44', None, 34, None, None, 6, 4, None, None)
('localhost', 0, 7, 6, 128, 0, 0, '', 'ae:75:a9:46:b1:44', None, 0, None, None, 6, 4, None, None)
('localhost', 0, 7, 6, 128, 2, 0, '', '33:33:00:00:00:01', None, 0, None, None, 6, None, None, None)
('localhost', 0, 7, 26, 128, 0, 0, '', '5e:8b:f8:80:f3:37', None, 34, None, None, 26, 4, None, None)
('localhost', 0, 7, 26, 128, 0, 0, '', '5e:8b:f8:80:f3:37', None, 0, None, None, 26, 4, None, None)
('localhost', 0, 7, 26, 196, 18, 1, '172.31.0.6', '00:00:00:00:00:00', None, 0, None, None, 26, None, None, None)
('localhost', 0, 7, 26, 2, 82, 1, '172.31.0.6', 'd0:fe:00:00:00:07', None, 0, None, None, 26, None, None, None)

2023.12.27 13:50:25.622 (selector_events:0054): D: Using selector: EpollSelector

If you run dump on old object, that was instantiated at the beginning dump again (or check routine) that FDB record will be still missing:


In [13]: ndb.fdb.dump()
Out[13]:

('localhost', 0, 7, 2, 128, 2, 0, '', '33:33:00:00:00:01', None, 0, None, None, 2, None, None, None)
('localhost', 0, 7, 2, 128, 2, 0, '', '01:00:5e:00:00:01', None, 0, None, None, 2, None, None, None)
('localhost', 0, 7, 2, 128, 2, 0, '', '33:33:ff:04:88:a1', None, 0, None, None, 2, None, None, None)
('localhost', 0, 7, 3, 128, 2, 0, '', '33:33:00:00:00:01', None, 0, None, None, 3, None, None, None)
('localhost', 0, 7, 3, 128, 2, 0, '', '01:00:5e:00:00:01', None, 0, None, None, 3, None, None, None)
('localhost', 0, 7, 3, 128, 2, 0, '', '33:33:ff:9a:f7:f4', None, 0, None, None, 3, None, None, None)
('localhost', 0, 7, 3, 128, 0, 0, '', '02:42:dd:9a:f7:f4', None, 0, None, None, 3, 3, None, None)
('localhost', 0, 7, 3, 128, 0, 0, '', '02:42:dd:9a:f7:f4', None, 1, None, None, 3, 3, None, None)
('localhost', 0, 7, 4, 128, 2, 0, '', '33:33:00:00:00:01', None, 0, None, None, 4, None, None, None)
('localhost', 0, 7, 4, 128, 2, 0, '', '01:00:5e:00:00:01', None, 0, None, None, 4, None, None, None)
('localhost', 0, 7, 4, 128, 2, 0, '', '33:33:ff:a3:49:b3', None, 0, None, None, 4, None, None, None)
('localhost', 0, 7, 5, 128, 2, 0, '', '33:33:00:00:00:01', None, 0, None, None, 5, None, None, None)
('localhost', 0, 7, 6, 128, 0, 0, '', 'ae:75:a9:46:b1:44', None, 0, None, None, 6, 4, None, None)
('localhost', 0, 7, 6, 128, 2, 0, '', '33:33:00:00:00:01', None, 0, None, None, 6, None, None, None)
('localhost', 0, 7, 26, 128, 0, 0, '', '5e:8b:f8:80:f3:37', None, 0, None, None, 26, 4, None, None)
('localhost', 0, 7, 26, 128, 0, 0, '', '5e:8b:f8:80:f3:37', None, 1, None, None, 26, 4, None, None)
('localhost', 0, 7, 26, 128, 0, 0, '', '5e:8b:f8:80:f3:37', None, 34, None, None, 26, 4, None, None)
('localhost', 0, 7, 6, 128, 0, 0, '', 'ae:75:a9:46:b1:44', None, 34, None, None, 6, 4, None, None)
('localhost', 0, 7, 26, 196, 18, 1, '172.31.0.6', '00:00:00:00:00:00', None, 0, None, None, 26, None, None, None)
('localhost', 0, 7, 26, 128, 2, 0, '', '00:00:00:00:00:00', None, 0, None, None, 26, None, None, None)
('localhost', 0, 7, 26, 2, 82, 1, '172.31.0.6', 'd0:fe:00:00:00:07', None, 0, None, None, 26, None, None, None)
('localhost', 0, 7, 26, 128, 2, 0, '', 'd0:fe:00:00:00:07', None, 34, None, None, 26, None, None, None)

In [15]: vtep.check_fdb_consistency()
2023.12.27 13:53:43.169 (     hw.py:0649): D: FDB cache: checking consistency with NDB...
2023.12.27 13:53:43.172 (     hw.py:0670): E: NDB FDB: (4, None, '5e:8b:f8:80:f3:37', 1, 4) not found in NDB dump!

Could you please guide me how to debug this behaviour properly? I can reproduce it on demand.

pkulev avatar Dec 27 '23 15:12 pkulev

Thanks for the detailed report, investigating

svinota avatar Jan 06 '24 21:01 svinota

@svinota, any ideas what can cause it, where to look for a problem? It's the last issue i'm facing that stops from using NDB interface, btw %)

pkulev avatar Feb 06 '24 14:02 pkulev

The issue is that I can not reproduce it on the master branch, even using the same values that you provided.

Could you please drop any script that would reproduce the situation? No restrictions on addresses or tools, I'm running a dedicated VM for debugging.

svinota avatar Feb 07 '24 10:02 svinota

Fixed in #1200

pkulev avatar Jul 09 '24 15:07 pkulev