sonic-swss
sonic-swss copied to clipboard
Fix orch stuch when removing vlan member (#3294)
What I did Fixes #3294
The root cause of this issue is the two data struct of vlan member info in orchagent is not in sync.
Why I did it Fix the bug
How I did it The return value of setPortPvid() is does not matter, so we ignore it.
So we get the m_members" and "m_portVlanMember" in sync all the time.
How I verified it
- enable the swss to debug log level
- paste the configuration that triggered this bug.
sudo config vlan member del 1000 EthernetXX
sudo config interface ip add EthernetXX 1.1.1.1/24
Details if related
When one deletes one interface from a vlan, then makes the interface to router interface(add a ip addr to interface)
+----------+
| bash cli |
+----+-----+
|
+---+
/ CFG \
\ DB /
+---+
|
+-------------+-----------+
| |
+----------+ +----------+
| vlanmgrd | | intfmgrd |
+--+-------+ +----+-----+
| |
+-------------+-----------+
|
+---+
/ APP \
\ DB /
+---+
|
+----+-----+
| orch |
+----------+
- The path of removing vlan member is left path as above.
- The path of creating a router interface is right path as above.
We have no way to ensure that above 2 config arrives at Orch in order. It's possible that the "creating a router interface" arrives at first. If so, then the "removing vlan member" will be failed.
There are 2 data struct storing the vlan member info in orchagent. (They should be in sync all the time.)
-
"class Port" in port.h
std::set<std::string> m_members;
The instance of "class Port" is vlan interface. it's "m_members" is a set of vlan member, like EhternetXX, EthernetYY...
-
"class PortsOrch" in portsorch.h
map<string, vlan_members_t> m_portVlanMember;
The instance of "class PortsOrch" is EthernetXX. It's m_portVlanMember is a map of vlan info, like Vlan100, Vlan200...
Please take a look at PortsOrch::removeVlanMember().
bool PortsOrch::removeVlanMember(Port &vlan, Port &port, string > e d_point_ip) { ........... sai_object_id_t vlan_member_id; sai_vlan_tagging_mode_t sai_tagging_mode; auto vlan_member = m_portVlanMember[port.m_alias].find(vlan.m_lan_info.vlan_id); /* Assert the port belongs to this VLAN */ assert (vlan_member != m_portVlanMember[port.m_alias].end()); sai_tagging_mode = vlan_member->second.vlan_mode; vlan_member_id = vlan_member->second.vlan_member_id; sai_status_t status = sai_vlan_api->remove_vlan_member(vlan_member_id); if (status != SAI_STATUS_SUCCESS) { ........... } m_portVlanMember[port.m_alias].erase(vlan_member); if (m_portVlanMember[port.m_alias].empty()) { m_portVlanMember.erase(port.m_alias); } .......... /* Restore to default pvid if this port joined this VLAN in untagged mode previously */ if (sai_tagging_mode == SAI_VLAN_TAGGING_MODE_UNTAGGED) { if (!setPortPvid(port, DEFAULT_PORT_VLAN_ID)) { return false; } } m_portList[port.m_alias] = port; vlan.m_members.erase(port.m_alias); m_portList[vlan.m_alias] = vlan; ............ return true; }
if setPortPvid() retrun fail(because the port is already a router interface). So the "m_members" and "m_portVlanMember" will be not in sync.
In the next enter of removeVlanMember() with same params. The iterator "vlan_member" is point to the end, but the assert() doesn't work because NOS is relase version.(if NOS is in debug version, the assert() will trigger abort()) When m_portVlanMember[port.m_alias].erase(vlan_member) erase this end iterator, the c++ std lib will be stuck, occupy CPU 100%.