sonic-swss
[Flex counter] A buffer pool object can be removed before its counter is removed even if orchagent removes the counter first
Description
A buffer pool object can be removed before its counter is removed, even if orchagent removes the counter first. This defect can occur on any object that has a counter attached, because orchagent notifies sairedis of the object removal and the counter removal via different channels: the object removal goes through the ASIC_DB channel, while the counter removal goes through FLEX_COUNTER_DB. There is no mechanism to preserve ordering between orchagent (OA) and sairedis across the two channels.
This issue is very similar to sonic-net/sonic-buildimage#14628 which is for the RIF object.
Steps to reproduce the issue:
It's a rare case for the buffer pool object. We observed it once when the zero buffer pool (used for reclaiming buffer) was removed just after a system warm reboot, once all ports had been started up. The zero buffer pool was removed:
2023-12-16.04:32:29.911961|BUFFER_POOL_TABLE:ingress_zero_pool|DEL
According to the code, the counter is removed first, then the object:
else if (op == DEL_COMMAND)
{
    ...
    if (SAI_NULL_OBJECT_ID != sai_object)
    {
        clearBufferPoolWatermarkCounterIdList(sai_object);
        sai_status = sai_buffer_api->remove_buffer_pool(sai_object);
        if (SAI_STATUS_SUCCESS != sai_status)
        {
            SWSS_LOG_ERROR("Failed to remove buffer pool %s with type %s, rv:%d", object_name.c_str(), map_type_name.c_str(), sai_status);
            task_process_status handle_status = handleSaiRemoveStatus(SAI_API_BUFFER, sai_status);
            if (handle_status != task_process_status::task_success)
            {
                return handle_status;
            }
        }
        SWSS_LOG_NOTICE("Removed buffer pool %s with type %s", object_name.c_str(), map_type_name.c_str());
    }
In clearBufferPoolWatermarkCounterIdList, the counter entry is deleted from FLEX_COUNTER_DB:
void BufferOrch::clearBufferPoolWatermarkCounterIdList(const sai_object_id_t object_id)
{
    if (m_isBufferPoolWatermarkCounterIdListGenerated)
    {
        string key = BUFFER_POOL_WATERMARK_STAT_COUNTER_FLEX_COUNTER_GROUP ":" + sai_serialize_object_id(object_id);
        m_flexCounterTable->del(key);
    }
}
But the log shows that the counter was still accessed, and removed by syncd itself, after the buffer pool had already been removed:
Dec 16 06:32:30.469317 r-spider-05 INFO syncd#SDK: :- processSingleEvent: key: SAI_OBJECT_TYPE_BUFFER_POOL:oid:0x18000000000a3d op: remove
Dec 16 06:32:30.469317 r-spider-05 NOTICE syncd#SDK: [SAI_BUFFER.NOTICE] ./src/mlnx_sai_buffer.c[2221]- mlnx_sai_remove_buffer_pool: Remove BUFFER_POOL [OID:0x400000018] [sx_cos_pool_id:4]
Dec 16 06:32:30.470509 r-spider-05 INFO syncd#SDK: :- sendApiResponse: sending response for SAI_COMMON_API_REMOVE api with status: SAI_STATUS_SUCCESS
Dec 16 06:32:30.722009 r-spider-05 INFO syncd#SDK: :- tryTranslateVidToRid: unable to get RID for VID oid:0x18000000000a3d
Dec 16 06:32:30.722061 r-spider-05 WARNING syncd#SDK: :- processFlexCounterEvent: port VID oid:0x18000000000a3d, was not found (probably port was removed/splitted) and will remove from counters now
Describe the results you received:
The flex counter of the removed buffer pool was still accessed after the buffer pool object had been removed; syncd failed to translate the VID to a RID and removed the counter itself (see the log above).
Describe the results you expected:
The counter is removed before the buffer pool object, in the order orchagent issued the removals.
Output of show version:
SONiC Software Version: SONiC.202305_RC.51-6416e238c_Internal
SONiC OS Version: 11
Distribution: Debian 11.8
Kernel: 5.10.0-23-2-amd64
Build commit: 6416e238c
Build date: Thu Dec 14 04:28:16 UTC 2023
Built by: sw-r2d2-bot@r-build-sonic-ci02-242
Platform: x86_64-mlnx_msn2410-r0
HwSKU: ACS-MSN2410
ASIC: mellanox
ASIC Count: 1
Serial Number: MT1921X01546
Model Number: MSN2410-CB2FO
Hardware Revision: A2
Uptime: 06:42:44 up 4 min, 1 user, load average: 2.51, 3.07, 1.58
Date: Sat 16 Dec 2023 06:42:44
Docker images:
REPOSITORY TAG IMAGE ID SIZE
docker-syncd-mlnx 202305_RC.51-6416e238c_Internal a16b904ebab1 838MB
docker-syncd-mlnx latest a16b904ebab1 838MB
docker-platform-monitor 202305_RC.51-6416e238c_Internal 82275a00b244 829MB
docker-platform-monitor latest 82275a00b244 829MB
docker-dhcp-relay latest 1e4780b04384 308MB
docker-macsec latest 7987aec2df36 320MB
docker-eventd 202305_RC.51-6416e238c_Internal dcceb37f9932 300MB
docker-eventd latest dcceb37f9932 300MB
docker-teamd 202305_RC.51-6416e238c_Internal e41fe22b368e 318MB
docker-teamd latest e41fe22b368e 318MB
docker-orchagent 202305_RC.51-6416e238c_Internal f147bf2f5dc4 330MB
docker-orchagent latest f147bf2f5dc4 330MB
docker-fpm-frr 202305_RC.51-6416e238c_Internal d699f298db1e 350MB
docker-fpm-frr latest d699f298db1e 350MB
docker-nat 202305_RC.51-6416e238c_Internal 2b56e4943d9d 321MB
docker-nat latest 2b56e4943d9d 321MB
docker-sflow 202305_RC.51-6416e238c_Internal 8fa0c1ba3454 320MB
docker-sflow latest 8fa0c1ba3454 320MB
docker-sonic-telemetry 202305_RC.51-6416e238c_Internal 3a0d24f463e1 387MB
docker-sonic-telemetry latest 3a0d24f463e1 387MB
docker-snmp 202305_RC.51-6416e238c_Internal 878593cde6f1 340MB
docker-snmp latest 878593cde6f1 340MB
docker-lldp 202305_RC.51-6416e238c_Internal 964407cc4c7b 343MB
docker-lldp latest 964407cc4c7b 343MB
docker-database 202305_RC.51-6416e238c_Internal 9fdbdd6b2caf 301MB
docker-database latest 9fdbdd6b2caf 301MB
docker-mux 202305_RC.51-6416e238c_Internal e4f0d9b05c52 349MB
docker-mux latest e4f0d9b05c52 349MB
docker-router-advertiser 202305_RC.51-6416e238c_Internal 409706828615 301MB
docker-router-advertiser latest 409706828615 301MB
docker-sonic-mgmt-framework 202305_RC.51-6416e238c_Internal 2a36085b813c 416MB
docker-sonic-mgmt-framework latest 2a36085b813c 416MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/sonic-wjh 1.6.0-202305-25 3e820a00274a 433MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/doai 1.1.0-202305-36 f3755210d1c0 276MB
Output of show techsupport: