Fix Huawei WLC AP Parsing crash on offline APs
General information
Affected Device: Huawei AC6508 Wireless LAN Controller (WLC)
Software Version: Huawei NMS Version 2.23.00.100.932
CheckMK Version: MSP 2.3.0p12
Issue Summary: The Checkmk check for the Huawei AC6508 WLC crashes when the controller reports an offline Access Point (AP). The crash is caused by an out-of-range access on the aps_info2 list when the AP is no longer available, which causes the check to fail.
Bug reports
Operating System: Debian 12 running Checkmk MSP 2.3.0p12
Steps to Reproduce:
- Run a device discovery on a Huawei AC6508 WLC Controller with disconnected (offline) Access Points.
- Run the attached string_table (test.txt) against the function parse_huawei_wlc_aps.
SNMP walk: The following file contains a full, redacted SNMP walk: Huawei AC6508 WLC snmpwalk.txt
Crash Report ID: 48274138-87a8-11ef-a33a-0050568fc548
Proposed changes
Expected Behavior: When an AP goes offline, CheckMK should handle it gracefully, avoiding any out-of-range memory access. The system should log appropriate warnings or errors rather than attempting to access data from the unavailable AP.
Proposed Patch Change: The patch modifies the AP handling logic to skip further parsing of AP information when the AP is not available. This prevents out-of-range access and stabilizes the system.
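The skip logic described above can be illustrated with a minimal sketch. This is not the actual patch or the real parse_huawei_wlc_aps code: the function name `parse_aps_sketch`, the two-table layout (one status row per AP, one detail row per reachable AP), and the status strings are all assumptions made for illustration.

```python
from typing import Mapping, Sequence

StringTable = Sequence[Sequence[str]]


def parse_aps_sketch(
    string_table: Sequence[StringTable],
) -> Mapping[str, Mapping[str, str]]:
    """Hypothetical parser: pair each online AP with the next detail row."""
    status_rows, detail_rows = string_table
    parsed = {}
    detail_idx = 0
    for name, status in status_rows:
        if status != "online":
            # Assumption: offline APs have no detail row, so skip them
            # without consuming an entry from detail_rows.
            continue
        if detail_idx >= len(detail_rows):
            # Defensive guard: never index past the end of the detail table.
            break
        (clients,) = detail_rows[detail_idx]
        detail_idx += 1
        parsed[name] = {"status": status, "clients": clients}
    return parsed
```

With this layout, an offline AP in the middle of the status table no longer shifts the index into the detail table, and a short detail table ends parsing instead of raising an IndexError.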
All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.
I have read the CLA Document and I hereby sign the CLA or my organization already has a signed CLA.
Hi @throjaisnn, Thanks for creating the pull request! We have a question regarding how filtering out all offline APs affects Checkmk services that monitored APs which were previously online. Could you describe what happens to a service when an AP that was online goes offline?
Hi @msproductmanager As soon as there was an offline AP on the Huawei controller, no AP could be monitored anymore and the CheckMK monitoring module was failing. I don't have access to a similar system anymore, but you should see this exact behaviour in the crash report.
Hi @throjaisnn thank you for your response. Unfortunately, we cannot merge the code as it is because it modifies the function argument in place. Therefore, we will close this pull request without merging it. If possible, please suggest another solution that does not change the function argument. We also need some context on how this would affect the behavior of the services created for the APs that went offline.
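One possible direction for such an alternative, sketched under assumptions rather than taken from the Checkmk code base: read the detail rows through an iterator instead of removing entries from the argument, and keep offline APs in the parsed result so that a service could report the offline state instead of disappearing. The function name `parse_keep_offline`, the table layout, and the status strings are hypothetical.

```python
from typing import Dict, Optional, Sequence

StringTable = Sequence[Sequence[str]]


def parse_keep_offline(
    string_table: Sequence[StringTable],
) -> Dict[str, Dict[str, Optional[str]]]:
    """Hypothetical parser that never modifies string_table."""
    status_rows, detail_rows = string_table
    # An iterator tracks the read position, so nothing is popped or
    # deleted from the caller's list.
    details = iter(detail_rows)
    parsed: Dict[str, Dict[str, Optional[str]]] = {}
    for name, status in status_rows:
        clients: Optional[str] = None
        if status == "online":
            # Assumption: only online APs have a detail row.
            row = next(details, None)  # None if the detail table is too short
            if row:
                clients = row[0]
        # Offline APs stay in the result with their status only.
        parsed[name] = {"status": status, "clients": clients}
    return parsed
```

Because offline APs remain in the result, a check function built on this output could decide explicitly how to report them, which is the service behaviour the question above asks about.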