icingaweb2-module-vspheredb
vpxd crashes after too many connections
Expected Behavior
When the API call "retrievePropertiesEx" is made, check if the result is truncated. Use "continueRetrievePropertiesEx" to continue and "cancelRetrievePropertiesEx" to cancel the request
Current Behavior
The API call "retrievePropertiesEx" is made too often, causing vpxd to crash
Possible Solution
After contacting VMware, the solution should be: The fix is to make sure that when they call retrievePropertiesEx they check for whether the result was truncated and if so either use continueRetrievePropertiesEx as many times as needed to retrieve the remainder of the result or use cancelRetrievePropertiesEx to discard it. Details: https://vdc-repo.vmware.com/vmwb-repository/dcr-public/1ef6c336-7bef-477d-b9bb-caa1767d7e30/82521f49-9d9a-42b7-b19b-9e6cd9b30db1/vmodl.query.PropertyCollector.html
Steps to Reproduce (for bugs)
Just keep the daemon running and wait for the results
Your Environment
- VMware vCenter®/ESXi™-Version: 7.0U3d
- Version/GIT-Hash of this module: 1.4.0 / 0e7eda67c27d9d76920ca61982ae76604059fdb5
- Icinga Web 2 version: 2.10.1 / 974729a6421c17fdb8bb1931623107cf6a90fc7e
- Operating System and version: CentOS 7.9 / RHEL 7
- Webserver, PHP versions: Apache/2.4.6, PHP 7.3.29
When the API call "retrievePropertiesEx" is made, check if the result is truncated. Use "continueRetrievePropertiesEx" to continue and "cancelRetrievePropertiesEx" to cancel the request
That's correct, that's how we modeled our related code.
The API call "retrievePropertiesEx" is made too often, causing vpxd to crash
While I believe that your vpxd crashes, I do not (yet) understand how this could be triggered by our module. There is only a single place in our code that calls RetrievePropertiesEx. If you follow the logic in that file, you'll see that fetchFullResult() issues ContinueRetrievePropertiesEx calls for as long as the RetrieveResult has more results, which is a simple check for a token in the result.
To me, this looks exactly like what has been asked for:
...check for whether the result was truncated and if so either use continueRetrievePropertiesEx as many times as needed to retrieve the remainder of the result or use cancelRetrievePropertiesEx to discard it...
We never call CancelRetrievePropertiesEx, as we always want to fetch the full result.
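To illustrate the pattern described above, here is a minimal sketch in Python. It is not the module's actual PHP code: `FakePropertyCollector`, its method names, and `fetch_full_result` are hypothetical stand-ins, and only the `objects` and `token` fields follow the vSphere `RetrieveResult` type. It shows one initial RetrievePropertiesEx call followed by ContinueRetrievePropertiesEx calls until no token is returned.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RetrieveResult:
    """Mirrors the two fields of the vSphere RetrieveResult type."""
    objects: List[str]
    token: Optional[str] = None  # present only when the result was truncated


@dataclass
class FakePropertyCollector:
    """Hypothetical in-memory stand-in for a PropertyCollector (illustration only).

    Assumes at least one page is configured.
    """
    pages: List[List[str]]
    calls: List[str] = field(default_factory=list)

    def _page(self, index: int) -> RetrieveResult:
        # Hand out a token only if another page is still pending.
        has_more = index + 1 < len(self.pages)
        return RetrieveResult(self.pages[index], str(index + 1) if has_more else None)

    def retrieve_properties_ex(self) -> RetrieveResult:
        self.calls.append("RetrievePropertiesEx")
        return self._page(0)

    def continue_retrieve_properties_ex(self, token: str) -> RetrieveResult:
        self.calls.append("ContinueRetrievePropertiesEx")
        return self._page(int(token))


def fetch_full_result(collector: FakePropertyCollector) -> List[str]:
    """Fetch the first chunk, then keep continuing while a token is returned."""
    result = collector.retrieve_properties_ex()
    objects = list(result.objects)
    while result.token is not None:
        result = collector.continue_retrieve_properties_ex(result.token)
        objects.extend(result.objects)
    return objects
```

Because the loop always runs until the token is gone, no truncated result is left pending on the server side, so in this sketch there is nothing left to cancel.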
To track this down, please take the following steps:
- in case you're sharing the very same VMware user among multiple tools (e.g. Icinga check commands), please create a dedicated user for the Icinga vSphereDB module, configure it, and verify whether the problem is now being triggered by this user
- stop the background daemon and run it in the foreground:
icingacli vspheredb daemon run --debug --trace
Keep it running for at least 15 minutes and share the (anonymized) log output. I'd like to check whether there are any jobs starting and not returning at all
Thanks, Thomas
@Thomas-Gelf Hey Thomas,
we had the same issue and opened a ticket with VMware through our premium support contract. Here is the answer from VMware itself:
"It appears that Icinga is using a single property collector with a huge number of retrievers, thus hogging memory, which is most probably the reason for vpxd running OOM.
Please ask the customer to turn off their Icinga Monitoring on 10.10.183.216 and to monitor if vpxd keeps running stable. If that fixes the issue, please advise the customer to involve the Icinga support.
In the unexpected case that this does however not help, please provide a PR template."
The way you parse the data causes an OOM, and because of this vpxd crashes. In larger environments this plugin causes massive problems with vSphere 7.x and later.
This needs to be fixed.
Greetings
Lukas