icingaweb2-module-vspheredb

vpxd crashes after too many connections

Open TheRealKingS opened this issue 2 years ago • 2 comments

Expected Behavior

When the API call "retrievePropertiesEx" is made, check if the result is truncated. Use "continueRetrievePropertiesEx" to continue and "cancelRetrievePropertiesEx" to cancel the request

Current Behavior

The API call "retrievePropertiesEx" is made too often, causing vpxd to crash

Possible Solution

After contacting VMware, the solution should be: The fix is to make sure that when they call retrievePropertiesEx they check for whether the result was truncated and if so either use continueRetrievePropertiesEx as many times as needed to retrieve the remainder of the result or use cancelRetrievePropertiesEx to discard it. Details: https://vdc-repo.vmware.com/vmwb-repository/dcr-public/1ef6c336-7bef-477d-b9bb-caa1767d7e30/82521f49-9d9a-42b7-b19b-9e6cd9b30db1/vmodl.query.PropertyCollector.html
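The pattern VMware describes can be sketched roughly as follows. This is only an illustration, not the module's code: `FakePropertyCollector`, its snake_case methods and `iter_objects` are hypothetical stand-ins that mimic the token behaviour of retrievePropertiesEx, continueRetrievePropertiesEx and cancelRetrievePropertiesEx in memory. The point is that every truncated result is either continued until no token is returned or cancelled explicitly.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional


@dataclass
class RetrieveResult:
    """Shape assumed for illustration: a page of objects plus an optional token."""
    objects: List[str]
    token: Optional[str] = None  # only set when the result was truncated


class FakePropertyCollector:
    """Hypothetical in-memory stand-in, NOT a real vSphere client."""

    def __init__(self, items: List[str], page_size: int) -> None:
        self._items = list(items)
        self._page_size = page_size
        self._offsets: Dict[str, int] = {}  # token -> offset of the next page
        self._next_id = 0

    def _page(self, start: int) -> RetrieveResult:
        end = start + self._page_size
        token = None
        if end < len(self._items):
            # Truncated: remember where to continue, as the server would.
            self._next_id += 1
            token = f"token-{self._next_id}"
            self._offsets[token] = end
        return RetrieveResult(self._items[start:end], token)

    def retrieve_properties_ex(self) -> RetrieveResult:
        return self._page(0)

    def continue_retrieve_properties_ex(self, token: str) -> RetrieveResult:
        return self._page(self._offsets.pop(token))

    def cancel_retrieve_properties_ex(self, token: str) -> None:
        self._offsets.pop(token, None)

    @property
    def pending(self) -> int:
        """Number of truncated retrievals still held 'server-side'."""
        return len(self._offsets)


def iter_objects(collector: FakePropertyCollector) -> Iterator[str]:
    """Yield all objects; cancel the outstanding token if the caller stops early."""
    result = collector.retrieve_properties_ex()
    token = result.token
    try:
        while True:
            yield from result.objects
            if token is None:
                return  # no token means the result is complete
            result = collector.continue_retrieve_properties_ex(token)
            token = result.token
    finally:
        if token is not None:
            collector.cancel_retrieve_properties_ex(token)
```

Consuming the generator completely leaves `collector.pending` at 0. Calling `close()` on a partially consumed generator triggers the cancel branch, so no truncated retrieval is left behind in this sketch.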

Steps to Reproduce (for bugs)

Just keep the daemon running and wait for the results.

Your Environment

  • VMware vCenter®/ESXi™-Version: 7.0U3d
  • Version/GIT-Hash of this module: 1.4.0 / 0e7eda67c27d9d76920ca61982ae76604059fdb5
  • Icinga Web 2 version: 2.10.1 / 974729a6421c17fdb8bb1931623107cf6a90fc7e
  • Operating System and version: CentOS 7.9 / RHEL 7
  • Webserver, PHP versions: Apache/2.4.6, PHP 7.3.29

TheRealKingS, May 19 '22 06:05

> When the API call "retrievePropertiesEx" is made, check if the result is truncated. Use "continueRetrievePropertiesEx" to continue and "cancelRetrievePropertiesEx" to cancel the request

That's correct, that's how we modeled our related code.

> The API call "retrievePropertiesEx" is made too often, causing vpxd to crash

While I believe that your vpxd crashes, I do not (yet) understand how this could be triggered by our module. There is only one single place in our code calling RetrievePropertiesEx. If you follow the logic in that file, you'll see that fetchFullResult() leads to ContinueRetrievePropertiesEx calls as long as the RetrieveResult has more results, which is a simple check for a token in the result.

To me, this looks exactly like what has been asked for:

> ...check for whether the result was truncated and if so either use continueRetrievePropertiesEx as many times as needed to retrieve the remainder of the result or use cancelRetrievePropertiesEx to discard it...

We never call CancelRetrievePropertiesEx, as we always want to fetch the full result.

To track this down, please take the following steps:

  • in case you're sharing the very same VMware user among multiple tools (e.g. Icinga check commands), please create a dedicated user for the Icinga vSphereDB module, configure it, and verify whether the problem is still triggered by this user
  • stop the background daemon and run it in the foreground: icingacli vspheredb daemon run --debug --trace. Keep it running for at least 15 minutes and share the (anonymized) log output. I'd like to check whether there are any jobs starting and not returning at all
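For the anonymization step, a minimal sketch could look like the following. It is an assumption about what "anonymized" needs to cover: it only masks IPv4 addresses, the `anonymize` helper is hypothetical, and hostnames, user names, or other identifiers would need their own handling. Each address is mapped to a stable placeholder so that related log lines can still be correlated.

```python
import re
from typing import Dict

# Matches dotted quads; note that it would also match a four-part version
# string such as "1.2.3.4", so review the output before sharing it.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def anonymize(line: str, mapping: Dict[str, str]) -> str:
    """Replace each IPv4 address with a placeholder that stays stable per address."""

    def replace(match: "re.Match[str]") -> str:
        ip = match.group(0)
        if ip not in mapping:
            mapping[ip] = f"<ip-{len(mapping) + 1}>"
        return mapping[ip]

    return IPV4.sub(replace, line)
```

Passing the same `mapping` dictionary for every line of the log keeps the placeholders consistent across the whole file.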

Thanks, Thomas

Thomas-Gelf, May 19 '22 06:05

@Thomas-Gelf Hey Thomas,

We had the same issue and opened a ticket with VMware through our premium support contract. Here is the answer from VMware itself:

"It appears that Icinga is using a single property collector with a huge number of retrievers, thus hogging memory, which is most probably the reason for vpxd running OOM.

Please ask the customer to turn off their Icinga Monitoring on 10.10.183.216 and to monitor if vpxd keeps running stable. If that fixes the issue, please advise the customer to involve the Icinga support.

In the unexpected case that this does however not help, please provide a PR template."

The way you parse the data causes an OOM, and because of this vpxd crashes. In larger environments this plugin causes massive problems in vSphere 7.x+.

This needs to be fixed.

Greetings

Lukas

lokidaibel, Sep 13 '22 12:09