
Allow mk_oracle to run async for multiple SID

Open gradecke opened this issue 3 years ago • 0 comments

General information

mk_oracle.ps1 runs certain sections asynchronously. To make sure it doesn't "catch up" with its own background async jobs, it saves the PID of the async job and checks whether that PID is still running. This was introduced/fixed in https://checkmk.com/de/werk/11813

Bug reports

On systems with multiple SIDs, only the first database's async job is started. When the second database reaches the async code, the plugin sees that the PID of the async process is still running and doesn't start a job for it. Depending on the cache settings in check_mk.yml and the async cache settings in mk_oracle_cfg.ps1, this means only the first 1 or 2 databases ever run their async sections. The fix keeps a separate async file for each SID. The mk_oracle.ps1 plugin keeps track of each SID's async sections and builds a new cache once the cache interval (600s by default) has expired.

This behaviour has been observed on Windows Server 2019 with a system running 8 Oracle DBs (some Oracle 12, some 19), but I don't expect the Oracle version or the Windows Server version to matter.

To reproduce this bug, use a system with 2 or more databases and configure a check interval for the "Check_MK" service that is above $mk_oracle_default_async / $number_of_SIDs, i.e. 600s / 8 = 75s in our case. Our check interval was set to 5 minutes, but even 2 minutes would have been enough for 3 databases to never run their async sections.
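
To make the arithmetic above easier to check, here is a small Python model of the behaviour. It is only a sketch, not the plugin's code: the function name `simulate` and its parameters are made up for illustration, and it assumes that an async job started during one agent run is still running while the remaining SIDs of that same run are processed, but has finished before the next run.

```python
def simulate(num_sids, check_interval, cache_age=600,
             per_sid_pid_file=False, runs=50):
    """Return the indices of the SIDs that ever get an async job started.

    Hypothetical model: one agent run every `check_interval` seconds,
    SIDs are always processed in the same order.
    """
    last_refresh = [None] * num_sids  # start time of the last async job per SID
    served = set()
    for run in range(runs):
        now = run * check_interval
        job_running = False  # models the PID stored in the shared PID file
        for sid in range(num_sids):
            expired = (last_refresh[sid] is None
                       or now - last_refresh[sid] >= cache_age)
            if not expired:
                continue  # cache still fresh, nothing to do for this SID
            if job_running and not per_sid_pid_file:
                continue  # shared PID is still alive, so this SID is skipped
            last_refresh[sid] = now
            served.add(sid)
            job_running = True
    return sorted(served)


if __name__ == "__main__":
    for interval in (300, 120, 75):
        print(interval, simulate(8, interval))
    print("per-SID PID files:", simulate(8, 300, per_sid_pid_file=True))
```

Under these assumptions, a 5 minute interval only ever serves the first 2 of 8 SIDs, a 2 minute interval serves 5 and starves 3, and with one PID file per SID all 8 are served regardless of the interval.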

Proposed changes

The suggested fix creates a separate async PID file for each SID. This has the possible disadvantage that multiple async processes will run in parallel. On our system we observed only a slight increase in memory use and no measurable increase in CPU usage. I'm not 100% sure whether that will always be the case, or whether smarter logic to distribute the load or to queue the async calls for sequential execution is necessary, but the current solution doesn't work for bigger systems.
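
The per-SID decision can be sketched roughly as follows. This is an illustration in Python rather than the PowerShell from the actual change; the file naming scheme, the function names, and the injected `pid_is_running` check are all assumptions, not what mk_oracle.ps1 does verbatim.

```python
from pathlib import Path
from typing import Callable, Optional


def async_pid_file(state_dir: Path, sid: str) -> Path:
    # Hypothetical naming scheme: one PID file per SID instead of a shared one.
    return state_dir / f"mk_oracle.async.{sid}.pid"


def should_start_async(state_dir: Path, sid: str,
                       cache_age_seconds: Optional[float],
                       pid_is_running: Callable[[int], bool],
                       cache_max_age: float = 600) -> bool:
    """Decide whether an async job should be started for this SID."""
    pid_file = async_pid_file(state_dir, sid)
    if pid_file.exists():
        text = pid_file.read_text().strip()
        if text.isdigit() and pid_is_running(int(text)):
            return False  # this SID's own async job is still running
    # No job running for this SID: start one if there is no cache yet
    # (age is None) or the cache has reached its maximum age.
    return cache_age_seconds is None or cache_age_seconds >= cache_max_age
```

Because the PID file is looked up per SID, a job that is still running for one database no longer blocks the async sections of the others.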

gradecke, Mar 02 '22 19:03