freeipa-healthcheck

Intermittent replication errors when running ipa-healthcheck

Open Kivernitas opened this issue 2 years ago • 9 comments

Issue

Intermittent replication errors when running ipa-healthcheck. Running ipa-healthcheck every x minutes gives unreliable ReplicationCheck results. From what I've read on https://access.redhat.com/solutions/359683, getting a "replica is busy" message is considered "normal". This makes it difficult to monitor for actual replication errors.

Actual behavior

  {
    "source": "ipahealthcheck.ds.replication",
    "check": "ReplicationCheck",
    "result": "ERROR",
    "uuid": "94548c4b-ca49-4f8a-bd2e-1953fba9f767",
    "when": "20230103141508Z",
    "duration": "0.304435",
    "kw": {
      "key": "DSREPLLE0003",
      "items": [
        "Replication",
        "Agreement"
      ],
      "msg": "The replication agreement (ipa-2.test.io-to-ipa-3.test.io) under \"dc=test,dc=io\" is not in synchronization.\nStatus message: error (1) can't acquire busy replica (unable to acquire replica: the replica is currently being updated by another supplier.)"
    }
  }

Errors similar to the one above occur intermittently on every FreeIPA server in a 3-node cluster. Most of the time there are no replication errors.

Expected behavior

It should not report an error. A warning would be more suitable.

Version/Release/Distribution

Rocky Linux 8.6
Source: ipa-healthcheck-0.7-14.module+el8.7.0+1075+05db0c1d.src.rpm (latest available)
FreeIPA: 4.9

Kivernitas, Jan 03 '23 15:01

This check is provided by 389 itself. I suppose we could consider reducing the severity to WARNING but I'd leave that as a call to them. @mreynolds389 what do you think?

rcritten, Jan 05 '23 18:01

Well, it is a transient error. Replication is just busy at that time. If you run it again in a few seconds, it will probably pass. On the 389 DS side, we already set it to "medium" severity.
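For monitoring, the "run it again in a few seconds" advice could be automated along these lines. This is only a minimal sketch, not something from ipa-healthcheck or 389 DS: the helper names (`is_busy_replica`, `run_replication_check`, `check_with_retry`), the retry count, and the delay are all assumptions, and it assumes that `ipa-healthcheck` prints a JSON list of results to stdout when called with `--source`, `--check`, and `--output-type json`.

```python
import json
import subprocess
import time

# Substring taken from the status message quoted in this issue.
BUSY_MARKER = "can't acquire busy replica"


def is_busy_replica(result):
    """Return True if a result is the transient 'busy replica' failure."""
    msg = result.get("kw", {}).get("msg", "")
    return result.get("check") == "ReplicationCheck" and BUSY_MARKER in msg


def run_replication_check():
    """Run only the DS replication check and return the parsed results.

    Assumption: results are printed to stdout as a JSON list. The exit
    status is non-zero when failures are reported, so it is not checked.
    """
    proc = subprocess.run(
        ["ipa-healthcheck",
         "--source", "ipahealthcheck.ds.replication",
         "--check", "ReplicationCheck",
         "--output-type", "json"],
        capture_output=True, text=True, check=False,
    )
    return json.loads(proc.stdout or "[]")


def check_with_retry(run=run_replication_check, attempts=3, delay=10.0,
                     sleep=time.sleep):
    """Re-run the check while the only problem seen is a busy replica.

    Returns the results of the last run. Any non-busy failure is returned
    immediately so that real replication errors are not delayed or hidden.
    """
    results = []
    for attempt in range(attempts):
        results = run()
        failures = [r for r in results if r.get("result") != "SUCCESS"]
        if not any(is_busy_replica(r) for r in failures):
            return results
        if attempt < attempts - 1:
            sleep(delay)
    return results
```

If the busy-replica message is still present after the last attempt, the results are returned unchanged, so a replica that stays "busy" across several runs still shows up as an error.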

mreynolds389, Jan 05 '23 20:01

Thanks both for replying!

Yes, it's a transient error. We run ipahealthcheck_exporter, which basically scrapes ipa-healthcheck logs every 5 minutes. Can you suggest an alternative way of verifying replication health?

@mreynolds389 you mentioned you set it to "medium" severity. Could I ask how?

Kivernitas, Jan 09 '23 09:01

Well, IPA is using DS's lib389 library for the DS healthchecks. IPA does not use DS's healthcheck severity level; it is ignored because these are basically two tools that were merged.

mreynolds389, Jan 09 '23 12:01

@rcritten Since IPA does not use DS's healthcheck severity level, could this check's severity level be lowered to WARNING in IPA?

rexberg, May 15 '24 07:05

healthcheck doesn't ignore the DS severity. It converts it. See https://github.com/freeipa/freeipa-healthcheck/issues/283#issuecomment-2111803800

"medium" from DS is converted into an ipa-healthcheck ERROR severity.

rcritten, May 15 '24 12:05

Thanks for clarifying. Do we want to set this specific check's severity to WARNING, bypassing the conversion? As mentioned, it is a transient error, but it still triggers an ERROR severity.

rexberg, May 15 '24 14:05

I suppose it's possible, but it would be an ugly one-off. healthcheck has a rather thin wrapper that calls the 389 checks and then re-formats the return value. It's very generic code. It would be invasive to put in a test for a specific check.

rcritten, May 15 '24 17:05

I looked at the code and had assumed as much, and I tend to agree. Currently we exclude this specific check since we can't really "trust" the ERROR trigger.
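A narrower alternative to excluding the whole check might be to post-process the JSON output outside of healthcheck and relabel only the busy-replica case. The following is a rough sketch under that assumption, not code from ipa-healthcheck: the function name `downgrade_busy_replica` is hypothetical, and it matches on the `DSREPLLE0003` key and the "can't acquire busy replica" text as they appear in the output quoted in this issue.

```python
import copy

# Key and message fragment as they appear in the result quoted in this issue.
BUSY_KEY = "DSREPLLE0003"
BUSY_MARKER = "can't acquire busy replica"


def downgrade_busy_replica(results):
    """Return a copy of the results with busy-replica ERRORs set to WARNING.

    Only results that carry both the DSREPLLE0003 key and the busy-replica
    message are changed; every other result, including other replication
    errors reported under the same key, is passed through untouched.
    """
    relabeled = []
    for result in results:
        result = copy.deepcopy(result)  # leave the caller's data unmodified
        kw = result.get("kw", {})
        if (result.get("result") == "ERROR"
                and kw.get("key") == BUSY_KEY
                and BUSY_MARKER in kw.get("msg", "")):
            result["result"] = "WARNING"
        relabeled.append(result)
    return relabeled
```

A monitoring pipeline could apply this to the parsed list of results (for example, the output of `json.load` on a saved healthcheck report) before alerting on ERROR or higher. Matching on the message text is fragile if the 389 DS wording changes, which is the same one-off trade-off discussed above, only moved out of healthcheck itself.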

rexberg, May 16 '24 11:05