omd icon indicating copy to clipboard operation
omd copied to clipboard

Since upgrading to OMD 5.10 .pnp-internal throws intermittant errors

Open infraweavers opened this issue 2 years ago • 15 comments

Not dug into this yet, however we run check_pnp_rrds against our OMD boxes and this is throwing:

WARNING: 3764 XML Files checked. 1 RRD Errors found. 0 old XML Files found
.../.pnp-internal/runtime.xml rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1676988835.000000 when last update time is 1677053640.000000 (minimum one second step)
/omd/sites/default/local/lib/nagios/plugins/check_pnp_rrds.pl -a 1 --rrdpath /media/omd/default/perfdata

is the command we use on the OMD site called default

Normally when something like this happens it's because a check is returning identically keyed perfdata like blah=1; blah=1 so I'll rule that out before digging any further.

This didn't happen under OMD 4.40, it may have been happening on the 5.0 nightly but I'm not 100% sure

We run the following OMD configure on create:

    omd config set GEARMAND on
    omd config set GEARMAND_PORT 0.0.0.0:4730
    omd config set GEARMAN_WORKER on
    omd config set LIVESTATUS_TCP on
    omd config set LIVESTATUS_TCP_PORT 6557
    omd config set MOD_GEARMAN on
    omd config set PNP4NAGIOS gearman
    omd config set THRUK_COOKIE_AUTH off
    omd config set GRAFANA on

I'm 99% sure this is related to pnp4nagios being in gearman mode now

infraweavers avatar Feb 22 '23 08:02 infraweavers

Hmm, I've not had it happen yet on a box I'm actively watching but it looks like sometimes there are multiple updates happening in close succession, sometimes. It looks like the updates actually happen randomly o.0 ; I wonder if there's a chance they collide sometimes:

OMD[default@OMDA02]:~/var/pnp4nagios/log$ tail -f perfdata.log  | grep pnp-internal

tail: perfdata.log: file truncated
2023-02-22 13:11:00 [482076] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677070464:0.036293
2023-02-22 13:11:00 [482076] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677070464:2
2023-02-22 13:11:00 [482076] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677070464:0
2023-02-22 13:11:00 [482076] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677070464:0
2023-02-22 13:11:00 [482076] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677070464:0
2023-02-22 13:11:00 [482076] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677070464:2
2023-02-22 13:11:00 [482076] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677070464:0
tail: perfdata.log: file truncated
tail: perfdata.log: file truncated
2023-02-22 13:26:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677070464:0.002062
2023-02-22 13:26:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677070464:1
2023-02-22 13:26:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677070464:0
2023-02-22 13:26:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677070464:0
2023-02-22 13:26:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677070464:0
2023-02-22 13:26:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677070464:1
2023-02-22 13:26:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677070464:0
2023-02-22 13:27:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677070464:0.017532
2023-02-22 13:27:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677070464:1
2023-02-22 13:27:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677070464:0
2023-02-22 13:27:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677070464:0
2023-02-22 13:27:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677070464:0
2023-02-22 13:27:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677070464:58
2023-02-22 13:27:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677070464:0

infraweavers avatar Feb 22 '23 13:02 infraweavers

Think it's happening way more frequently and those logs are actually all the errors as the full logs look like:

2023-02-22 13:58:00 [491420] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677070464:0.002735
2023-02-22 13:58:00 [491420] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677070464.000000 when last update time is 1677074160.000000 (minimum one second step)
2023-02-22 13:58:00 [491420] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677070464:1
2023-02-22 13:58:00 [491420] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677070464.000000 when last update time is 1677074160.000000 (minimum one second step)
2023-02-22 13:58:00 [491420] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677070464:0
2023-02-22 13:58:00 [491420] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677070464.000000 when last update time is 1677074160.000000 (minimum one second step)
2023-02-22 13:58:00 [491420] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677070464:0
2023-02-22 13:58:00 [491420] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677070464.000000 when last update time is 1677074160.000000 (minimum one second step)
2023-02-22 13:58:00 [491420] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677070464:0
2023-02-22 13:58:00 [491420] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677070464.000000 when last update time is 1677074160.000000 (minimum one second step)
2023-02-22 13:58:00 [491420] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677070464:4
2023-02-22 13:58:00 [491420] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677070464.000000 when last update time is 1677074160.000000 (minimum one second step)
2023-02-22 13:58:00 [491420] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677070464:0
2023-02-22 13:58:00 [491420] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677070464.000000 when last update time is 1677074160.000000 (minimum one second step)
2023-02-22 13:58:00 [491420] [1] Gearman job end (runtime 0.000734s) ...

There are 2 perfdata gearman workers running so I think this is when they collide or something.

According to ps aux, there is actually 3 running, which I wasn't expecting:

OMD[default@OMDA01]:~$ ps aux | grep pnp
default    58580  0.0  0.1  75892 16420 ?        Ss   Feb21   0:00 /usr/bin/perl /omd/sites/default/lib/pnp4nagios/process_perfdata.pl --pidfile=/omd/sites/default/tmp/run/pnp_gearman_worker.pid --config=/omd/sites/default/etc/pnp4nagios/process_perfdata.cfg --gearman=0.0.0.0:4730 --daemon
default  3125308  6.8  0.1  82812 25452 ?        S    14:02   0:26 /usr/bin/perl /omd/sites/default/lib/pnp4nagios/process_perfdata.pl --pidfile=/omd/sites/default/tmp/run/pnp_gearman_worker.pid --config=/omd/sites/default/etc/pnp4nagios/process_perfdata.cfg --gearman=0.0.0.0:4730 --daemon
default  3158038  6.6  0.1  82804 25496 ?        S    14:05   0:12 /usr/bin/perl /omd/sites/default/lib/pnp4nagios/process_perfdata.pl --pidfile=/omd/sites/default/tmp/run/pnp_gearman_worker.pid --config=/omd/sites/default/etc/pnp4nagios/process_perfdata.cfg --gearman=0.0.0.0:4730 --daemon

infraweavers avatar Feb 22 '23 14:02 infraweavers

I'm going to change ~/etc/pnp4nagios/process_perfdata.cfg and set PREFORK=1 instead of the default (2) and see if that clears it up.

Edit: It does not seem to

infraweavers avatar Feb 22 '23 14:02 infraweavers

Hmm, that doesn't seem to have solved it; although we do also get it intermittantly for a few other checks and always seems to be the same ones, but the most common is definitely the .pnp-internal one

2023-02-22 14:54:00 [919114] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677075707:0.030777
2023-02-22 14:54:00 [919114] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677075707.000000 when last update time is 1677077520.000000 (minimum one second step)
2023-02-22 14:54:00 [919114] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677075707:1
2023-02-22 14:54:00 [919114] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677075707.000000 when last update time is 1677077520.000000 (minimum one second step)
2023-02-22 14:54:00 [919114] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677075707:0
2023-02-22 14:54:00 [919114] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677075707.000000 when last update time is 1677077520.000000 (minimum one second step)
2023-02-22 14:54:00 [919114] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677075707:0
2023-02-22 14:54:00 [919114] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677075707.000000 when last update time is 1677077520.000000 (minimum one second step)
2023-02-22 14:54:00 [919114] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677075707:0
2023-02-22 14:54:00 [919114] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677075707.000000 when last update time is 1677077520.000000 (minimum one second step)
2023-02-22 14:54:00 [919114] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677075707:8
2023-02-22 14:54:00 [919114] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677075707.000000 when last update time is 1677077520.000000 (minimum one second step)
2023-02-22 14:54:00 [919114] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677075707:0
2023-02-22 14:54:00 [919114] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677075707.000000 when last update time is 1677077520.000000 (minimum one second step)

infraweavers avatar Feb 22 '23 15:02 infraweavers

I'm swapping the mod_gearman_worker back to the version from 4.40 to investigate a different problem we've been seeing on a few boxes, so I'm hoping that'll rule that out (I don't think there's any overlap between those 2 problems anyway)

infraweavers avatar Feb 24 '23 10:02 infraweavers

Just caught this with LOG_LEVEL = 2; there's only 6 seconds between updates, which is odd. I'm wondering now if this is:

  • "normal" but our .rrd's are ancient and are misconfigured for legacy reasons or something (we've had the same data for years)
  • something has changed around how often this happens

Log snippet:

2023-02-24 15:00:54 [3861809] [2] data2rrd called
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/RADA01/Windows_Internal_Database_Memory_Usage_WID-sqlservr_(Bytes)__.rrd 1677250853:2366865408.000000
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/RADA01/Windows_Internal_Database_Memory_Usage_WID-sqlservr_(Bytes)__.rrd updated
2023-02-24 15:00:54 [3861809] [2] data2rrd called
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677250740:30.195321
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd updated
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677250740:1257
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd updated
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677250740:0
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd updated
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677250740:0
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd updated
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677250740:0
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd updated
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677250740:19927
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd updated
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677250740:0
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd updated
2023-02-24 15:00:54 [3861809] [1] Gearman job end (runtime 0.017263s) ...
2023-02-24 15:00:54 [3861809] [1] Gearman Worker Job start

....

2023-02-24 15:01:00 [3861809] [2] data2rrd called
2023-02-24 15:01:00 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677250378:0.017263
2023-02-24 15:01:00 [3861809] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677250378:0.017263
2023-02-24 15:01:00 [3861809] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677250378.000000 when last update time is 1677250740.000000 (minimum one second step)
2023-02-24 15:01:00 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677250378:1
2023-02-24 15:01:00 [3861809] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677250378:1
2023-02-24 15:01:00 [3861809] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677250378.000000 when last update time is 1677250740.000000 (minimum one second step)
2023-02-24 15:01:00 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677250378.000000 when last update time is 1677250740.000000 (minimum one second step)
2023-02-24 15:01:00 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677250378.000000 when last update time is 1677250740.000000 (minimum one second step)
2023-02-24 15:01:00 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677250378.000000 when last update time is 1677250740.000000 (minimum one second step)
2023-02-24 15:01:00 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677250378:1
2023-02-24 15:01:00 [3861809] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677250378:1
2023-02-24 15:01:00 [3861809] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677250378.000000 when last update time is 1677250740.000000 (minimum one second step)
2023-02-24 15:01:00 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677250378.000000 when last update time is 1677250740.000000 (minimum one second step)
2023-02-24 15:01:00 [3861809] [1] Gearman job end (runtime 0.001557s) ...
2023-02-24 15:01:00 [3861809] [1] Gearman Worker Job start

...

infraweavers avatar Feb 24 '23 15:02 infraweavers

Just checked 3 boxes, and it seems to be a general problem. From a quick look, it seems like the errored timestamp correlates with the timestamp of the last start.

2023-02-24 08:00:03 [3968444] [0] connecting to gearmand 'localhost:4735'
2023-02-24 08:00:03 [3968445] [0] connecting to gearmand 'localhost:4735'
2023-02-24 11:06:41 [466701] [0] connecting to gearmand 'localhost:4735'
2023-02-24 11:07:00 [466701] [0] RRDs::update /omd/sites/mon/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677222003:0.008004
2023-02-24 11:07:00 [466701] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/mon/tmp/run/rrdcached.sock: illegal attempt to update using time 1677222003.000000 when last update time is 1677233102.000000 (minimum one second step)

now when i translate the 1677222003 ts to human time i get Fri Feb 24 08:00:03 2023 again.

But nothing changed there for years, so i'd assume this should happen with 4.40 as well. The only thing changed in OMD 5.x is the rrdtool itself.

But i guess this does not affect normal operations.

sni avatar Feb 24 '23 15:02 sni

But nothing changed there for years, so i'd assume this should happen with 4.40 as well. The only thing changed in OMD 5.x is the rrdtool itself.

Ahh interesting; that certainly shines some light on it. I'll investigate changes in rrdtool between the versions, I wonder if it's now more strict about conditions or something than it was previously.

infraweavers avatar Feb 24 '23 15:02 infraweavers

Just to add more weight to the "timestamp lines up with when the box started", I just caught this one:

.../.pnp-internal/runtime.xml rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1676991524.000000 when last update time is 1677502140.000000 (minimum one second step)

Which is basically Tue Feb 21 14:58:44 2023 UTC vs Mon Feb 27 12:49:00 2023 UTC; which basically aligns with the uptime of the box. So now I'm thinking the problem may be somewhere within rrdcached

infraweavers avatar Feb 27 '23 12:02 infraweavers

So now I'm thinking the problem may be somewhere within rrdcached

Personally i think the issue is in how pnp calculates and submits it's statistics.

sni avatar Feb 27 '23 13:02 sni

Personally i think the issue is in how pnp calculates and submits it's statistics.

I've decorated a load of logging into store_internals, check_internals, read_internals and process_internals to see what timestamps are used etc; just waiting for an occurrence now

infraweavers avatar Feb 27 '23 14:02 infraweavers

Aha! You are correct, it looks like this is something to do with when process_perfdata runs as a gearmanworker.

So I added a load of logging and we can see:

2023-02-27 15:16:35 [1157828] [0] connecting to gearmand '0.0.0.0:4730'
2023-02-27 15:16:35 [1157828] [0] DIAG: about to call store_internals with time: '1677510517' at time '1677510995'
2023-02-27 15:16:35 [1157828] [0] DIAG: store_internals called from: 190
2023-02-27 15:16:35 [1157828] [0] DIAG: store_internals called from: 190 wrote out: '/omd/sites/default/tmp/pnp4nagios/stats/27958508' with '1677510517'
2023-02-27 15:16:35 [1157828] [0] DIAG: check_internals called
2023-02-27 15:16:35 [1157828] [0] DIAG: read_internals called
2023-02-27 15:16:35 [1157828] [0] DIAG: read_internals processing: 27958515
2023-02-27 15:16:35 [1157828] [0] DIAG: process_internals called: TIMET: 1677510900
2023-02-27 15:17:00 [1157828] [0] DIAG: about to call store_internals with time: '1677510995' at time '1677511020'
2023-02-27 15:17:00 [1157828] [0] DIAG: store_internals called from: 190
2023-02-27 15:17:00 [1157828] [0] DIAG: store_internals called from: 190 wrote out: '/omd/sites/default/tmp/pnp4nagios/stats/27958516' with '1677510995'
2023-02-27 15:17:00 [1157828] [0] DIAG: check_internals called
2023-02-27 15:17:00 [1157828] [0] DIAG: read_internals called
2023-02-27 15:17:00 [1157828] [0] DIAG: read_internals processing: 27958508
2023-02-27 15:17:00 [1157828] [0] DIAG: process_internals called: TIMET: 1677510517
2023-02-27 15:17:00 [1157828] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677510517:0.002864
2023-02-27 15:17:00 [1157828] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677510517.000000 when last update time is 1677510900.000000 (minimum one second step)
2023-02-27 15:17:00 [1157828] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677510517:1
2023-02-27 15:17:00 [1157828] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677510517.000000 when last update time is 1677510900.000000 (minimum one second step)
2023-02-27 15:17:00 [1157828] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677510517:0
2023-02-27 15:17:00 [1157828] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677510517.000000 when last update time is 1677510900.000000 (minimum one second step)
2023-02-27 15:17:00 [1157828] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677510517:0
2023-02-27 15:17:00 [1157828] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677510517.000000 when last update time is 1677510900.000000 (minimum one second step)
2023-02-27 15:17:00 [1157828] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677510517:0
2023-02-27 15:17:00 [1157828] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677510517.000000 when last update time is 1677510900.000000 (minimum one second step)
2023-02-27 15:17:00 [1157828] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677510517:1
2023-02-27 15:17:00 [1157828] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677510517.000000 when last update time is 1677510900.000000 (minimum one second step)
2023-02-27 15:17:00 [1157828] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677510517:0
2023-02-27 15:17:00 [1157828] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677510517.000000 when last update time is 1677510900.000000 (minimum one second step)
2023-02-27 15:18:00 [1157828] [0] DIAG: about to call store_internals with time: '1677511020' at time '1677511080'
2023-02-27 15:18:00 [1157828] [0] DIAG: store_internals called from: 190
2023-02-27 15:18:00 [1157828] [0] DIAG: store_internals called from: 190 wrote out: '/omd/sites/default/tmp/pnp4nagios/stats/27958517' with '1677511020'
2023-02-27 15:18:00 [1157828] [0] DIAG: check_internals called
2023-02-27 15:18:00 [1157828] [0] DIAG: read_internals called

So we can see that pnp4nagios explicitly writes out the internal stats as 1677510517 on one of the iterations and then 1 minute later, tries to use that timestamp. The value of 1677510517 is approximately when the pnp_gearman_worker was started.

The 2023-02-27 15:17:00 [1157828] [0] DIAG: about to call store_internals with time: '1677510995' at time '1677511020' log is position here: https://github.com/pnp4nagios/pnp4nagios/blob/master/scripts/process_perfdata.pl.in#L191

So as far as I see it, the problem is that $stats{timet} isn't being set on startup of the gearman worker, so the first run results in it writing things out with the wrong timestamp. I can't see a way of setting that to be the right value at the right time as it sort of seems like here: https://github.com/pnp4nagios/pnp4nagios/blob/master/scripts/process_perfdata.pl.in#L1518 "before" the worker function is run, we want to set $stats{timet} = time for that given instance of the worker.

infraweavers avatar Feb 27 '23 15:02 infraweavers

Yeah, I think I've confirmed that by initialising %stats to have timet=0;

It looks like every time that a new child is created, it has 0 has the time value.

2023-02-27 17:11:33 [1195962] [0] starting new child (running = 0)
2023-02-27 17:11:33 [1202144] [0] connecting to gearmand '0.0.0.0:4730'
2023-02-27 17:11:33 [1202144] [0] DIAG: about to call store_internals with time: '0' at time '1677517893'
2023-02-27 17:11:33 [1202144] [0] DIAG: store_internals called from: 190
2023-02-27 17:11:33 [1202144] [0] DIAG: store_internals called from: 190 wrote out: '/omd/sites/default/tmp/pnp4nagios/stats/0' with '0'
2023-02-27 17:11:33 [1202144] [0] DIAG: check_internals called
2023-02-27 17:11:33 [1202144] [0] DIAG: read_internals called
2023-02-27 17:11:33 [1202144] [0] DIAG: read_internals processing: 27958630
2023-02-27 17:11:33 [1202144] [0] DIAG: process_internals called: TIMET: 1677517800

So the temptation is to default it to 0 but not actually do the writing out of the stats (by skipping it https://github.com/pnp4nagios/pnp4nagios/blob/master/scripts/process_perfdata.pl.in#L190) as it's the first sample of the runner and the numbers etc will be of limited value anyway? Not sure

infraweavers avatar Feb 27 '23 17:02 infraweavers

OK I've got a fix, but this patch doesn't seem to be in the right format. How do I generate one that looks like this: https://github.com/ConSol/omd/blob/labs/packages/pnp4nagios/patches/158-srv-zoom-fix.patch?

fix_rrd_errors_timestamp.patch

I still don't know why this has only just become a problem, I'm theorising that rrdtool has got more strict about erroring about older updates or something

infraweavers avatar Feb 28 '23 09:02 infraweavers

PR for the fix/patch: https://github.com/ConSol/omd/issues/155

infraweavers avatar Mar 01 '23 09:03 infraweavers