omd
omd copied to clipboard
Since upgrading to OMD 5.10 .pnp-internal throws intermittant errors
Not dug into this yet, however we run check_pnp_rrds against our OMD boxes and this is throwing:
WARNING: 3764 XML Files checked. 1 RRD Errors found. 0 old XML Files found
.../.pnp-internal/runtime.xml rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1676988835.000000 when last update time is 1677053640.000000 (minimum one second step)
/omd/sites/default/local/lib/nagios/plugins/check_pnp_rrds.pl -a 1 --rrdpath /media/omd/default/perfdata
is the command we use on the OMD site called default
Normally when something like this happens it's because a check is returning identically keyed perfdata like blah=1; blah=1 so I'll rule that out before digging any further.
This didn't happen under OMD 4.40, it may have been happening on the 5.0 nightly but I'm not 100% sure
We run the following OMD configure on create:
omd config set GEARMAND on
omd config set GEARMAND_PORT 0.0.0.0:4730
omd config set GEARMAN_WORKER on
omd config set LIVESTATUS_TCP on
omd config set LIVESTATUS_TCP_PORT 6557
omd config set MOD_GEARMAN on
omd config set PNP4NAGIOS gearman
omd config set THRUK_COOKIE_AUTH off
omd config set GRAFANA on
I'm 99% sure this is related to pnp4nagios being in gearman mode now
Hmm, I've not had it happen yet on a box I'm actively watching but it looks like sometimes there are multiple updates happening in close succession, sometimes. It looks like the updates actually happen randomly o.0 ; I wonder if there's a chance they collide sometimes:
OMD[default@OMDA02]:~/var/pnp4nagios/log$ tail -f perfdata.log | grep pnp-internal
tail: perfdata.log: file truncated
2023-02-22 13:11:00 [482076] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677070464:0.036293
2023-02-22 13:11:00 [482076] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677070464:2
2023-02-22 13:11:00 [482076] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677070464:0
2023-02-22 13:11:00 [482076] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677070464:0
2023-02-22 13:11:00 [482076] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677070464:0
2023-02-22 13:11:00 [482076] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677070464:2
2023-02-22 13:11:00 [482076] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677070464:0
tail: perfdata.log: file truncated
tail: perfdata.log: file truncated
2023-02-22 13:26:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677070464:0.002062
2023-02-22 13:26:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677070464:1
2023-02-22 13:26:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677070464:0
2023-02-22 13:26:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677070464:0
2023-02-22 13:26:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677070464:0
2023-02-22 13:26:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677070464:1
2023-02-22 13:26:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677070464:0
2023-02-22 13:27:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677070464:0.017532
2023-02-22 13:27:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677070464:1
2023-02-22 13:27:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677070464:0
2023-02-22 13:27:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677070464:0
2023-02-22 13:27:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677070464:0
2023-02-22 13:27:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677070464:58
2023-02-22 13:27:00 [486502] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677070464:0
Think it's happening way more frequently and those logs are actually all the errors as the full logs look like:
2023-02-22 13:58:00 [491420] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677070464:0.002735
2023-02-22 13:58:00 [491420] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677070464.000000 when last update time is 1677074160.000000 (minimum one second step)
2023-02-22 13:58:00 [491420] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677070464:1
2023-02-22 13:58:00 [491420] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677070464.000000 when last update time is 1677074160.000000 (minimum one second step)
2023-02-22 13:58:00 [491420] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677070464:0
2023-02-22 13:58:00 [491420] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677070464.000000 when last update time is 1677074160.000000 (minimum one second step)
2023-02-22 13:58:00 [491420] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677070464:0
2023-02-22 13:58:00 [491420] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677070464.000000 when last update time is 1677074160.000000 (minimum one second step)
2023-02-22 13:58:00 [491420] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677070464:0
2023-02-22 13:58:00 [491420] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677070464.000000 when last update time is 1677074160.000000 (minimum one second step)
2023-02-22 13:58:00 [491420] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677070464:4
2023-02-22 13:58:00 [491420] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677070464.000000 when last update time is 1677074160.000000 (minimum one second step)
2023-02-22 13:58:00 [491420] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677070464:0
2023-02-22 13:58:00 [491420] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677070464.000000 when last update time is 1677074160.000000 (minimum one second step)
2023-02-22 13:58:00 [491420] [1] Gearman job end (runtime 0.000734s) ...
There are 2 perfdata gearman workers running so I think this is when they collide or something.
According to ps aux, there is actually 3 running, which I wasn't expecting:
OMD[default@OMDA01]:~$ ps aux | grep pnp
default 58580 0.0 0.1 75892 16420 ? Ss Feb21 0:00 /usr/bin/perl /omd/sites/default/lib/pnp4nagios/process_perfdata.pl --pidfile=/omd/sites/default/tmp/run/pnp_gearman_worker.pid --config=/omd/sites/default/etc/pnp4nagios/process_perfdata.cfg --gearman=0.0.0.0:4730 --daemon
default 3125308 6.8 0.1 82812 25452 ? S 14:02 0:26 /usr/bin/perl /omd/sites/default/lib/pnp4nagios/process_perfdata.pl --pidfile=/omd/sites/default/tmp/run/pnp_gearman_worker.pid --config=/omd/sites/default/etc/pnp4nagios/process_perfdata.cfg --gearman=0.0.0.0:4730 --daemon
default 3158038 6.6 0.1 82804 25496 ? S 14:05 0:12 /usr/bin/perl /omd/sites/default/lib/pnp4nagios/process_perfdata.pl --pidfile=/omd/sites/default/tmp/run/pnp_gearman_worker.pid --config=/omd/sites/default/etc/pnp4nagios/process_perfdata.cfg --gearman=0.0.0.0:4730 --daemon
I'm going to change ~/etc/pnp4nagios/process_perfdata.cfg and set PREFORK=1 instead of the default (2) and see if that clears it up.
Edit: It does not seem to
Hmm, that doesn't seem to have solved it; although we do also get it intermittantly for a few other checks and always seems to be the same ones, but the most common is definitely the .pnp-internal one
2023-02-22 14:54:00 [919114] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677075707:0.030777
2023-02-22 14:54:00 [919114] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677075707.000000 when last update time is 1677077520.000000 (minimum one second step)
2023-02-22 14:54:00 [919114] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677075707:1
2023-02-22 14:54:00 [919114] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677075707.000000 when last update time is 1677077520.000000 (minimum one second step)
2023-02-22 14:54:00 [919114] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677075707:0
2023-02-22 14:54:00 [919114] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677075707.000000 when last update time is 1677077520.000000 (minimum one second step)
2023-02-22 14:54:00 [919114] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677075707:0
2023-02-22 14:54:00 [919114] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677075707.000000 when last update time is 1677077520.000000 (minimum one second step)
2023-02-22 14:54:00 [919114] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677075707:0
2023-02-22 14:54:00 [919114] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677075707.000000 when last update time is 1677077520.000000 (minimum one second step)
2023-02-22 14:54:00 [919114] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677075707:8
2023-02-22 14:54:00 [919114] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677075707.000000 when last update time is 1677077520.000000 (minimum one second step)
2023-02-22 14:54:00 [919114] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677075707:0
2023-02-22 14:54:00 [919114] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677075707.000000 when last update time is 1677077520.000000 (minimum one second step)
I'm swapping the mod_gearman_worker back to the version from 4.40 to investigate a different problem we've been seeing on a few boxes, so I'm hoping that'll rule that out (I don't think there's any overlap between those 2 problems anyway)
Just caught this with LOG_LEVEL = 2; there's only 6 seconds between updates, which is odd. I'm wondering now if this is:
- "normal" but our
.rrd's are ancient and are misconfigured for legacy reasons or something (we've had the same data for years) - something has changed around how often this happens
Log snippet:
2023-02-24 15:00:54 [3861809] [2] data2rrd called
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/RADA01/Windows_Internal_Database_Memory_Usage_WID-sqlservr_(Bytes)__.rrd 1677250853:2366865408.000000
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/RADA01/Windows_Internal_Database_Memory_Usage_WID-sqlservr_(Bytes)__.rrd updated
2023-02-24 15:00:54 [3861809] [2] data2rrd called
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677250740:30.195321
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd updated
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677250740:1257
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd updated
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677250740:0
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd updated
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677250740:0
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd updated
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677250740:0
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd updated
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677250740:19927
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd updated
2023-02-24 15:00:54 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677250740:0
2023-02-24 15:00:54 [3861809] [2] /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd updated
2023-02-24 15:00:54 [3861809] [1] Gearman job end (runtime 0.017263s) ...
2023-02-24 15:00:54 [3861809] [1] Gearman Worker Job start
....
2023-02-24 15:01:00 [3861809] [2] data2rrd called
2023-02-24 15:01:00 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677250378:0.017263
2023-02-24 15:01:00 [3861809] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677250378:0.017263
2023-02-24 15:01:00 [3861809] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677250378.000000 when last update time is 1677250740.000000 (minimum one second step)
2023-02-24 15:01:00 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677250378:1
2023-02-24 15:01:00 [3861809] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677250378:1
2023-02-24 15:01:00 [3861809] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677250378.000000 when last update time is 1677250740.000000 (minimum one second step)
2023-02-24 15:01:00 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677250378.000000 when last update time is 1677250740.000000 (minimum one second step)
2023-02-24 15:01:00 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677250378.000000 when last update time is 1677250740.000000 (minimum one second step)
2023-02-24 15:01:00 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677250378.000000 when last update time is 1677250740.000000 (minimum one second step)
2023-02-24 15:01:00 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677250378:1
2023-02-24 15:01:00 [3861809] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677250378:1
2023-02-24 15:01:00 [3861809] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677250378.000000 when last update time is 1677250740.000000 (minimum one second step)
2023-02-24 15:01:00 [3861809] [2] RRDs::update --daemon=unix:/omd/sites/default/tmp/run/rrdcached.sock /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677250378:0
2023-02-24 15:01:00 [3861809] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677250378.000000 when last update time is 1677250740.000000 (minimum one second step)
2023-02-24 15:01:00 [3861809] [1] Gearman job end (runtime 0.001557s) ...
2023-02-24 15:01:00 [3861809] [1] Gearman Worker Job start
...
Just checked 3 boxes, and it seems to be a general problem. From a quick look, it seems like the errored timestamp correlates with the timestamp of the last start.
2023-02-24 08:00:03 [3968444] [0] connecting to gearmand 'localhost:4735'
2023-02-24 08:00:03 [3968445] [0] connecting to gearmand 'localhost:4735'
2023-02-24 11:06:41 [466701] [0] connecting to gearmand 'localhost:4735'
2023-02-24 11:07:00 [466701] [0] RRDs::update /omd/sites/mon/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677222003:0.008004
2023-02-24 11:07:00 [466701] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/mon/tmp/run/rrdcached.sock: illegal attempt to update using time 1677222003.000000 when last update time is 1677233102.000000 (minimum one second step)
now when i translate the 1677222003 ts to human time i get Fri Feb 24 08:00:03 2023 again.
But nothing changed there for years, so i'd assume this should happen with 4.40 as well. The only thing changed in OMD 5.x is the rrdtool itself.
But i guess this does not affect normal operations.
But nothing changed there for years, so i'd assume this should happen with 4.40 as well. The only thing changed in OMD 5.x is the rrdtool itself.
Ahh interesting; that certainly shines some light on it. I'll investigate changes in rrdtool between the versions, I wonder if it's now more strict about conditions or something than it was previously.
Just to add more weight to the "timestamp lines up with when the box started", I just caught this one:
.../.pnp-internal/runtime.xml rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1676991524.000000 when last update time is 1677502140.000000 (minimum one second step)
Which is basically Tue Feb 21 14:58:44 2023 UTC vs Mon Feb 27 12:49:00 2023 UTC; which basically aligns with the uptime of the box. So now I'm thinking the problem may be somewhere within rrdcached
So now I'm thinking the problem may be somewhere within rrdcached
Personally i think the issue is in how pnp calculates and submits it's statistics.
Personally i think the issue is in how pnp calculates and submits it's statistics.
I've decorated a load of logging into store_internals, check_internals, read_internals and process_internals to see what timestamps are used etc; just waiting for an occurrence now
Aha! You are correct, it looks like this is something to do with when process_perfdata runs as a gearmanworker.
So I added a load of logging and we can see:
2023-02-27 15:16:35 [1157828] [0] connecting to gearmand '0.0.0.0:4730'
2023-02-27 15:16:35 [1157828] [0] DIAG: about to call store_internals with time: '1677510517' at time '1677510995'
2023-02-27 15:16:35 [1157828] [0] DIAG: store_internals called from: 190
2023-02-27 15:16:35 [1157828] [0] DIAG: store_internals called from: 190 wrote out: '/omd/sites/default/tmp/pnp4nagios/stats/27958508' with '1677510517'
2023-02-27 15:16:35 [1157828] [0] DIAG: check_internals called
2023-02-27 15:16:35 [1157828] [0] DIAG: read_internals called
2023-02-27 15:16:35 [1157828] [0] DIAG: read_internals processing: 27958515
2023-02-27 15:16:35 [1157828] [0] DIAG: process_internals called: TIMET: 1677510900
2023-02-27 15:17:00 [1157828] [0] DIAG: about to call store_internals with time: '1677510995' at time '1677511020'
2023-02-27 15:17:00 [1157828] [0] DIAG: store_internals called from: 190
2023-02-27 15:17:00 [1157828] [0] DIAG: store_internals called from: 190 wrote out: '/omd/sites/default/tmp/pnp4nagios/stats/27958516' with '1677510995'
2023-02-27 15:17:00 [1157828] [0] DIAG: check_internals called
2023-02-27 15:17:00 [1157828] [0] DIAG: read_internals called
2023-02-27 15:17:00 [1157828] [0] DIAG: read_internals processing: 27958508
2023-02-27 15:17:00 [1157828] [0] DIAG: process_internals called: TIMET: 1677510517
2023-02-27 15:17:00 [1157828] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_runtime.rrd 1677510517:0.002864
2023-02-27 15:17:00 [1157828] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677510517.000000 when last update time is 1677510900.000000 (minimum one second step)
2023-02-27 15:17:00 [1157828] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_rows.rrd 1677510517:1
2023-02-27 15:17:00 [1157828] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677510517.000000 when last update time is 1677510900.000000 (minimum one second step)
2023-02-27 15:17:00 [1157828] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_errors.rrd 1677510517:0
2023-02-27 15:17:00 [1157828] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677510517.000000 when last update time is 1677510900.000000 (minimum one second step)
2023-02-27 15:17:00 [1157828] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_invalid.rrd 1677510517:0
2023-02-27 15:17:00 [1157828] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677510517.000000 when last update time is 1677510900.000000 (minimum one second step)
2023-02-27 15:17:00 [1157828] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_skipped.rrd 1677510517:0
2023-02-27 15:17:00 [1157828] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677510517.000000 when last update time is 1677510900.000000 (minimum one second step)
2023-02-27 15:17:00 [1157828] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_update.rrd 1677510517:1
2023-02-27 15:17:00 [1157828] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677510517.000000 when last update time is 1677510900.000000 (minimum one second step)
2023-02-27 15:17:00 [1157828] [0] RRDs::update /omd/sites/default/var/pnp4nagios/perfdata/.pnp-internal/runtime_create.rrd 1677510517:0
2023-02-27 15:17:00 [1157828] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/default/tmp/run/rrdcached.sock: illegal attempt to update using time 1677510517.000000 when last update time is 1677510900.000000 (minimum one second step)
2023-02-27 15:18:00 [1157828] [0] DIAG: about to call store_internals with time: '1677511020' at time '1677511080'
2023-02-27 15:18:00 [1157828] [0] DIAG: store_internals called from: 190
2023-02-27 15:18:00 [1157828] [0] DIAG: store_internals called from: 190 wrote out: '/omd/sites/default/tmp/pnp4nagios/stats/27958517' with '1677511020'
2023-02-27 15:18:00 [1157828] [0] DIAG: check_internals called
2023-02-27 15:18:00 [1157828] [0] DIAG: read_internals called
So we can see that pnp4nagios explicitly writes out the internal stats as 1677510517 on one of the iterations and then 1 minute later, tries to use that timestamp. The value of 1677510517 is approximately when the pnp_gearman_worker was started.
The 2023-02-27 15:17:00 [1157828] [0] DIAG: about to call store_internals with time: '1677510995' at time '1677511020' log is position here:
https://github.com/pnp4nagios/pnp4nagios/blob/master/scripts/process_perfdata.pl.in#L191
So as far as I see it, the problem is that $stats{timet} isn't being set on startup of the gearman worker, so the first run results in it writing things out with the wrong timestamp. I can't see a way of setting that to be the right value at the right time as it sort of seems like here: https://github.com/pnp4nagios/pnp4nagios/blob/master/scripts/process_perfdata.pl.in#L1518 "before" the worker function is run, we want to set $stats{timet} = time for that given instance of the worker.
Yeah, I think I've confirmed that by initialising %stats to have timet=0;
It looks like every time that a new child is created, it has 0 has the time value.
2023-02-27 17:11:33 [1195962] [0] starting new child (running = 0)
2023-02-27 17:11:33 [1202144] [0] connecting to gearmand '0.0.0.0:4730'
2023-02-27 17:11:33 [1202144] [0] DIAG: about to call store_internals with time: '0' at time '1677517893'
2023-02-27 17:11:33 [1202144] [0] DIAG: store_internals called from: 190
2023-02-27 17:11:33 [1202144] [0] DIAG: store_internals called from: 190 wrote out: '/omd/sites/default/tmp/pnp4nagios/stats/0' with '0'
2023-02-27 17:11:33 [1202144] [0] DIAG: check_internals called
2023-02-27 17:11:33 [1202144] [0] DIAG: read_internals called
2023-02-27 17:11:33 [1202144] [0] DIAG: read_internals processing: 27958630
2023-02-27 17:11:33 [1202144] [0] DIAG: process_internals called: TIMET: 1677517800
So the temptation is to default it to 0 but not actually do the writing out of the stats (by skipping it https://github.com/pnp4nagios/pnp4nagios/blob/master/scripts/process_perfdata.pl.in#L190) as it's the first sample of the runner and the numbers etc will be of limited value anyway? Not sure
OK I've got a fix, but this patch doesn't seem to be in the right format. How do I generate one that looks like this: https://github.com/ConSol/omd/blob/labs/packages/pnp4nagios/patches/158-srv-zoom-fix.patch?
fix_rrd_errors_timestamp.patch
I still don't know why this has only just become a problem, I'm theorising that rrdtool has got more strict about erroring about older updates or something
PR for the fix/patch: https://github.com/ConSol/omd/issues/155