MySQL graphs randomly broken with different versions of MySQL
The problem
Hi, we are running many instances (about 30-40) of LibreNMS based on the official OVA image provided on the website, which shipped with Ubuntu 18.04 when we began. These virtual machines have been updated regularly and now run Ubuntu 20.04 LTS. After some time we observed that the MySQL graphs were randomly and permanently broken, showing NaN values. We have many MySQL servers (virtual and physical machines, versions from 5.5 to 8, standalone, master-slave and clustered with Percona XtraDB). The graph issue seems to occur randomly with all versions and configurations, even for the local MySQL instance on the LibreNMS host itself.
First we ruled out hardware issues by trying different physical servers for LibreNMS, including high-end hardware (AMD EPYC, 256 GB ECC RAM, datacenter SSDs). Conclusion: the problem is not related to any specific hardware.
Then we tried to locate the issue by following every step of the data processing. We started with the PHP script that collects the MySQL values on the MySQL server and finished by looking inside the RRD files produced by LibreNMS, but we couldn't find any malfunction:
- The PHP script that collects the MySQL data, located in /etc/snmp/script/mysql, runs properly (as the snmp user) and returns values.
- The snmpd daemon of a remote MySQL server transmits correct values when queried with the snmpget command from a bash prompt on a LibreNMS host.
- The LibreNMS poller in debug mode shows the correct rrd command.
- The rrd command, executed manually, exits without error (status 0).
But there is no data (only NaN) in the RRD file.
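To confirm which data sources in a file hold only NaN, one option is to capture the text printed by `rrdtool fetch` and count the unknown samples per column. The following is a minimal Python sketch, not part of LibreNMS. It assumes the usual `rrdtool fetch` layout (a header line of data source names followed by `timestamp: value value ...` rows), and the function name and the sample names `ds_a` and `ds_b` are hypothetical.

```python
import math
from typing import Dict


def nan_ratio_per_ds(fetch_output: str) -> Dict[str, float]:
    """Return, for each data source column, the fraction of rows that are NaN.

    Assumes text shaped like `rrdtool fetch` output: the first non-empty line
    lists the data source names, every following line is "timestamp: v1 v2 ...".
    """
    lines = [line for line in fetch_output.splitlines() if line.strip()]
    if not lines:
        return {}

    names = lines[0].split()
    nan_counts = {name: 0 for name in names}
    rows = 0

    for line in lines[1:]:
        _timestamp, sep, rest = line.partition(":")
        values = rest.split()
        # Skip anything that does not look like a data row.
        if not sep or len(values) != len(names):
            continue
        rows += 1
        for name, raw in zip(names, values):
            # float() accepts "nan" and "-nan", both of which rrdtool may print.
            if math.isnan(float(raw)):
                nan_counts[name] += 1

    return {name: (nan_counts[name] / rows if rows else 0.0) for name in names}


# Hypothetical sample, shaped like fetch output for two data sources.
SAMPLE = """\
                 ds_a                ds_b

1630000000: -nan 1.0000000000e+00
1630000300: nan 2.0000000000e+00
"""

if __name__ == "__main__":
    for ds_name, ratio in nan_ratio_per_ds(SAMPLE).items():
        print(f"{ds_name}: {ratio:.0%} NaN")
```

A ratio of 1.0 for a column means every fetched sample of that data source is unknown, which corresponds to the broken graphs described above.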
When the problem occurs, deleting and recreating the mysql application (and purging the RRD files) does not solve the issue. But stopping rrdcached before purging the RRD files, then restarting it, fixed the graphs for a random period of time! With some MySQL hosts, rebooting and restarting mysqld also fixed the issue for some time, but in many cases the graphs stayed broken!
Simply restarting the rrdcached service didn't help either.
Later we found a lead by looking at the values sent to the RRD files. The poller for the mysql application adds datasets with a max value of 1.25e11. If some values exceed this max, the RRD files seem to be filled with NaN. Tuning the RRD file to unlimited did not help.
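For context, RRDtool documents that a processed value outside a data source's min/max range is stored as unknown, and for counter-style data sources the check applies to the computed rate rather than the raw counter. The Python sketch below is only a simplified model of that range check, not RRDtool's actual code. The function name is hypothetical, and the minimum of 0 is an assumption; only the 1.25e11 maximum comes from this report. It is consistent with NaN appearing once values pass 1.25e11, but it does not by itself explain why tuning the file to unlimited failed.

```python
import math
from typing import Optional

# Maximum reported in this issue for the mysql application datasets.
DS_MAX = 1.25e11


def apply_ds_range(
    value: float,
    ds_min: Optional[float] = 0.0,   # assumed minimum, not taken from the issue
    ds_max: Optional[float] = DS_MAX,
) -> float:
    """Model of a min/max range check: out-of-range values become NaN.

    Passing None for a bound models an unlimited ("U") bound.
    """
    if ds_min is not None and value < ds_min:
        return math.nan
    if ds_max is not None and value > ds_max:
        return math.nan
    return value


if __name__ == "__main__":
    for sample in (1.0e11, 1.25e11, 1.3e11):
        print(sample, "->", apply_ds_range(sample))
```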
Finally we tried a dirty patch that caps the values emitted by the /etc/snmp/scripts/mysql script at 124999999999 whenever they reach 125000000000 or more. This dirty patch solved the issue on all our MySQL graphs in all our LibreNMS instances! (Please find a copy of the dirty patch below.) The patch shows no apparent side effects; it seems that the data exceeding 1.25e11 is not used to draw the MySQL graphs. Are those large values only stored in the RRD file? Does this happen after a year or more of production? We don't know.
We are not experts in LibreNMS and RRD files, so we would really appreciate feedback from someone with in-depth knowledge of the relationship between LibreNMS and the PHP script (/etc/snmp/script/mysql) that collects the MySQL data through SNMP requests.
Output of ./validate.php
/opt/librenms/validate.php
====================================
Component | Version
--------- | -------
LibreNMS | 21.6.0
DB Schema | 2021_25_01_0127_create_isis_adjacencies_table (210)
PHP | 7.4.16
Python | 3.8.10
MySQL | 10.3.31-MariaDB-0ubuntu0.20.04.1
RRDTool | 1.7.2
SNMP | NET-SNMP 5.8
====================================
[OK] Composer Version: 2.1.6
[OK] Dependencies up-to-date.
[OK] Database connection successful
[OK] Database schema correct
What was the last working version of LibreNMS?
No response
Anything in the logs that might be useful for us?
No response
The dirty patch:
list($short, $val) = explode(":", $item);
// Cap values so they stay below the 1.25e11 dataset maximum.
if ($val > 124999999999) {
    $val = 124999999999;
}
echo(strtolower($short).":".strtolower($val)."\n");
} // closes the enclosing loop (not shown in this excerpt)
debug(array("Final result", $output));
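To try the capping logic outside the agent script, here is a rough Python equivalent of the patch above. It is only a sketch: the function name and the sample keys are hypothetical, and unlike the PHP version it passes non-integer values through unchanged instead of relying on PHP's loose comparison.

```python
from typing import Iterable, List

# Cap used by the patch: one below the 1.25e11 dataset maximum.
CAP = 124_999_999_999


def clamp_items(items: Iterable[str], cap: int = CAP) -> List[str]:
    """Lower-case each "key:value" item and cap integer values at `cap`."""
    result = []
    for item in items:
        short, _, val = item.partition(":")
        try:
            number = int(val)
        except ValueError:
            # Not an integer: emit it as-is (lower-cased, like the PHP patch).
            result.append(f"{short.lower()}:{val.lower()}")
            continue
        result.append(f"{short.lower()}:{min(number, cap)}")
    return result


if __name__ == "__main__":
    # Hypothetical keys; the real script emits its own short names.
    for line in clamp_items(["AA:42", "BB:125000000000", "CC:999999999999"]):
        print(line)
```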
If you look at https://github.com/librenms/librenms/blob/master/includes/polling/applications/mysql.inc.php#L118 the max is set there... probably a fix would include setting that to a larger number or unlimited. That would only fix new rrd files.
Likely, your counters have just reached higher numbers after running for a certain amount of time. You should be able to dump your rrd files, modify them and reimport: https://www.winni.at/index.php/de/kb/82-rrdtool-edit-rrd-file-export-and-import
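As a rough illustration of the dump, modify and reimport route: `rrdtool dump file.rrd > file.xml` writes an XML copy, and `rrdtool restore file.xml new.rrd` rebuilds a file from it. The Python sketch below covers only the edit step in between. It assumes each data source maximum appears in the dump as a `<max>...</max>` element and that `NaN` is how an unlimited bound is written; both are assumptions to verify against a real dump. The function name is hypothetical. Since the report above found that rrdcached had to be stopped when purging files, stopping it before swapping files is probably wise here too.

```python
import re

# Matches a <max> element as it is assumed to appear in `rrdtool dump` output.
_MAX_TAG = re.compile(r"<max>[^<]*</max>")


def lift_ds_max(dump_xml: str, new_max: str = "NaN") -> str:
    """Return the dump text with every <max> value replaced by `new_max`.

    "NaN" is assumed to represent an unlimited maximum in the dump format.
    Everything else in the document is left untouched.
    """
    return _MAX_TAG.sub(lambda _match: f"<max>{new_max}</max>", dump_xml)


# Hypothetical fragment shaped like the header of a dump.
SAMPLE_DUMP = (
    "<ds>\n"
    "  <name> example_ds </name>\n"
    "  <min>0.0000000000e+00</min>\n"
    "  <max>1.2500000000e+11</max>\n"
    "</ds>\n"
)

if __name__ == "__main__":
    print(lift_ds_max(SAMPLE_DUMP))
```

Note that later comments in this thread report that raising the limit alone did not bring the graphs back, so treat this as something to test on a copy of the file first.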
I have the same problem and tried to change the maximum in mysql.inc.php. I tried both "U" for unlimited and "1250000000000000000". Each time I deleted the device from LibreNMS and added it again, but the graph showed NaN in both cases.
I tried the "dirty patch" from @ofontes and it works: I see graphs now. But I am not really satisfied with this solution. I don't understand why raising the limit does not work.
This issue has been mentioned on LibreNMS Community. There might be relevant details there:
https://community.librenms.org/t/mysql-application-graph-showing-nan/16770/18
Nice, I have noticed this issue too, but I was only monitoring one MySQL server. It does seem you have found where the issue is.
This issue has been mentioned on LibreNMS Community. There might be relevant details there:
https://community.librenms.org/t/no-graphs-for-mysql-application/22866/1