Develop a pluggable logging API for Cacti to be able to log to alternate logging formats
Feature Request
Is your feature request related to a problem? Please describe
Support extensible logging to Splunk, Kafka, Elastic, etc.
Describe the solution you'd like
We need to determine the minimum set of logging attributes required to cover these various logging solutions, then define the minimal JSON, XML, etc. structures needed to perform such logging, and finally build an API and class library that extends the output format through a pluggable and extensible output class.
Describe alternatives you've considered
Continuing with the one-size-fits-all logging format.
Additional context
This is an R&D project that should deliver a report on what should happen vs. simply coding an enhancement. So, it's a call for a design document.
Thoughts about capabilities and formats
The local logging must be unified
- each log line must have the same structure
- each log line must have the same minimum information
- log lines need to be easily extendable with additional information (for example for debugging)
- adding verbosity should not alter the start of the lines, so that they can still be parsed by existing functions
- a common log line format must be defined, and several elements need a static position and a separator that can be clearly identified
- suggestion for a local-file log format:
"<timestamp> <section/facility>:<subsection> - <log message> -- <key/value pairs> == <optional info>" - for standard syslog, we should try to match RFC5424 conformity (logs with structured data) -> https://datatracker.ietf.org/doc/html/rfc5424#section-6.5 example 3
Considerations for configuration
- make it flexible
- keep it simple (too many options might confuse the users - maybe add "expert" section?)
- global options to enable/disable specific elements per section
- log levels should be configurable at least per facility or per code-section
- optional information needs to be configurable (for example: line numbers and code portions are usually only needed for debugging)
Additional considerations:
- there are several logging functions in different parts of cacti, this needs some unification
- allow more detailed logging in a specific section, for example when something is going on with Boost -> switch on Boost logging at a chosen severity
- transfer in output format should be done as final step (keep the logging item in a data structure as long as possible, separated by identifiers mentioned below)
- we might need to think about the general concept of logging into different files. Is that really necessary or wouldn't it be better to keep everything in one file? And if we want to keep supporting multiple files, which information shall go into which file?
log items/identifiers
Minimum/mandatory set of information (identifiers)
- timestamp in epoch or UTC (local time conversion later)
- facility/section inside cacti (i.e. Spine, cmdphp, Thresholds, Core, Automate, SQL, ...)
- subsection (TBD)
- severity (similar to syslog: debug=7, info, notice, warning, error, critical, alert, emergency=0)
- message
Optional items (where applicable / during debugging)
- php-file in which the log was produced
- line number
- function name
- object that produced the log
- The message can include additional key/value pairs
- let's keep the old bracket-notation (like: host_id[42])
- if possible, make the keys identical to the DB-column names they refer to
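As a made-up example of that convention, a message containing "host_id[42] local_data_id[105]" would map directly to the metadata keys host_id = 42 and local_data_id = 105, so the keys can be matched against the referenced DB columns without any translation table.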
Collection of possible (sub)sections:
- [main] GUI operations
- [main] Boost
- [main] Polling (spine + cmdphp)
- [main] workflow/controlling
- [sub] database accesses
- [sub] command calls
- [sub] remote poller communication
@TheWitness @netniV please contribute here, you know the code and its structure better
support for output plug-ins
The logging function should act as a hub for plug-ins that allow logging not only to a local file but also to other destinations. Possible output plug-ins:
- local file (default plug-in to keep current functionality)
- Config: file location(s), rotation settings [hourly/daily/size-based, compression, archive folder structure]
- log-file should immediately have a creation-timestamp in its name
- local syslog (default plug-in to keep current functionality)
- Config: facility
- remote syslog (optional plug-in)
- Config: target-server, protocol, port, facility, CA
- Elasticsearch (optional plug-in, TBD)
- kafka bus (optional plug-in, TBD)
- other plug-ins (community-provided)
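To make the hub idea above a bit more tangible, here is a minimal sketch of what such a pluggable output interface could look like. All names (CactiLogOutput, LocalFileOutput, logging_dispatch) are placeholders for discussion, not existing Cacti code, and the line layout is simplified:

```php
<?php
// Strawman only: one interface per output plug-in, a dispatcher fans items out.
interface CactiLogOutput {
	/* Receive one structured log item (associative array) and write it out. */
	public function write(array $item);
}

class LocalFileOutput implements CactiLogOutput {
	private $path;

	public function __construct($path) {
		$this->path = $path;
	}

	public function write(array $item) {
		// Default plug-in: keep a human-readable local file, roughly like today.
		// Assumes the internal timestamp is kept in milliseconds.
		$ts = isset($item['timestamp']) ? (int) ($item['timestamp'] / 1000) : time();

		$line = sprintf("%s %s:%s - %s\n",
			date('Y-m-d H:i:s', $ts),
			isset($item['facility'])   ? $item['facility']   : 'Core',
			isset($item['subsection']) ? $item['subsection'] : '-',
			isset($item['message'])    ? $item['message']    : '');

		file_put_contents($this->path, $line, FILE_APPEND);
	}
}

// The core logging function would simply hand each item to every registered plug-in.
function logging_dispatch(array $item, array $outputs) {
	foreach ($outputs as $output) {
		$output->write($item);
	}
}
```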
Thoughts about implementation phases
prepare the refactoring
- create a new specific library-file:
/lib/logging.php - put all new logging-related functions into that file
- create new core-function which must be used in future for all logging purposes
- create a set of logging configuration in "settings" table: logging_
- create new GUI config elements for the new logging settings + code and sanity checks around them
- keep the current logging-function available (old plug-ins should be supported as long as possible)
- use the current code as source for analyzing the logging internals
- rebuild the basic functionality in the new function, add all planned features
- make the old code call the new code
- old code then acts as meta-code for the transition to new logging
- think about expiring old logging at some point in future
reforming the logging concept
- create a logging initialization procedure (a rough sketch follows this list)
- if possible/applicable, called only once during start of the instance
- initializes all logging-related variables
- checks if log-files & folders are available and creates them if needed
- long running processes might need a reload/re-init functionality to allow change of log levels (for example during a boost run)
- include the new core logging lib file everywhere, where logging is needed (which is basically everywhere)
- Start refactoring all "cacti_log" calls, replace them with the new core lib calls
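A rough sketch of the initialization idea mentioned above, just to frame the discussion. logging_init() is a made-up name; the only existing Cacti pieces referenced are read_config_option() and the path_cactilog setting, everything else is an assumption:

```php
<?php
// Hypothetical: initialize logging once per process, allow forced re-init
// (e.g. so a long running boost process can pick up changed log levels).
function logging_init($force = false) {
	static $initialized = false;

	if ($initialized && !$force) {
		return;
	}

	// Existing Cacti setting holding the log file location.
	$log_path = read_config_option('path_cactilog');
	$log_dir  = dirname($log_path);

	// Make sure the log folder exists and create it if needed.
	if (!is_dir($log_dir)) {
		@mkdir($log_dir, 0775, true);
	}

	// ... read the new logging_* settings, register output plug-ins, etc. ...

	$initialized = true;
}
```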
Whoever has ideas that need to be considered, please speak up! :) This is an open discussion to take into account as many thoughts as possible from the start. I will keep adding the information in compact form to my initial post.
Subsections should also exist for plug-in modules. It would be useful to filter messages from a specific plugin only, for example when I have debug enabled for it.
I hope to provide some comments next week.
All, I've had a detailed conversation with @Linegod whose day job has everything to do with logging. Let me summarize here where we aligned. I will do this in code form as well as descriptive.
- KISS - Yup, make it easy.
- Modify the cacti_log() function to take an array in addition to a string as input.
- For legacy logging, use the special tags that we use today (DS[], Device[], GRAPH[]) from legacy messages and break them into an object if the input to the function is a string.
- If the input to the function is an object/array, then allow logging in JSON and whatever format is applicable to the output function. @Linegod says it's pretty much all JSON these days, with some variations on the timestamp format.
- The plugin dictates things like topic, endpoint ip address, and auth data. The format could simply be a JSON file defining the plugin vs. a table. That's up for discussion.
Now for my position. One way or another, the cacti.log file will remain the same, emulating the tags already defined regardless of the input type.
We will discuss more tomorrow in our call.
Larry
We could store the plugins in a table as a name/value pair of course, with a pre-defined JSON structure holding the important columns as above. This is pretty popular these days with people using name/value stores such as Redis, but why not the Cacti settings table (just kidding); we will create its own table format and cache the document at the session level. It'll be stored in JSON. I guess the 'template' could be stored on disk. Let's discuss.
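As a strawman for what one of those stored plugin documents might contain (every field name here is a placeholder, not a proposed schema):

```json
{
    "name": "elasticsearch",
    "endpoint": "https://elastic.example.com:9200",
    "topic": "cacti-logs",
    "auth": {"type": "basic", "user": "cacti", "secret": "<from settings>"},
    "format": "json",
    "timestamp_format": "rfc3339"
}
```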
So, I would propose something like the following as a default object structure, it can of course be increased by caller:
$message = array(
	'timestamp' => $timestamp, // unix timestamp in integer or millisecond format (optional)
	'message'   => $text,
	'level'     => $level,     // LOW|MED|HIGH|DEBUG|...
	'hostname'  => $hostname,  // either Cacti or the device - I'm open
	'app'       => $app,       // third parameter on the current logging facility
	'metadata'  => array(
		'thing' => $value,
		// ...
	)
);
Then, as additional optional metadata that we know of today, which can be reverse engineered into the object structure and is currently captured in the text_get_regex_array() function, we have:
$regex_array = array(
1 => array('name' => 'DS', 'regex' => '( DS\[)([, \d]+)(\])', 'func' => 'text_regex_datasource'),
2 => array('name' => 'DQ', 'regex' => '( DQ\[)([, \d]+)(\])', 'func' => 'text_regex_dataquery'),
3 => array('name' => 'Device', 'regex' => '( Device\[)([, \d]+)(\])', 'func' => 'text_regex_device'),
4 => array('name' => 'Poller', 'regex' => '( Poller\[)([, \d]+)(\])', 'func' => 'text_regex_poller'),
5 => array('name' => 'RRA', 'regex' => "([_\/])(\d+)(\.rrd')", 'func' => 'text_regex_rra'),
6 => array('name' => 'GT', 'regex' => '( GT\[)([, \d]+)(\])', 'func' => 'text_regex_graphtemplates'),
7 => array('name' => 'Graph', 'regex' => '( Graph\[)([, \d]+)(\])', 'func' => 'text_regex_graphs'),
8 => array('name' => 'Graphs', 'regex' => '( Graphs\[)([, \d]+)(\])', 'func' => 'text_regex_graphs'),
9 => array('name' => 'User', 'regex' => '( User\[)([, \d]+)(\])', 'func' => 'text_regex_users'),
10 => array('name' => 'User', 'regex' => '( Users\[)([, \d]+)(\])', 'func' => 'text_regex_users'),
11 => array('name' => 'Rule', 'regex' => '( Rule\[)([, \d]+)(\])', 'func' => 'text_regex_rule'),
);
I noticed a few things missing from the 'standard' like Site[], Location[], Thold[], Alert[] for example. These would all be captured under metadata. At least that is where my head is at.
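For illustration, a rough sketch of how those entries could be used to lift the legacy tags out of a string message and into the metadata array. The helper name extract_legacy_tags() is made up; only text_get_regex_array() and the regex table above come from the existing code:

```php
<?php
// Hypothetical helper: walk the regex table and collect any matching legacy tags.
function extract_legacy_tags($text, array $regex_array) {
	$metadata = array();

	foreach ($regex_array as $entry) {
		if (preg_match('/' . $entry['regex'] . '/', $text, $matches)) {
			// $matches[2] holds the id list captured by ([, \d]+)
			$ids = array_map('intval', explode(',', $matches[2]));
			$metadata[strtolower($entry['name'])] = $ids;
		}
	}

	return $metadata;
}

// Example: "WARNING: Device[12] DS[3, 4] reported no data" would yield
// array('device' => array(12), 'ds' => array(3, 4)).
```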
If we follow that structure, we can provide a human readable message in the cacti.log file, as well as JSON to any of the feed types.
@Linegod please jump in. Is this kind of where your head was too?
@TheWitness - yes, that's pretty much what I was thinking.
A little bit of digging led me to this though: https://github.com/Seldaek/monolog
Most of the heavy lifting could be accomplished just by using monolog
Oh, look 'composer'. Last time, that did not go very well. No worries.
There are 4 other vendor libraries using composer, so I didn't think it would be an issue.
./include/vendor/cldr-to-gettext-plural-rules/composer.json
./include/vendor/gettext/composer.json
./include/vendor/phpmailer/composer.json
./include/vendor/flag-icons/composer.json
And monolog is PSR-3 compliant, so it adheres to the PHP logging standards. And the PHP logging standards are RFC5424 compliant, so they adhere to the syslog standard.
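If we went down that path, the basic wiring would look roughly like this. The channel name, file paths and the vendor autoload location are assumptions for the example; the Monolog classes used (Logger, StreamHandler, SyslogHandler, JsonFormatter) are the library's standard ones:

```php
<?php
require_once(__DIR__ . '/include/vendor/autoload.php'); // autoload location is an assumption

use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Monolog\Handler\SyslogHandler;
use Monolog\Formatter\JsonFormatter;

$log = new Logger('cacti');

// Human-readable local file, roughly what cacti.log does today.
$log->pushHandler(new StreamHandler('/var/log/cacti/cacti.log', Logger::INFO));

// Additional JSON stream that Elastic/Kafka shippers could consume.
$json = new StreamHandler('/var/log/cacti/cacti.log.json', Logger::INFO);
$json->setFormatter(new JsonFormatter());
$log->pushHandler($json);

// Optional local syslog output.
$log->pushHandler(new SyslogHandler('cacti'));

// PSR-3 style call with structured context.
$log->warning('Device went away', array('host_id' => 42, 'poller_id' => 1));
```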
Hmm, thinking again about Larry's idea of the hostname parameter, I come to a different conclusion ...
'hostname' => either Cacti or the device I'm open,
I believe to match the standard, it should rather be the name of the originating server that generates the log line. This would then actually be the cacti server's hostname (either the main one or one of the remote ones), and the device name and ID should better be reflected in the metadata part only.
It's similar to, for example, firewall logs, where we have the FW itself writing the log line, which is also reflected as the hostname part in the 5424 syslog. But it might also contain the hostname of a device that connects across it, and that is logged in the (more or less) structured part of the log message.
And this metadata part: 4 => array('name' => 'Poller', 'regex' => '( Poller\[)([, \d]+)(\])', 'func' => 'text_regex_poller'), would come into play, if the main poller registers something going on with the communication towards a remote poller for example. Or vice versa.
BTW I think we should start with some meta-code, getting all this structure in order before we fill it with life, providing the actual code that processes everything.
Comments?
Comment on the timestamp: if possible, at least internally we should go for milliseconds, as it is much more precise and therefore makes it easier to reconstruct the sequence, if for example the logs are sent through a syslog channel across a load-balancing device, which causes re-ordering of messages due to differences in internal processing time. I am thinking here about a clustered logging solution like Elastic, which can have a load-balanced Logstash or Beats phalanx in front. All logs arrive in sequence at the LB, but it is not guaranteed that the time a packet needs from the kernel towards the application always has identical delays. Scheduling is on a 1000 ticks/s time scale, so we can already have at least 1ms difference if another task was processed at that time, and that happens constantly.
If we can rely on a precise originating timestamp, where logs are produced in a given logical sequence, this helps with the re-ordering in such cases.
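For reference, a small sketch of that idea in PHP (keep an integer millisecond epoch internally, only format it when a destination needs it); the variable names are arbitrary:

```php
<?php
// Internal representation: integer milliseconds since epoch.
$timestamp_ms = (int) round(microtime(true) * 1000);

// Rendered only on output, e.g. the RFC 5424 / 3339 style UTC timestamp:
$formatted = gmdate('Y-m-d\TH:i:s', intdiv($timestamp_ms, 1000))
	. sprintf('.%03dZ', $timestamp_ms % 1000);
// e.g. "2003-10-11T22:14:15.003Z"
```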
Regarding Graphs and Users:
I think we can always handle those as an array. Either it contains one element or it contains more; having it as an array in all cases makes the handling easier.
So we can say this:
7 => array('name' => 'Graph', 'regex' => '( Graph\[)([, \d]+)(\])', 'func' => 'text_regex_graphs'),
8 => array('name' => 'Graphs', 'regex' => '( Graphs\[)([, \d]+)(\])', 'func' => 'text_regex_graphs'),
would become this single line:
<n> => array('name' => 'Graphs', 'regex' => '( Graphs?\[)([, \d]+)(\])', 'func' => 'text_regex_graphs'),
Same here:
9 => array('name' => 'User', 'regex' => '( User\[)([, \d]+)(\])', 'func' => 'text_regex_users'),
10 => array('name' => 'Users', 'regex' => '( Users\[)([, \d]+)(\])', 'func' => 'text_regex_users'),
^ you forgot the "s" there, I added it
<n> => array('name' => 'Users', 'regex' => '( Users?\[)([, \d]+)(\])', 'func' => 'text_regex_users'),
(adapt the numbers in front as necessary)
Another comment or question on this. Should we not log to a table first and then have a proxy forward it, so as to not impact Cacti if the log receiver is down?
Or better yet, just let our syslogd do the forwarding for us. We just have to write the JSON format to an alternate location and let it do all the work. For those on Windows, I guess... well, do I care? Maybe? I don't know...
Logging into a DB in principle has the same constraint: if the target DB is down, it fails. But in the case of Cacti, which relies heavily on the DB anyway, this could be a viable approach. And if the logging end point is that critical, it would be load balanced anyway. In the case of ELK for example, you could configure multiple end points that are used in a round-robin manner, or if one end point does not react, we simply switch to the next instance (this can be implemented in the logging plug-in).
This brings me to something that we need to consider: If the Cacti instance is in debugging mode, you would flood the logging end-point. Or we simply drop the "debug" level logs by default and let the user decide if they want to include debug logs in the end point. To point out that debugging is active, we could drop a warning once a minute: "Debugging is activated on poller ... for devices=... scripts=..."
And if we let syslog do the work, then a) the user is forced to configure the connectivity on a TCP basis, otherwise log reception is not guaranteed as UDP is stateless, and b) we would introduce an intermediate log format (text-based JSON or key/value format with a prepended syslog header) that needs to be parsed again at the actual logging endpoint.
Since caching on failure is generally a very wise idea, I suggest keeping it simple: if we detect that a configured receiver is down, then let's just put the plain JSON objects into a local file. This file can be replayed once the remote server is back again.
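A minimal sketch of that spool-and-replay idea; ship_or_spool(), replay_spool() and the callable $send are made up for illustration:

```php
<?php
// Hypothetical: try to ship the item, fall back to a local JSON spool file.
function ship_or_spool(array $item, $send, $spool_file) {
	try {
		call_user_func($send, $item); // e.g. HTTP POST to Elastic, Kafka produce, ...
	} catch (Exception $e) {
		file_put_contents($spool_file, json_encode($item) . "\n", FILE_APPEND | LOCK_EX);
	}
}

// Replay the spool once the receiver is reachable again.
function replay_spool($spool_file, $send) {
	if (!file_exists($spool_file)) {
		return;
	}

	foreach (file($spool_file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
		call_user_func($send, json_decode($line, true));
	}

	unlink($spool_file);
}
```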
We should handle this like our syslog integration. Be able to exclude by string match or level.
One thing at a time though. Packaging this week.
Okay, I'm winding down on projects. So, please let me know what you think.
- To keep it simple, create a JSON log path that gets rotated just like the other logs, for example: cacti.log.json
- All times in MS (millisecond) format using the RFC layout, for example: 2003-10-11T22:14:15.003Z.
- Existing string logging will be converted to JSON as we discussed.
- cacti_log() will take mixed input, either an array or a string. An array will be converted to JSON in the JSON logging, and flattened to our current standard of XXX[] in the legacy (visual) logging (see the sketch after this list).
- We drop the forwarding aspects from scope and defer them to, say, rsyslog, which handles this perfectly today. Windows users will have to figure things out, I guess.
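A rough sketch of that flattening step, under the $message structure discussed earlier; the helper names are placeholders and the legacy layout shown is simplified rather than the exact cacti.log format:

```php
<?php
// Hypothetical: render one structured item both as a JSON line and as legacy text.
function format_json_line(array $item) {
	return json_encode($item) . "\n";
}

function format_legacy_line(array $item) {
	// Flatten metadata back into the familiar XXX[id] tags, e.g. "Device[42] DS[3]".
	$tags = '';
	if (isset($item['metadata']) && is_array($item['metadata'])) {
		foreach ($item['metadata'] as $key => $value) {
			$ids   = is_array($value) ? implode(', ', $value) : $value;
			$tags .= ' ' . ucfirst($key) . '[' . $ids . ']';
		}
	}

	return date('Y-m-d H:i:s') . ' - '
		. (isset($item['app']) ? $item['app'] : 'CMDPHP') . ' '
		. $item['message'] . $tags . "\n";
}
```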
If we do this, I can have it finished (first pass), in just a few hours.
@bernisys @Linegod @netniV @xmacan @browniebraun @seanmancini, please chime in.
Sounds like a plan
@Linegod, RFC5424 for structured data is not quite JSON. Are we okay just running with JSON though, with the Zulu timestamp?
@TheWitness: Sounds okay so far, I am just not too comfortable with this part: "cacti_log() will take mixed input either array or string"
I would create a new logging function (func name TBD) that takes only JSON-like objects (hash arrays internally), then point the old function to the new one, and in the code replace the logging calls everywhere, reviewing the calls and their contents at the same time.
This offers higher processing speed in the core, because we don't need to distinguish what kind of object we get when the function is called.
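To make that concrete, a tiny sketch of the split being proposed here. cacti_log_structured() is a placeholder name, the wrapper keeps roughly the existing cacti_log() signature, and how defaults and plug-in dispatch actually work is of course still open:

```php
<?php
// Placeholder name: the new core function accepts structured items only.
function cacti_log_structured(array $item) {
	$item += array(
		'timestamp' => (int) round(microtime(true) * 1000),
		'level'     => 'INFO',
		'metadata'  => array(),
	);

	// ... hand $item to the registered output plug-ins here ...
}

// During the transition, the old cacti_log() becomes a thin wrapper:
// string input is wrapped (and could be run through the legacy-tag regexes)
// before being handed to the structured function.
function cacti_log($string, $output = false, $environ = 'CMDPHP', $level = '') {
	cacti_log_structured(array(
		'message' => $string,
		'app'     => $environ,
		'level'   => ($level != '' ? $level : 'INFO'),
	));
}
```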
Regarding the timestamp: that (as well as all further parts of the given data) would be formatted only when the log is actually written to a destination, so internally we can keep a simple number-based time format, maybe epoch + millis or something like that. Whatever PHP offers and whatever is best to process without too much overhead.
And I believe the current cacti.log can (should) be kept, too, as it is more human-readable than JSON. It's just another internal output plug-in, and the formatting of the lines is necessary anyway when we want to write the lines to syslog; it's a rather similar format. So why not "re-use" that and write the good old cacti.log that the users already know. Avoids some confusion, I would say.
Though one thing I would change here: all files should be timestamped directly, and not rotated. That imposes basically no overhead when writing the log, we gain simpler internal handling at rotation time (nothing needed but a new filename with the new date-stamp once the date changes, no extra file renames), and the internal log view in the GUI can simply scan all (or the latest 2) timestamped files if needed.
Comments please :) Better to talk a bit more first and do it "a bit more right" in the end.