Influx-Capacitor service self-monitoring
In my monitoring system, I normally check the state of a service to ensure that it is running. When monitoring the Influx-Capacitor service, even if it is running we have no guarantee that it is actually collecting data. For example, if the connection to the database is unavailable:
Log Name:      Application
Source:        Tharga.Toolkit.Console
Date:          26/02/2016 10:06:30
Event ID:      0
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      ComputerName
Description:
Could not establish a connection to the database.
We could have an option to configure a URL where the service would send its current state at specific intervals, along with any error messages such as "database unavailable". It would work as heartbeat monitoring.
Something like this:
<Influx-Capacitor>
  <HealthCheck Type="Nagios|OpsMgr|Custom" Enabled="true" SecondsInterval="60" SendErrorMessages="true">
    <MachineName>MyComputer</MachineName>
    <Url>http://mymonitor.com/AgentStatus</Url>
  </HealthCheck>
</Influx-Capacitor>
You mean like a heartbeat?
I have also started to add log4net so that it would be possible to debug issues.
Yes, like a heartbeat. Log4net is useful for local debugging, but how do we identify that the service is failing to send data to the database? If the service is up but cannot access the database, we currently have no way to detect that error. Today we have to access the machine and search the event viewer.
What do you think about having the service stop itself when it cannot access the database, instead of adding separate monitoring? That would be easier to monitor because we would only need to check the service state. My concern is how to identify the problem when a machine cannot send data (we have a lot of machines sending data).
I think a heartbeat with information about the latest issues is a good idea. That would make it possible to monitor several machines in one place.
This would tie in nicely with #29. The heartbeat could be as simple as a timestamp with status, sent to a central database.
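For example, if that central database were InfluxDB (which Influx-Capacitor already writes metrics to), each heartbeat could be a single point in line protocol. The measurement, tag, and field names below are just placeholders, and omitting an explicit timestamp would let InfluxDB stamp the point on arrival:

heartbeat,host=MyComputer status="ok",error=""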
Some scenarios:
- If the database is down - don't worry about the agents, just fix the database ;)
- If the agent can't reach the database: Since the central database would hold the configuration for that agent as well, it would be easy to calculate when the next heartbeat should be sent. If the agent misses a heartbeat (+ some leeway), then you have an incident (see the sketch after this list).
- If some data is sent, but not all: Again, since the database contains the configuration, it is possible to see what is supposed to be sent. Even if everything looks OK at first glance because some data is arriving, the missing data could still be identified.
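A minimal sketch of that missed-heartbeat check, assuming the monitor keeps each agent's configured interval and last-seen heartbeat time; the names and data layout here are illustrative only, not part of Influx-Capacitor:

# Sketch: flag agents whose next heartbeat is overdue.
from datetime import datetime, timedelta

LEEWAY = timedelta(seconds=30)  # slack before an incident is raised

# agent name -> (configured heartbeat interval, last heartbeat seen)
agents = {
    "MyComputer": (timedelta(seconds=60), datetime(2016, 2, 26, 10, 6, 30)),
}

def missed_heartbeats(now):
    """Return the agents whose next heartbeat is overdue."""
    overdue = []
    for name, (interval, last_seen) in agents.items():
        if now > last_seen + interval + LEEWAY:
            overdue.append(name)
    return overdue

print(missed_heartbeats(datetime.utcnow()))

The same check could of course be expressed as a query against the central database instead of in application code; the point is only that the expected interval plus the last heartbeat is enough to raise an incident.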