signalfx-agent
send events from custom scripts
Hello!
There are many ways to run custom scripts from the agent to retrieve metrics that are not covered by one of the existing monitors:
- https://docs.signalfx.com/en/latest/integrations/agent/monitors/python-monitor.html
- https://docs.signalfx.com/en/latest/integrations/agent/monitors/collectd-python.html
- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/exec
- https://collectd.org/wiki/index.php/Plugin:Exec
Obviously, we can also run any script from a simple crontab and send metrics through a supported standard such as https://docs.signalfx.com/en/latest/integrations/integrations-reference/integrations.statsd.html
First, could you advise on the best way to integrate a custom script into the SignalFx agent? python-monitor is probably the historically recommended approach, but given that it will be deprecated, it is probably not the most "future-proof" choice.
But my main question is about collecting events in addition to metrics. I checked the samples (https://github.com/signalfx/signalfx-agent/tree/master/python/sample) and some of the existing collectd-based monitors, but I did not see any example of an event sent from a Python script.
I am not sure it is even possible, except by configuring an org token in the monitor config and using it in the script to call the API directly, which is a little crappy as you can understand.
Can you provide guidance on sending events from custom scripts? I am open to any language or any monitor/plugin.
Browsing the codebase, it seems the agent is already able to catch collectd events: https://github.com/signalfx/signalfx-agent/blob/master/pkg/monitors/collectd/collectd.go#L416
That said, I am not sure whether, or how, we can send custom events from scripts.
Sadly, the Python monitor does not seem to provide a built-in function for events: https://github.com/signalfx/signalfx-agent/blob/master/python/sfxmonitor/output.py#L11
We have been using telegraf exec on our side to do this:
command: 'powershell "C:/ProgramData/SignalFxAgent/checkversion.ps1"'
intervalSeconds: 600
telegrafParser:
  dataFormat: value
  dataType: string
Within the PowerShell script we reference the agent token, which is set via a config file, to determine the token to use for sending or retrieving an event.
signalFxAccessToken: {"#from": '\ProgramData\SignalFxAgent\config\signalFxAccessToken'}
Thanks @mcmiv413 for your prompt answer.
Should I understand that any "string" data type from telegraf is handled as an event by SignalFx?
It was a setting I had to enter because I am technically not sending anything from the telegraf exec "monitor"; it just kicks off a script which does all of the work. Within the PowerShell script:
-UseBasicParsing was a trick I had to find to make it work.
And then of course -Proxy if you need a proxy (which I think, from a previous issue, you do).
$pubauthheader = @{
    "Content-Type" = 'application/json'
    "X-SF-TOKEN" = "$defaulttoken"
}
$body = @"
[
  {
    `"eventType`": `"someevent.name`",
    `"properties`": {
      `"key`": `"value`",
      `"key2`": `"value2`"
    }
  }
]
"@
Invoke-WebRequest -Headers $pubauthheader -Method POST "https://ingest.us1.signalfx.com/v2/event" -Body $body -UseBasicParsing
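For anyone who would rather do the same thing from Python, here is a minimal sketch of the equivalent request using only the standard library. The function names build_event_request and send_event are made up for illustration, and the ingest URL is simply copied from the PowerShell call above (the us1 realm is an assumption and may differ for your org).

```python
import json
import urllib.request

# Assumption: same ingest endpoint as the PowerShell example; adjust the realm.
INGEST_URL = "https://ingest.us1.signalfx.com/v2/event"


def build_event_request(token, event_type, properties, url=INGEST_URL):
    """Build (but do not send) a POST request carrying one custom event."""
    # The endpoint expects a JSON array of event objects.
    body = json.dumps([{"eventType": event_type, "properties": properties}])
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json", "X-SF-TOKEN": token},
        method="POST",
    )


def send_event(token, event_type, properties, timeout=10):
    """Send the event and return the HTTP status code of the response."""
    req = build_event_request(token, event_type, properties)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Calling send_event(token, "someevent.name", {"key": "value"}) would perform the POST; the token can be read from the agent config file in the same way as in the PowerShell script.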
@xp-1000 First question I have is: what kinds of things do you want to send events for? Keep in mind that SignalFx events are not logs; they are meant to signal changes in the environment, such as deployments, shutdowns, high-level state changes, etc.
Also, the python-monitor monitor is going to be ported to the OTEL Collector for an easier transition, so we could extend it to do events as well. It is about the most "future-proof" thing right now, apart from using the API directly from a script, but as you said that is somewhat hacky.
I'm not entirely sure about that telegraf/exec trick, but if it works for you let me know.
We're actually using that telegraf exec trick to get signalfx to update/upgrade itself, as well as dynamically changing its own config :-D
@mcmiv413 OK, so what I understand is that you call the API from your custom script, which is the hacky way I mentioned in the issue description. I will keep this as a last resort. Retrieving the token from the config file is a good idea, thanks.
@keitwb It is a relevant question, and I confess I am not sure that what I plan to do is acceptable ^^ What I have in mind is similar to what I did here: https://github.com/signalfx/signalfx-agent/blob/master/pkg/monitors/nagios/nagios.go#L119
In this particular case I want to translate a Python Datadog agent script which requests a REST API, retrieves the JSON, checks multiple fields and their values, and reports the result as a metric. Originally this script reports:
- a gauge for the response time of the API: no problem, this is a 1:1 mapping in SignalFx
- a service check for each "child" in the JSON to indicate whether the service is OK
This service check is difficult to translate to SignalFx. I can simply report a gauge with a 0/1 value, but in case of KO I will lose its reason (currently reported as a string on the service check).
My workaround was to report the failure reason as an event in addition to the gauge; this way we can understand the problem from the SignalFx web UI (as with Datadog before) without having to read the agent/script logs and manually correlate them with the timestamp of the alert.
Initially I wanted to use a "property" to store this information, but it seems it is not possible to update a property from an agent monitor (or custom script).
👀 Which Python Datadog agent script is this? What you propose here sounds very interesting.
@mcmiv413 Sadly it is a specific script we created for a customer, so I cannot share the code here.
In many ways it is similar to the official http_check: https://github.com/DataDog/integrations-core/blob/master/http_check/datadog_checks/http_check/http_check.py
It uses requests to fetch the JSON from a URL and time to measure how long the request takes. The rest is simple JSON parsing and conditions to get the status of each service (the children in the JSON), and finally sending a Datadog agent service check (OK/CRITICAL) with a context message for each one, depending on the JSON value (OK/KO).
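Since the real script cannot be shared, here is a rough sketch of the logic just described. The JSON shape (a top-level "services" object mapping service names to "OK"/"KO") and the names timed and evaluate_health are assumptions for illustration only; the fetch is passed in as a callable so that requests, or anything else, can be plugged in.

```python
import time


def timed(fetch):
    """Call fetch() and return (result, elapsed_seconds)."""
    start = time.monotonic()
    result = fetch()
    return result, time.monotonic() - start


def evaluate_health(payload):
    """Map each child service to 1 if its status is "OK", else 0.

    Assumes a hypothetical payload such as
    {"services": {"db": "OK", "cache": "KO"}}.
    """
    services = payload.get("services", {})
    return {name: 1 if status == "OK" else 0 for name, status in services.items()}
```

With requests, the fetch callable could be something like lambda: requests.get(url, timeout=5).json(), where url is the health endpoint. The elapsed time then maps to the response-time gauge and each 0/1 value to a per-service gauge.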
This is a fully custom "health" API that provides the status of all services in a single JSON document. In general, I would use http to check a URL, or https://collectd.org/wiki/index.php/Plugin:cURL-JSON to retrieve values from the JSON and check them in SignalFx, but:
- the values returned in the JSON are strings (OK, KO), not numbers, so collectd cURL-JSON cannot work here
- it would require duplicating the request to the API (one for http to get web metrics like status code and response time, and another one to retrieve the JSON values)
- we would still lose the "fail cause" / context of the metric compared to the Datadog agent check (if KO = 0, a detector could raise an alert if != 0, but we would not know what the problem is)
Only a property or an event would allow us to carry the context along with the metric. To my knowledge, properties are not available from the agent (but @keitwb can correct me if I am wrong), so events seem to be the right way. We can send events from monitors, but I cannot make a PR on this agent for such a custom "check".
So here I am: I was hoping to find a "native" way to send events from a custom script run by the agent :p
Yes, properties would be a better way to do it than events but even then it is kind of pushing the boundaries of what they are intended for.
I would recommend using an "enum" type gauge metric where a certain number indicates success (e.g. 1), and other numbers (e.g. negative numbers) indicate various failure modes. Then in your charts/detectors you document the values so it will be obvious what they mean. This assumes of course that you know what the possible failure states are. Even if you don't know all of the failure modes, you can enumerate the ones that you do know and then have an "other" value.
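To make the "enum" gauge idea concrete, here is a small sketch in Python. The failure modes (timeout, bad_response) and the names ServiceState and gauge_value are invented for illustration; the actual states depend on what your health API can report.

```python
from enum import IntEnum


class ServiceState(IntEnum):
    """Gauge values: 1 means success, negative numbers are failure modes."""
    OK = 1
    OTHER = -1         # catch-all for failure reasons not enumerated below
    TIMEOUT = -2       # hypothetical failure mode
    BAD_RESPONSE = -3  # hypothetical failure mode


# Hypothetical mapping from the reason string reported by the check.
REASON_TO_STATE = {
    "timeout": ServiceState.TIMEOUT,
    "bad_response": ServiceState.BAD_RESPONSE,
}


def gauge_value(ok, reason=None):
    """Return the integer to report as the gauge for one service."""
    if ok:
        return int(ServiceState.OK)
    # Unknown or missing reasons fall back to the "other" value.
    return int(REASON_TO_STATE.get(reason, ServiceState.OTHER))
```

A detector can then alert on any value other than 1, and the chart description can list what each negative number means.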
Closing this issue as inactive. Please reopen if this issue is still occurring.