signalfx-agent

send events from custom scripts

Open xp-1000 opened this issue 3 years ago • 11 comments

Hello !

There are many ways to run custom scripts from the agent to retrieve metrics which could not be covered by one of the existing monitors:

  • https://docs.signalfx.com/en/latest/integrations/agent/monitors/python-monitor.html
  • https://docs.signalfx.com/en/latest/integrations/agent/monitors/collectd-python.html
  • https://github.com/influxdata/telegraf/tree/master/plugins/inputs/exec
  • https://collectd.org/wiki/index.php/Plugin:Exec

Obviously, we can also run any script from a simple crontab and send metrics through a supported standard such as StatsD: https://docs.signalfx.com/en/latest/integrations/integrations-reference/integrations.statsd.html

First, could you advise on the best way to integrate a custom script into the SignalFx agent? python-monitor is probably the historically recommended approach, but given that it will be deprecated, it is probably not the most "future proof" choice.

But my question is about collecting events in addition to metrics. I checked the samples (https://github.com/signalfx/signalfx-agent/tree/master/python/sample) and some of the existing collectd-based monitors, but I did not see any example of an event sent from a Python script.

I am not sure it is even possible, except by configuring an org token in the monitor config and using it in the script to call the API directly, which is a little crappy, as you can understand.

Can you provide guidance on sending events from custom scripts? I am open to any language and any monitor/plugin.

xp-1000 avatar Mar 12 '21 14:03 xp-1000

Browsing the codebase, it seems the agent is already able to catch collectd events: https://github.com/signalfx/signalfx-agent/blob/master/pkg/monitors/collectd/collectd.go#L416

That said, I am not sure whether, or how, we can send custom events from scripts.

xp-1000 avatar Mar 12 '21 14:03 xp-1000

Sadly, the Python monitor output module does not seem to provide a built-in function for events: https://github.com/signalfx/signalfx-agent/blob/master/python/sfxmonitor/output.py#L11

xp-1000 avatar Mar 12 '21 14:03 xp-1000

We have been using telegraf exec on our side to do this:

command: 'powershell "C:/ProgramData/SignalFxAgent/checkversion.ps1"'
intervalSeconds: 600
telegrafParser:
  dataFormat: value
  dataType: string

Within the PowerShell script, we reference the agent token, which is set via a config file, to determine the token to use for sending or retrieving an event.

signalFxAccessToken: {"#from": '\ProgramData\SignalFxAgent\config\signalFxAccessToken'}

mcmiv413 avatar Mar 12 '21 14:03 mcmiv413

Thanks @mcmiv413 for your prompt answer.

Should I understand that any "string" data type from telegraf is handled as an event by SignalFx?

xp-1000 avatar Mar 12 '21 14:03 xp-1000

It was a setting I had to enter because I am technically not sending anything from the telegraf exec "monitor"; it just kicks off a script which does all of the work. Within the PowerShell script:

-UseBasicParsing was a trick I had to find to make it work.

And then of course -Proxy if you need a proxy (which I think, from a previous issue, you do).

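# $defaulttoken is assumed to hold the access token read from the agent config file.
$pubauthheader = @{
    "Content-Type" = 'application/json'
    "X-SF-TOKEN"   = "$defaulttoken"
}
# Quotes inside a here-string are literal, so they need no backtick escaping.
$body = @"
[
    {
        "eventType": "someevent.name",
        "properties": {
            "key": "value",
            "key2": "value2"
        }
    }
]
"@
# -Headers is spelled out because -H can be ambiguous in newer PowerShell versions.
Invoke-WebRequest -Headers $pubauthheader -Method POST -Uri "https://ingest.us1.signalfx.com/v2/event" -Body $body -UseBasicParsing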
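
For scripts written in Python rather than PowerShell, the same call can be sketched with only the standard library. This is a minimal sketch, not an agent feature: the endpoint and payload shape are copied from the PowerShell example above, while the function names build_event_request and send_event are hypothetical, and the token is assumed to reach the script by whatever means you already use (for example, read from the agent config file).

```python
import json
import urllib.request

# Endpoint taken from the PowerShell example above; the realm (us1) depends on your org.
INGEST_URL = "https://ingest.us1.signalfx.com/v2/event"


def build_event_request(token, event_type, properties, url=INGEST_URL):
    """Build (but do not send) the POST request carrying one custom event."""
    body = json.dumps([{"eventType": event_type, "properties": properties}])
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json", "X-SF-TOKEN": token},
        method="POST",
    )


def send_event(token, event_type, properties):
    """Send the event; urlopen raises urllib.error.HTTPError on a 4xx/5xx reply."""
    request = build_event_request(token, event_type, properties)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping the request construction separate from the network call makes the payload easy to check without contacting the ingest endpoint.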

mcmiv413 avatar Mar 12 '21 14:03 mcmiv413

@xp-1000 First question I have is: what kinds of things do you want to send events for? Keep in mind that SignalFx events are not logs; they are meant to signal changes in the environment, such as deployments, shutdowns, high-level state changes, etc.

Also the python-monitor monitor is going to be ported to the OTEL Collector for easier transition. So we could extend it to do events as well. It is about the most "future-proof" thing right now, apart from using the API directly from a script, but as you said that is somewhat hacky.

I'm not entirely sure about that telegraf/exec trick, but if it works for you let me know.

keitwb avatar Mar 12 '21 14:03 keitwb

We're actually using that telegraf exec trick to get signalfx to update/upgrade itself, as well as dynamically changing its own config :-D

mcmiv413 avatar Mar 12 '21 14:03 mcmiv413

@mcmiv413 OK, so what I understand is that you call the API from your custom script, which is the hacky way I mentioned in the issue description. I will keep this as a last resort. Retrieving the token from the config file is a good idea, thanks.

@keitwb It is a relevant question, and I confess I am not sure that what I plan to do is acceptable ^^ What I have in mind is similar to what I did here: https://github.com/signalfx/signalfx-agent/blob/master/pkg/monitors/nagios/nagios.go#L119

In this particular case I want to translate a Python Datadog agent script which requests a REST API, retrieves the JSON, checks multiple fields and their values, and reports the result as metrics. Originally this script reports:

  • a gauge for the response time of the API: no problem, this is a 1:1 mapping in SignalFx
  • a service check for each "child" in the JSON to indicate whether the service is OK

This service check is difficult to translate to SignalFx. I can simply report a gauge with a 0/1 value, but in case of KO I will lose its reason (currently reported as a string on the service check).

My workaround was to report the failure reason as an event in addition to the gauge. This way we can understand the problem from the SignalFx web UI (as with Datadog before) without having to read the logs of the agent/script and manually correlate them with the timestamp of the alert.

Initially I wanted to use a "property" to store this information, but it seems it is not possible to update a property from an agent monitor (or custom script).
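
The workaround described above (a 0/1 gauge per service, plus an event carrying the failure reason) could be sketched as follows. Everything specific here is an assumption for illustration: the input layout ({"name": {"status": ..., "reason": ...}}), the metric name service.health, the event type service.health.failed, and the function name translate_health are all hypothetical, and the sketch only builds payload dictionaries without sending anything.

```python
def translate_health(services):
    """Turn per-service statuses into gauge datapoints and failure events.

    `services` is assumed to look like:
        {"db": {"status": "OK"}, "cache": {"status": "KO", "reason": "timeout"}}
    """
    gauges, events = [], []
    for name, info in sorted(services.items()):
        ok = info.get("status") == "OK"
        # One gauge per service: 1 when OK, 0 when KO.
        gauges.append({
            "metric": "service.health",  # hypothetical metric name
            "value": 1 if ok else 0,
            "dimensions": {"service": name},
        })
        if not ok:
            # The event carries the context that the gauge alone would lose.
            events.append({
                "eventType": "service.health.failed",  # hypothetical event type
                "dimensions": {"service": name},
                "properties": {"reason": info.get("reason", "unknown")},
            })
    return gauges, events
```

Sharing the same service dimension between the gauge and the event is what would let the two be correlated when investigating an alert.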

xp-1000 avatar Mar 12 '21 15:03 xp-1000

👀 which python datadog agent script is this, what you propose here sounds very interesting

mcmiv413 avatar Mar 12 '21 15:03 mcmiv413

@mcmiv413 Sadly, it is a specific script we created for a customer, so I cannot share the code here.

In many ways it is similar to the official http_check: https://github.com/DataDog/integrations-core/blob/master/http_check/datadog_checks/http_check/http_check.py. It uses requests to fetch the JSON from a URL and time to measure how long the request takes. The rest is simple JSON parsing and conditions to set the status of each service (the children in the JSON), and finally it sends a Datadog agent service check (OK/CRITICAL) with a context message for each one, depending on the JSON value (OK/KO).

This is a fully custom "health" API that provides the status of all services in a single JSON document. In general, I would use the http monitor to check a URL, or https://collectd.org/wiki/index.php/Plugin:cURL-JSON to retrieve values from the JSON and check them in SignalFx, but:

  • the values returned in the JSON are strings (OK, KO), not numbers, so collectd cURL-JSON cannot work here
  • it would require duplicating the request to the API (one for http to get web metrics like status code and response time, and another one to retrieve the JSON values)
  • we would still lose the "fail cause" / context of the metric compared to the Datadog agent check (if KO = 0, a detector could raise an alert if != 0, but we will not know what the problem is)

Only a property or an event would allow us to carry the context along with the metric. To my knowledge, properties are not available from the agent (but @keitwb can correct me if I am wrong), so events seem to be the right way. We can send events from monitors, but I cannot make a PR on this agent for such a custom "check".

So here I am: I was hoping to find a "native" way to send events from a custom script run by the agent :p

xp-1000 avatar Mar 12 '21 19:03 xp-1000

Yes, properties would be a better way to do it than events but even then it is kind of pushing the boundaries of what they are intended for.

I would recommend using an "enum" type gauge metric where a certain number indicates success (e.g. 1), and other numbers (e.g. negative numbers) indicate various failure modes. Then in your charts/detectors you document the values so it will be obvious what they mean. This assumes of course that you know what the possible failure states are. Even if you don't know all of the failure modes, you can enumerate the ones that you do know and then have an "other" value.
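
As a rough illustration of that "enum" gauge idea, a script could map each known failure reason to a fixed number and fall back to an "other" value. The specific states, numbers, and reason strings below are hypothetical; only the convention (1 for success, negative values for failures, one catch-all) comes from the suggestion above.

```python
from enum import IntEnum


class HealthState(IntEnum):
    """Gauge values to document in charts and detectors (hypothetical set)."""
    OK = 1
    OTHER = -1          # catch-all for failure modes not enumerated below
    TIMEOUT = -2
    BAD_RESPONSE = -3


# Hypothetical mapping from the reason string reported by the health API.
_REASON_TO_STATE = {
    "timeout": HealthState.TIMEOUT,
    "bad response": HealthState.BAD_RESPONSE,
}


def to_gauge_value(status, reason=None):
    """Return the integer to report as the gauge for one service."""
    if status == "OK":
        return int(HealthState.OK)
    key = (reason or "").strip().lower()
    return int(_REASON_TO_STATE.get(key, HealthState.OTHER))
```

A detector can then alert on any value other than 1, and the chart description can list what each negative value means.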

keitwb avatar Mar 17 '21 14:03 keitwb

Closing this issue as inactive. Please reopen if this issue is still occurring.

atoulme avatar Sep 22 '22 04:09 atoulme