
Standardized protocol

Open foxycode opened this issue 4 years ago • 18 comments

  • It would be nice to have a standardized and, above all, documented protocol.
  • Sending JSON instead of raw data wouldn't hurt either. If New Relic can do it, why not you?
  • Things like disk health should be evaluated in the script, not on your backend, so anyone can write their own implementation.

foxycode avatar Mar 07 '20 17:03 foxycode

I made a mistake: base64 was missing after a system update, and that was why the script wasn't working, sorry. Still, it would be nice to have a standardized protocol.

foxycode avatar Mar 07 '20 17:03 foxycode

Hello,

I can confirm that the back-end code for each version does not change when newer agent versions are released, meaning that all of our older agents are still fully compatible and working.

We'll work on a standardized protocol for future agent versions.

Thanks for the feedback.

hetrixtools avatar Mar 09 '20 09:03 hetrixtools

Can we get a documented API? I'd like to extend this outside of Linux as well. It was trivial to make it work with Alpine, but some of this is too Linux-specific to support the BSDs. I'd prefer to wait for something documented before writing a compatible posting tool.

Thanks!

sholwe avatar Aug 09 '20 19:08 sholwe

I'd like a documented API too. Right now, some features like RAID health are parsed and processed on the backend, which isn't ideal.

foxycode avatar Aug 09 '20 21:08 foxycode

Hello,

We'll be working on a standardized protocol for our agent in a future release, along with documentation for it.

Thank you for the feedback.

hetrixtools avatar Aug 10 '20 08:08 hetrixtools

Do you have any release date?

foxycode avatar Aug 10 '20 09:08 foxycode

Do you have any release date?

Unfortunately not at this time.

hetrixtools avatar Aug 10 '20 09:08 hetrixtools

Most of this was done by hand, since I can't always use Linuxisms. This works for version 1.59; I've implemented (most of) it for OpenBSD.

POSTDATA="v=$VERSION&s=$SID&d=$OS|$Uptime|$CPUModel|$CPUSpeed|$CPUCores|$CPU|$IOW|$RAMSize|$RAM|$SwapSize|$Swap|$DISKs|$NICS|$ServiceStatusString|$RAID|$DH|$RPS1|$RPS2|$IOPS|$CONN|$DISKi"

v= current version string - 1.59 (may be decimal 2 precision)
s= Local system string hash (Site ID)
d= String [see below, all terminated with pipes]
OS (b)= String - Shortname or $(uname -s)$(uname -r)"|"$(uname -r)"|"RequiresReboot INT (1 true or 0)
Uptime = seconds since boot
CPUModel (b) = string
CPUSpeed (b) = speed of CPU (int)
CPUCores= int number of cores
CPU = Average of CPUSpeed for post period
IOW = IOWait decimal 2 precision
RAMSize = Complete RAM size (MB)
RAM = used RAM (MB) in percentage
SwapSize = Total (MB)
Swap = Used (MB) in percentage 
NICS (gb) = (array) "|"interface";"inbytes";"outbytes";""|"interface";"inbytes";"outbytes";'... 
DISKs (gb) =  (array) mount point, totalsize (bytes), available(bytes)
RAID (gb) = {{have not implemented}}
DH (gb) = {{have not implemented}} (array) {lsblk name"|{smartctl -H}|"...}
RPS1 = unimplemented
RPS2 = unimplemented
IOPS (gb) =  {{have not implemented}}
CONN (b) = (array) "PortNumber"|"NumberOfConnectionsToPort";"
DISKi (gb) = (array) mountpoint, total inodes, used inodes, available inodes";"

Fields marked (g) are compressed with gzip -cf before posting; fields marked (b) are base64 encoded and then passed through base64prep() (in the script).
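
The field layout above can be sketched as a small shell helper. This is only an illustration under assumptions: join_fields and build_postdata are hypothetical names, the values are placeholders, and the field order simply follows the POSTDATA line as reverse engineered for 1.59. The encoding of individual fields is left out here.

```shell
#!/bin/sh
# Hypothetical helper: join all arguments with "|" in the order given.
# Empty arguments are kept, so unimplemented fields stay as empty slots.
join_fields() {
    out=""
    first=1
    for f in "$@"; do
        if [ "$first" -eq 1 ]; then
            out="$f"
            first=0
        else
            out="$out|$f"
        fi
    done
    printf '%s' "$out"
}

# Hypothetical helper: $1 = version, $2 = SID, remaining arguments are the
# already-encoded fields of d= in the order shown in the POSTDATA line.
build_postdata() {
    version="$1"
    sid="$2"
    shift 2
    printf 'v=%s&s=%s&d=%s' "$version" "$sid" "$(join_fields "$@")"
}

# Placeholder values only; a real agent would pass all fields of d=.
build_postdata "1.59" "examplesid" "OpenBSD" "3600" "" "2400" "4"
echo
```

Keeping the joining step separate from data collection makes it easy to leave a field empty on platforms where it cannot be gathered yet.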

Yes, this is really brief, and an enormous mess. The biggest issue I ran into is with their "base64prep" function, which is nonstandard as well - it just changes things so they post without being escaped by the webservice. "+" is converted to "%2B" and "/" is rewritten to "%2F" - kind of a mini htmlspecialchars().
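
A minimal sketch of that encoding step, written from the description in this thread rather than copied from the agent. It assumes the second substitution targets the forward slash, since "%2F" is the percent-encoding of "/", and that a base64 command is installed. The names escape_for_post, encode_field and encode_gz_field are hypothetical.

```shell
#!/bin/sh
# Hypothetical stdin filter: percent-encode the two base64 characters
# that would otherwise be altered in a form-encoded POST body.
escape_for_post() {
    sed -e 's/+/%2B/g' -e 's|/|%2F|g'
}

# Fields marked (b): base64 encode, drop line wrapping, then escape.
encode_field() {
    printf '%s' "$1" | base64 | tr -d '\n' | escape_for_post
}

# Fields marked (gb): gzip first, then the same base64 and escape steps.
encode_gz_field() {
    printf '%s' "$1" | gzip -cf | base64 | tr -d '\n' | escape_for_post
}

encode_field '>>>'   # prints Pj4%2B
echo
encode_field '???'   # prints Pz8%2F
echo
```

The "=" padding is left untouched here because only "+" and "/" are mentioned above; whether the backend expects anything else to be escaped is not documented.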

The way the script gets average network data is one of the most bizarre things I've ever seen to date. It makes an array and loops several times to increment over the period of time that it expects to run (roughly a minute). Since I can rely on getting pretty normalized data over a period of time, I take a snapshot when it first runs, then count the bytes sent/received before I have the script echo roughly 52 seconds later. Still a cheat, but accurate enough for a 0.01 release.
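
The snapshot approach can be sketched roughly like this. It is an assumption-laden illustration: avg_rate and nic_entry are hypothetical names, the counter readings are made up, reporting a per-second average is a guess at what the backend expects, and counter wraparound is not handled.

```shell
#!/bin/sh
# avg_rate START END SECONDS
# Integer average per second between two byte-counter readings.
avg_rate() {
    echo $(( ($2 - $1) / $3 ))
}

# nic_entry IFACE IN OUT
# One element of the NICS array in the "|interface;inbytes;outbytes;" form.
nic_entry() {
    printf '|%s;%s;%s;' "$1" "$2" "$3"
}

# Made-up readings taken 52 seconds apart; a real script would read the
# counters from netstat or a similar tool at start and again before posting.
in_start=1000
in_end=53000
out_start=2000
out_end=28000

nic_entry "em0" \
    "$(avg_rate "$in_start" "$in_end" 52)" \
    "$(avg_rate "$out_start" "$out_end" 52)"
echo
```

Two readings and one subtraction replace the per-second sampling loop, at the cost of hiding short bursts within the interval.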

sholwe avatar Aug 17 '20 01:08 sholwe

@sholwe thank you for putting in the time to write all of this down.

We know that the agent data aggregation is quite messy at this time; the person who coded it did not do it justice. However, the collected stats are on par with many other tested tools.

We'll work on a standardized protocol, along with more code cleanup/optimization, in the next major agent release version.

Thanks again for your time and effort.

hetrixtools avatar Aug 17 '20 09:08 hetrixtools

Hi @hetrixtools -

As @foxycode has stated, your service seems to take much of this raw data and decide what to do with it when it's parsed on your end. That means we'll need to adapt any specific information for the RAID, etc, and hope that it's handled correctly. Can we get a basic post system for you to store and aggregate without whatever logic is being used there?

Thanks - when I clean it up, I'll submit my OBSD code to you; I haven't got a FreeBSD box at the moment, but since it's primarily sysctl/netstat based, shouldn't take much effort.

sholwe avatar Aug 17 '20 14:08 sholwe

Since it's relevant, I'll add a link to my SmartOS/Solaris fork: https://github.com/sunfoxcz/hetrixtools-agent-smartos/tree/smartos

foxycode avatar Aug 17 '20 15:08 foxycode

Yaay! I miss Solaris. 2.6 5/98 will forever be in my heartworms. Here's an OpenBSD "functional" version.

https://github.com/sholwe/hetrixtools-agent-openbsd

sholwe avatar Aug 17 '20 23:08 sholwe

@hetrixtools Maybe adding links to the forks in the repository README would be nice?

foxycode avatar Aug 17 '20 23:08 foxycode

@foxycode added.

Thank you everyone for your contributions.

hetrixtools avatar Aug 18 '20 09:08 hetrixtools

@hetrixtools Any progress with the standardized protocol? My agent implementation stopped showing SMART status after upgrading to the latest SmartOS version, and once again I have no idea why and can't debug it.

foxycode avatar Feb 17 '22 01:02 foxycode

@foxycode It's going to be here-

if [ "$CheckDriveHealth" -gt 0 ]
then
    if [ -x "$(command -v smartctl)" ] #Using S.M.A.R.T. (for regular HDD/SSD)
    then
        for i in $(diskinfo -cH | grep -v "??R" | awk '{ print $2 }')
        do
            DHealth=$(smartctl -A /dev/rdsk/$i)
            if grep -q 'Attribute' <<< $DHealth
            then
                DHealth=$(smartctl -H /dev/rdsk/$i)"\n$DHealth"
                DH="$DH|1\n$i\n$DHealth\n"
            fi
        done
    fi
    if [ -x "$(command -v nvme)" ] #Using nvme-cli (for NVMe)
    then
        for i in $(lsblk -l | grep 'disk' | awk '{ print $1 }')
        do
            DHealth=$(nvme smart-log /dev/$i)
            if grep -q 'NVME' <<< $DHealth
            then
                if [ -x "$(command -v smartctl)" ]
                then
                    DHealth=$(smartctl -H /dev/${i%??})"\n$DHealth"
                fi
                DH="$DH|2\n$i\n$DHealth\n"
            fi
        done
    fi
fi

I'm afraid I haven't touched SmartOS in ages. Check to see if smartctl has been deprecated or the format has changed for the output. You can still use my above reverse engineered POST data to roll your own.

sholwe avatar Feb 20 '22 20:02 sholwe

@sholwe I already fixed it, but the problem is that the smartctl output is analyzed on the HetrixTools side, which is a bad concept. No one can implement their own disk check. If you don't have a proper smartctl on your machine, you are out of luck.
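
For illustration, evaluating disk health in the script itself could look roughly like the sketch below. smart_verdict is a hypothetical name, and the matched strings are assumptions about the usual `smartctl -H` output for ATA and SCSI devices; other tools or versions may print something different. The current backend does not accept such a verdict, so this only shows what a documented protocol could allow.

```shell
#!/bin/sh
# Hypothetical helper: read a health report on stdin and print
# "ok", "fail" or "unknown". The matched strings are assumptions
# about typical smartctl -H output.
smart_verdict() {
    report=$(cat)
    case "$report" in
        *"test result: PASSED"*|*"SMART Health Status: OK"*)
            echo ok ;;
        *"test result: FAILED"*)
            echo fail ;;
        *)
            echo unknown ;;
    esac
}

# Example with a canned report line instead of a real smartctl call.
printf 'SMART overall-health self-assessment test result: PASSED\n' | smart_verdict
```

With a verdict computed locally, a platform without smartctl could feed any other health source into the same three states.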

foxycode avatar Feb 20 '22 20:02 foxycode

Yikes. I noticed they were doing this with other data back for the 1.59 release. I saw you were based on 1.58, but wasn't sure what might have been changed.

sholwe avatar Feb 20 '22 20:02 sholwe