FEATURE REQUEST - Make metrics accessible outside the CGI
Have you read through available documentation and open Github issues?
Yes
Is this a BUG report, FEATURE request, or a QUESTION? Who is the intended audience?
FEATURE request
System information
Your moosefs version and its origin (moosefs.com, packaged by distro, built from source, ...).
3.0.103
Operating system (distribution) and kernel version.
Mix of Debian stretch and buster.
I would like to have a way to extract the metrics that are currently displayed in the CGI and write them to statsd or Prometheus. That would make it much easier to set up proper alerting rules for various cluster conditions that I can currently only see in the CGI interface.
@xandrus is working on JSON output to mfscli - I hope that it will be enough.
That would be great. If I can run something --json > state.json, I can parse state.json and write the metrics I want to statsd, Prometheus, or wherever.
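As a rough sketch of that pipeline -- assuming a purely hypothetical JSON schema and field names, since mfscli has no JSON flag yet -- the parse-and-ship side could look like this in Python (8125 is statsd's default UDP port):

```python
import json
import socket

def json_to_statsd_lines(state):
    """Flatten a (hypothetical) mfscli JSON document into statsd gauge lines."""
    lines = []
    # These key names are made up for illustration; the real schema
    # will be whatever mfscli --json eventually emits.
    lines.append("moosefs.chunkservers.connected:%d|g" % state["chunkservers"]["connected"])
    lines.append("moosefs.chunks.missing:%d|g" % state["chunks"]["missing"])
    lines.append("moosefs.chunks.undergoal:%d|g" % state["chunks"]["undergoal"])
    return lines

def send_to_statsd(lines, host="localhost", port=8125):
    # statsd speaks a simple line protocol over UDP.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for line in lines:
        sock.sendto(line.encode("ascii"), (host, port))
    sock.close()

# Usage, once a JSON flag exists:
#   mfscli --json > state.json, then:
#   send_to_statsd(json_to_statsd_lines(json.load(open("state.json"))))
```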
Hi, @acid-maker, thanks for assigning this task to me. I started adding JSON output to the mfscli command a few weeks ago, but at the moment I am focused on a different task. I believe that this week I will be able to spend more time on it.
@unixorn I would like to ask which mfscli data sets you would like to have first in JSON output. I mean:
-S data set : defines data set to be displayed
-SIN : show full master info
-SIM : show only masters states
-SLI : show only licence info
-SIG : show only general master (leader) info
-SMU : show only master memory usage
-SIC : show only chunks info (goal/copies matrices)
-SIL : show only loop info (with messages)
-SMF : show only missing chunks/files
-SCS : show connected chunk servers
-SMB : show connected metadata backup servers
-SHD : show hdd data
-SEX : show exports
-SMS : show active mounts
-SRS : show resources (storage classes,open files,acquired locks)
-SSC : show storage classes
-SOF : show only open files
-SAL : show only acquired locks
-SMO : show operation counters
-SQU : show quota info
-SMC : show master charts data
-SCC : show chunkserver charts data
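Until the JSON output lands, these data sets can be scraped from Python by shelling out to mfscli. A sketch -- the keywords and field positions are assumptions taken from the awk one-liners later in this thread, and can shift between master versions:

```python
import subprocess

def chunk_counts(output):
    """Pick a few chunk counters out of mfscli -SIC -p style text output.

    The keywords and the sixth-field position are assumptions based on
    the awk pipelines elsewhere in this thread; check them against the
    output of your own mfscli version before relying on this.
    """
    counts = {}
    for line in output.splitlines():
        fields = line.split()
        for key in ("stable", "undergoal:", "endangered:"):
            if key in fields and len(fields) >= 6:
                counts[key.rstrip(":")] = int(fields[5])
    return counts

# Example wiring (requires a running MooseFS master):
# counts = chunk_counts(subprocess.check_output(["mfscli", "-SIC", "-p"], text=True))
```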
@xandrus, I think I need to come back to the abandoned project we started last year "after hours" ;) I think we can make decent use of your JSON output instead of parsing mfscli output as we used to do... What do you think? :)
And of course our work was inspired by Ceph Dashboard, there is no doubt about it.
+1
@xandrus, the first things I'd like to see are the connected server counts (both chunk servers and metadata servers), missing chunk data, and hdd data.
@oszafraniec, all the most important information on one page - nice! @unixorn, so I will focus on these resources.
Thanks.
@s5unty I started something similar :smile:
https://gist.github.com/tianon/8db9b120fe83b7ed24c30b7911d739c9
It'd be awesome if we could get something slightly more official (especially standardized metric names). :innocent:
AWK is nice, but google/mtail is more fun. :laughing: WIP
Looking at the mfscli code, scraping this output can't possibly scale -- there are a lot of instances of if masterconn.version_at_least(x,y,z): ... data.append(...) (in other words, columns that only show up in the data at all if the master is over a certain version), so which column corresponds to which data depends on the master version number. :weary:
@tianon thank you for the gist! Do you have any grafana dashboard for it?
Unfortunately not -- I played with it for a while, but didn't love that it was scraping the output of a command whose format might change, so I stopped using it. Maybe someday I (or someone else) will port the mfscli code to a proper exporter... :grimacing: :sweat_smile:
I see we still don't have a proper exporter. That's a shame. Extending cgiserv should be a piece of cake.
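For anyone wanting to hand-roll one in the meantime, here is a minimal sketch of a Prometheus-style exporter using only the Python standard library. The metric names, the port, and the placeholder collect() function are all assumptions -- a real exporter would parse mfscli output (or speak the master protocol) inside collect():

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_metrics(values):
    """Render {metric_name: number} in the Prometheus text exposition format."""
    out = []
    for name, value in sorted(values.items()):
        out.append("# TYPE %s gauge" % name)
        out.append("%s %s" % (name, value))
    return "\n".join(out) + "\n"

def collect():
    # Placeholder values; a real exporter would scrape mfscli here.
    return {"moosefs_chunks_missing": 0, "moosefs_chunkservers_connected": 3}

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve (port 9577 is an arbitrary choice):
#   HTTPServer(("", 9577), MetricsHandler).serve_forever()
# then add host:9577 as a scrape target in prometheus.yml.
```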
Could someone point me to how to use the awk file (one or the other attached above) to get the data into Prometheus?
I am using grafana/influxdb/telegraf. I made a simple script that feeds the database (not using the latest Flux 2.x). The commands are very simple, like:
...
mfscli -SIG | awk '/store/{print "moosefs,metric=last_metadata_sync lstupdt=\""strftime("%d %b %T %Z",$6)"\""}'
mfscli -SIC -p | awk '/stable/ {print "moosefs,metric=chunks_stable stable="$6"i"}'
mfscli -SIC -p | awk '/undergoal:/ {print "moosefs,metric=chunks_under_goal undergoal="$6"i"}'
mfscli -SIC -p | awk '/endangered:/{print "moosefs,metric=chunks_endangered endangered="$6"i"}'
...
Then I left ncat listening on an arbitrary port; when a client connects, it launches the script.
Telegraf has the following to connect to ncat:
[[inputs.http]]
urls = [
"http://localhost:9473", "http://localhost:9474"
]
Below is a screenshot of the Grafana dashboard with alerts set for the most important events.
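If the awk one-liners grow unwieldy, the same InfluxDB line-protocol records can be built from Python instead; a small helper, with the measurement and tag names mirroring the snippets above:

```python
def influx_line(measurement, tags, fields):
    """Build one InfluxDB line-protocol record, e.g.:
        moosefs,metric=chunks_stable stable=100i
    Integers get the 'i' suffix and strings are double-quoted,
    matching what the awk pipelines above emit."""
    tag_part = ",".join("%s=%s" % kv for kv in sorted(tags.items()))
    field_parts = []
    for key, value in sorted(fields.items()):
        if isinstance(value, bool):  # check bool before int: bool is an int subtype
            field_parts.append("%s=%s" % (key, "true" if value else "false"))
        elif isinstance(value, int):
            field_parts.append("%s=%di" % (key, value))
        elif isinstance(value, str):
            field_parts.append('%s="%s"' % (key, value))
        else:  # floats go through unsuffixed
            field_parts.append("%s=%s" % (key, value))
    return "%s,%s %s" % (measurement, tag_part, ",".join(field_parts))

# influx_line("moosefs", {"metric": "chunks_stable"}, {"stable": 100})
# -> 'moosefs,metric=chunks_stable stable=100i'
```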
That's pretty neat, I see everybody is using grafana/influxdb/telegraf.
I was using Zabbix for years (I mean, I configured it like 5 years ago, and then it just kept running), but the LonghornFS below my db died on me recently, and I could not figure out how to revive it. So I started playing with Grafana and Prometheus; as you can imagine, I am pretty new to all that.
Could you tell me how is grafana/influxdb/telegraf superior to grafana/prometheus?
I don't think that grafana/influxdb/telegraf is superior to grafana/prometheus.
Prometheus is "a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts if some condition is observed to be true."
Telegraf is just a collector: a single small Go binary with a minimal memory footprint. It has just one configuration file and can collect many, many metrics.
It can also collect metrics from the sysstat tools.
I know very little of Prometheus having used Telegraf/Influx/Grafana all the time.
I am also running Zabbix in an LXD container, and I like the possibility to automatically restart a service if it crashes.
Zabbix has also nice infrastructure visualisation.
Hi, I am using grafana/influxdb/telegraf too. Can you share the commands and the dashboard JSON?
https://grafana.com/grafana/dashboards/16700-moosefs-overview/ There is a Prometheus dashboard, but I do not use Prometheus yet.
Hello eecsea,
No problem sharing my script and the dashboard, but I have switched the DB from InfluxDB to Postgres+Timescale, so the dashboard is not going to work if you are using Influx 1.x/2.x.
If you are still interested, I will upload everything to my site, or here if it is permitted.
Cheers