FEATURE REQUEST - Make metrics accessible outside the CGI

Open unixorn opened this issue 5 years ago • 20 comments

Have you read through available documentation and open Github issues?

Yes

Is this a BUG report, FEATURE request, or a QUESTION? Who is the intended audience?

FEATURE request

System information

Your moosefs version and its origin (moosefs.com, packaged by distro, built from source, ...).

3.0.103

Operating system (distribution) and kernel version.

Mix of debian stretch and buster.

I would like to have a way to extract the metrics that are currently displayed in the CGI and write them to statsd or prometheus.

That would make it much easier to set up proper alerting rules for various cluster conditions that I can currently only see in the CGI interface.

unixorn avatar Aug 28 '19 19:08 unixorn

@xandrus is working on JSON output to mfscli - I hope that it will be enough.

acid-maker avatar Nov 07 '19 06:11 acid-maker

That would be great. If I can run something --json > state.json, I can parse the state.json and write the metrics I want to statsd, prometheus or wherever.
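A minimal sketch of that workflow, assuming the JSON turns out to be a flat object mapping metric names to numbers. The actual `--json` output format is not defined in this thread, so the structure, the metric names, and the `moosefs` prefix below are all hypothetical:

```python
import json


def json_to_prometheus(state_json, prefix="moosefs"):
    """Render a flat {"name": number} JSON object as Prometheus text lines."""
    state = json.loads(state_json)
    lines = []
    for name, value in sorted(state.items()):
        # Keep only plain numbers; bool is a subclass of int, so exclude it.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        lines.append(f"{prefix}_{name} {value}\n")
    return "".join(lines)


if __name__ == "__main__":
    # Hypothetical sample; real key names depend on what mfscli ends up emitting.
    sample = '{"chunkservers_connected": 4, "missing_chunks": 0, "version": "3.0.103"}'
    print(json_to_prometheus(sample), end="")
```

Non-numeric values such as the version string are skipped here; a real exporter would more likely turn them into labels.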

unixorn avatar Nov 07 '19 15:11 unixorn

Hi @acid-maker, thanks for assigning this task to me. I started adding JSON output to the mfscgi command a few weeks ago, but at the moment I am focused on a different task. I believe I will be able to spend more time on it this week.

@unixorn, which mfscli data set would you like to have in JSON output first? I mean:

	-S data set : defines data set to be displayed
		-SIN : show full master info
		-SIM : show only masters states
		-SLI : show only licence info
		-SIG : show only general master (leader) info
		-SMU : show only master memory usage
		-SIC : show only chunks info (goal/copies matrices)
		-SIL : show only loop info (with messages)
		-SMF : show only missing chunks/files
		-SCS : show connected chunk servers
		-SMB : show connected metadata backup servers
		-SHD : show hdd data
		-SEX : show exports
		-SMS : show active mounts
		-SRS : show resources (storage classes,open files,acquired locks)
		-SSC : show storage classes
		-SOF : show only open files
		-SAL : show only acquired locks
		-SMO : show operation counters
		-SQU : show quota info
		-SMC : show master charts data
		-SCC : show chunkserver charts data

xandrus avatar Nov 13 '19 19:11 xandrus

@xandrus, I think I need to come back to the abandoned project we started last year "after hours" ;) I think we can make decent use of your JSON output instead of parsing mfscli output as we used to do... What do you think? :)

And of course our work was inspired by Ceph Dashboard, there is no doubt about it.

IMG_9642

Screenshot 2018-07-25 at 12 58 12

oszafraniec avatar Nov 13 '19 20:11 oszafraniec

+1

jkiebzak avatar Nov 13 '19 20:11 jkiebzak

@xandrus, the first things I'd like to see are the connected server counts (both chunk servers and metadata servers), missing chunk data, and hdd data.

unixorn avatar Nov 13 '19 20:11 unixorn

@oszafraniec, all the most important information on one page - nice! @unixorn, so I will focus on these resources.

Thanks.

xandrus avatar Nov 13 '19 21:11 xandrus

@s5unty I started something similar :smile:

https://gist.github.com/tianon/8db9b120fe83b7ed24c30b7911d739c9

It'd be awesome if we could get something slightly more official (especially standardized metric names). :innocent:

tianon avatar Apr 30 '20 17:04 tianon

AWK is nice, but google/mtail is more fun. :laughing: WIP

s5unty avatar May 05 '20 01:05 s5unty

Looking at the mfscli code, scraping this output can't possibly scale -- there are a lot of instances of if masterconn.version_at_least(x,y,z): ... data.append(...) (in other words, columns that only even show up in the data at all if the master is over a certain version), so which column corresponds to which data is dependent on the master version number. :weary:
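To illustrate the problem, here is a small sketch of what a scraper would have to do to stay correct: rebuild the column order from the master version before touching any row. The column names and the version threshold are invented for the example and are not taken from the mfscli source:

```python
def column_names(master_version):
    """Return the column order for a (major, minor, patch) master version."""
    names = ["ip", "port", "chunks"]  # hypothetical base columns
    if master_version >= (3, 0, 100):  # hypothetical version threshold
        # A newer master adds a column in the middle, shifting "chunks" right.
        names.insert(2, "load")
    return names


def parse_row(line, master_version, separator="|"):
    """Map one separated output row to a dict keyed by column name."""
    values = line.split(separator)
    names = column_names(master_version)
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} columns, got {len(values)}")
    return dict(zip(names, values))


if __name__ == "__main__":
    print(parse_row("10.0.0.1|9422|120", (3, 0, 90)))
    print(parse_row("10.0.0.1|9422|7|120", (3, 0, 103)))
```

A fixed `$3` in awk would read the chunk count on the older master and the load on the newer one, which is exactly the fragility described above. A scraper would have to mirror every such version check.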

tianon avatar May 07 '20 02:05 tianon

@tianon thank you for the gist! Do you have any grafana dashboard for it?

uu avatar Oct 11 '21 08:10 uu

Unfortunately not -- I played with it for a while, but didn't love that it was scraping the output of a command whose format might change, so I stopped using it. Maybe someday I (or someone else) will port the mfscli code to a proper exporter... :grimacing: :sweat_smile:

tianon avatar Oct 11 '21 17:10 tianon

I see we still don't have a proper exporter. That's a shame. The extension of cgiserv should be a piece of cake.

Could someone point me to how to use the awk file (one or the other attached above) to get the data into Prometheus?
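One commonly used route for script output like this is the node_exporter textfile collector, which reads `*.prom` files from the directory given by `--collector.textfile.directory`. A minimal sketch, assuming node_exporter is already running with that flag and that the metrics have been collected into a dict. The function name, metric name, and path are made up for the example:

```python
import os
import tempfile


def write_prom_file(path, metrics):
    """Atomically write {"metric_name": number} in Prometheus text format."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory and rename it afterwards,
    # so the collector never reads a half-written file.
    with tempfile.NamedTemporaryFile("w", dir=directory, delete=False) as tmp:
        for name, value in sorted(metrics.items()):
            tmp.write(f"{name} {value}\n")
    os.chmod(tmp.name, 0o644)  # temporary files are created with mode 0600
    os.replace(tmp.name, path)
```

A cron job could then call something like `write_prom_file("/var/lib/node_exporter/textfile/moosefs.prom", {"moosefs_chunks_undergoal": 3})`, with the values taken from whatever parsing step is in use.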

eleaner avatar Aug 25 '22 10:08 eleaner

I am using grafana/influxdb/telegraf. I made a simple script that feeds the database (not using the latest flux 2.x). The commands are very simple, like:

...
mfscli -SIG|awk '/store/{print "moosefs,metric=last_metadata_sync lstupdt=""\""strftime("%d %b %T %Z",$6)"\""}'
mfscli -SIC -p | awk '/stable/ {print "moosefs,metric=chunks_stable stable="$6"i"}'
mfscli -SIC -p | awk '/undergoal:/ {print "moosefs,metric=chunks_under_goal undergoal="$6"i"}'
mfscli -SIC -p | awk '/endangered:/{print "moosefs,metric=chunks_endangered endangered="$6"i"}'
...
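The lines printed by these commands are InfluxDB line protocol: a measurement, comma-separated tags, a space, then fields, with integer values carrying an `i` suffix and string values in double quotes. As a sketch of the same formatting done in one place, here is a small helper. The function names are made up for the example and it covers only the basic escaping rules:

```python
def _escape(text, chars):
    """Backslash-escape each character of `chars` that occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def format_field(value):
    """Format one field value: bools, integers (i suffix), floats, strings."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    # Escape backslashes first, then double quotes.
    return '"' + _escape(str(value), '\\"') + '"'


def line_protocol(measurement, tags, fields):
    """Build one line: measurement,tag=value field=value."""
    head = _escape(measurement, ", ")
    for key, value in sorted(tags.items()):
        head += "," + _escape(key, ",= ") + "=" + _escape(str(value), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + format_field(value)
        for key, value in fields.items()
    )
    return f"{head} {body}"


if __name__ == "__main__":
    print(line_protocol("moosefs", {"metric": "chunks_stable"}, {"stable": 1234}))
```

No timestamp is appended, so the receiving side assigns one on arrival, as with the awk output above.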

Then I left ncat listening on an arbitrary port and when connected it launches the script.

Telegraf has the following to connect to ncat:

 [[inputs.http]]
   urls = [
	"http://localhost:9473", "http://localhost:9474"
   ]
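As an alternative to the ncat listener, the same idea can be sketched with Python's standard library: a small HTTP server that runs the collection script on each GET and returns its output. The script path and port are assumptions for the example, and on the Telegraf side the input would presumably also need `data_format = "influx"` to parse the lines:

```python
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical path to the script that prints the line-protocol metrics.
COMMAND = ["/usr/local/bin/moosefs-metrics.sh"]


def run_collector(command):
    """Run the collector command and return its standard output as bytes."""
    result = subprocess.run(command, capture_output=True, check=True)
    return result.stdout


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            body = run_collector(COMMAND)
        except (OSError, subprocess.CalledProcessError):
            self.send_error(500, "collector failed")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9473):
    """Serve metrics on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling `serve()` at the end of the script starts the listener. Unlike a raw TCP listener, this sends proper HTTP status and headers, which an HTTP client such as Telegraf's `inputs.http` expects.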

Below is a screenshot of the grafana dashboard, which has alerts set for the most important events.

2022-08-25-114330_1285x912_scrot

maxlivi avatar Aug 25 '22 10:08 maxlivi

That's pretty neat. I see everybody is using grafana/influxdb/telegraf.

I was using Zabbix for years (I mean, I configured it like 5 years ago, and then it was just running), but LonghornFS below my db died on me recently, and I could not figure out how to revive it. So I started playing with grafana and prometheus. As you can imagine, I am pretty new to all that.

Could you tell me how grafana/influxdb/telegraf is superior to grafana/prometheus?

eleaner avatar Aug 26 '22 22:08 eleaner

I don't think that grafana/influxdb/telegraf is superior to grafana/prometheus.

Prometheus is "a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts if some condition is observed to be true."

Telegraf is just a collector: a single small Go binary with a minimal memory footprint. It has just one configuration file and can collect many, many metrics.

It can also collect metrics from the sysstat tools.

I know very little about Prometheus, having used Telegraf/Influx/Grafana all the time.

I am also running Zabbix in an lxd container, and I like the possibility of automatically starting a service if it crashes.

Zabbix also has nice infrastructure visualisation.

maxlivi avatar Aug 27 '22 13:08 maxlivi

Hi. I am using grafana/influxdb/telegraf too. Can you share the commands and the dashboard JSON?

https://grafana.com/grafana/dashboards/16700-moosefs-overview/ There is a prometheus dashboard, but I do not use Prometheus yet.

seecsea avatar Dec 13 '23 02:12 seecsea

Hello seecsea,

No problem sharing my script and the dashboard, but I have switched the DB from influxdb to postgres+timescale, so the dashboard is not going to work if you are using influx1/2.

If you are still interested, I will upload everything to my site, or upload it here if that is permitted.

Cheers

maxlivi avatar Dec 13 '23 11:12 maxlivi