ceph-nagios-plugins

What about a new GitHub release?


As far as I can see, you build Debian packages with version 1.5.5, but there is only a GitHub release of version 1.5.0.

j-licht avatar Feb 13 '20 16:02 j-licht

Yes. Even if we don't make it an official feature, a Prometheus exporter is a valuable tool for benchmarking and debugging.

I've done this before on small scales with a shell script that issued JSON RPC commands to the SPDK target app, and transformed the output of various SPDK stats commands into a Prometheus text file with the jq utility. That kludge isn't good enough here, but it worked well enough to monitor ops and bytes/second for one or two bdevs for some experiments I did. When there are hundreds of bdevs we might find the overhead of frequent JSON RPC stats queries to be awkward. It's probably reasonable to start with something that gets the metrics with JSON RPC, and keep an eye on how that scales.
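To make that concrete, the kludge looked roughly like the following (a minimal sketch in Python rather than shell + jq, assuming SPDK's default `/var/tmp/spdk.sock` RPC socket and its `bdev_get_iostat` method; the stat field names match recent SPDK releases but should be double-checked, and the output path is just a hypothetical node_exporter textfile-collector location):

```python
#!/usr/bin/env python3
"""Toy SPDK -> Prometheus text-file scraper (illustrative sketch only)."""
import json
import socket

SPDK_SOCK = "/var/tmp/spdk.sock"                # default SPDK RPC socket
OUT_FILE = "/var/lib/node_exporter/spdk.prom"   # hypothetical textfile path


def spdk_rpc(method, params=None):
    """Issue one JSON-RPC request to the SPDK app and return the result."""
    req = {"jsonrpc": "2.0", "id": 1, "method": method}
    if params:
        req["params"] = params
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(SPDK_SOCK)
        sock.sendall(json.dumps(req).encode())
        buf = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                raise RuntimeError("SPDK closed the RPC socket")
            buf += chunk
            try:
                return json.loads(buf)["result"]
            except ValueError:
                continue  # response not complete yet, keep reading


def main():
    stats = spdk_rpc("bdev_get_iostat")
    lines = []
    for bdev in stats.get("bdevs", []):
        labels = 'bdev="%s"' % bdev["name"]
        # Counter-style fields; names assumed from recent SPDK output.
        for field in ("bytes_read", "num_read_ops",
                      "bytes_written", "num_write_ops"):
            if field in bdev:
                lines.append("spdk_bdev_%s_total{%s} %d"
                             % (field, labels, bdev[field]))
    with open(OUT_FILE, "w") as f:
        f.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    main()
```

Run from cron or a systemd timer, node_exporter's textfile collector picks up the file on its next scrape; the per-query RPC cost is exactly the scaling concern mentioned above.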

Besides IO stats per bdev, metrics for connected hosts and IOs per second by host would be helpful. IDK if there's a JSON RPC command that reveals the hosts connected to the NVMF target.
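If SPDK's `nvmf_subsystem_get_controllers` RPC is available (it exists in recent SPDK releases, but verify against the version in use, along with the `hostnqn` field name), enumerating connected hosts could look something like this hedged sketch (reusing the same minimal JSON-RPC helper as above):

```python
import json
import socket


def spdk_rpc(method, params=None, sock_path="/var/tmp/spdk.sock"):
    """Minimal SPDK JSON-RPC call, same approach as the earlier sketch."""
    req = {"jsonrpc": "2.0", "id": 1, "method": method}
    if params:
        req["params"] = params
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.sendall(json.dumps(req).encode())
        buf = b""
        while True:
            chunk = s.recv(4096)
            if not chunk:
                raise RuntimeError("SPDK closed the RPC socket")
            buf += chunk
            try:
                return json.loads(buf)["result"]
            except ValueError:
                pass  # keep reading until the JSON is complete


# nvmf_get_subsystems lists every subsystem NQN on the target;
# nvmf_subsystem_get_controllers reports the controllers (and their
# hostnqn) currently connected to one subsystem.
for subsys in spdk_rpc("nvmf_get_subsystems"):
    nqn = subsys["nqn"]
    for ctrlr in spdk_rpc("nvmf_subsystem_get_controllers", {"nqn": nqn}):
        # hostnqn identifies the connected initiator; field name assumed.
        print("subsystem=%s host=%s" % (nqn, ctrlr.get("hostnqn", "?")))
```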

When we get a discovery service, we'd like metrics from it as well.

It probably makes sense to define a Grafana dashboard for the gateway that aggregates the metrics from all gateways and the DS. This could help users confirm their fleet of hosts are connecting to the gateways they expect, with the transports they expect them to use.

sdpeters avatar Sep 15 '22 19:09 sdpeters

Hi @sdpeters, thanks for your answer and explanations. Great to know that it makes sense to have this monitoring feature.

I will be more than happy to help with this, but it would be nice to first discuss the design a little and, most importantly, the set of metrics that make sense.

Correct me if I am wrong, but I think the nvmeof-gateway server really acts as a "client" for Ceph RBD images. From what I see in the code, the nvmeof-gateway is just used to prepare SPDK targets pointing to RBD images.

From Ceph we have the right metrics to know the performance and usage of any pool, but not of specific RBD images. As each "bdev" is defined using a pool and an RBD image, I think this is where the nvmeof-gateway server can help.
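For illustration, the mapping the gateway sets up essentially comes down to one SPDK RPC per bdev. A hedged sketch of the request body (parameter names follow SPDK's `bdev_rbd_create` RPC and the values are made up, so double-check both against the SPDK version in use):

```python
import json

# One SPDK JSON-RPC call is (roughly) what ties a bdev to a pool/image.
create_req = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "bdev_rbd_create",
    "params": {
        "name": "Ceph0",        # resulting bdev name (hypothetical)
        "pool_name": "rbd",     # Ceph pool holding the image
        "rbd_name": "image01",  # RBD image backing the bdev
        "block_size": 4096,
    },
}
print(json.dumps(create_req, indent=2))
```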

What if we add a metrics endpoint directly to the gRPC server instead of using a Prometheus exporter or any other tool? I do not know the SPDK environment well, but I think it should be possible for the nvmeof-gateway server to collect the data coming from the different "bdevs" and expose it in the metrics endpoint (a set of usage/performance metrics for each bdev).

If this is a valid approach, I can help implement the Prometheus endpoint, but I will need your help to collect the right metrics from each bdev and with other details related to SPDK.
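As a very rough sketch of that shape (using the `prometheus_client` library; the `spdk_rpc()` helper from the earlier sketches is assumed to be packaged as a local module, and all metric and field names are illustrative, not the gateway's real API):

```python
"""Hedged sketch: a Prometheus endpoint inside the gateway process."""
import time

from prometheus_client import start_http_server
from prometheus_client.core import CounterMetricFamily, REGISTRY

from spdk_rpc_client import spdk_rpc  # hypothetical local module


class SpdkBdevCollector:
    """Queries SPDK on every Prometheus scrape, so the scrape interval
    directly controls the JSON-RPC load on the target app."""

    def collect(self):
        ops = CounterMetricFamily(
            "spdk_bdev_ops_total", "Completed bdev I/Os",
            labels=["bdev", "dir"])
        data = CounterMetricFamily(
            "spdk_bdev_bytes_total", "Bytes moved per bdev",
            labels=["bdev", "dir"])
        # Field names assumed from recent SPDK bdev_get_iostat output.
        for bdev in spdk_rpc("bdev_get_iostat").get("bdevs", []):
            name = bdev["name"]
            ops.add_metric([name, "read"], bdev["num_read_ops"])
            ops.add_metric([name, "write"], bdev["num_write_ops"])
            data.add_metric([name, "read"], bdev["bytes_read"])
            data.add_metric([name, "write"], bdev["bytes_written"])
        yield ops
        yield data


if __name__ == "__main__":
    REGISTRY.register(SpdkBdevCollector())
    start_http_server(9100)  # port is arbitrary for the sketch
    while True:
        time.sleep(60)
```

A pull-based custom collector like this also addresses the RPC-overhead concern above, since SPDK is only queried when something actually scrapes the endpoint.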

jmolmo avatar Sep 20 '22 14:09 jmolmo

The metrics provided by this are at least part of what's needed to meet the requirement in #116. #116 probably requires some metrics from the discovery services as well, to get a complete picture of the hosts connected to (or attempting to connect to) the gateway.

sdpeters avatar Apr 28 '23 21:04 sdpeters

@sdpeters I wrote a quick exporter while doing some performance testing - https://hub.docker.com/r/pcuzner/ceph-spdk-exporter. The code just uses SPDK RPC calls, but perhaps it would help as a starting point. I also have a Grafana dashboard if that's of interest.

pcuzner avatar Sep 26 '23 00:09 pcuzner