Export prometheus metrics at `/metrics`
Problem
I'd ultimately like to expose metrics to Prometheus from most of my operators, so having a blessed way to do it via the built-in kopf aiohttp server would be very nice, rather than rigging something up in each project.
`on.probe` is close, but I'd like something Prometheus can scrape directly.
Proposal
My preferred solution would be for kopf to natively support Prometheus via its existing aiohttp server. That means it would set up an aiohttp route that ran the Prometheus collector and served the results. Ideally there would also be a handler, so metrics generated dynamically at probe time could be mixed in.
Failing that, a hook to let me at least piggyback on the existing aiohttp server and reduce some of the duplicated boilerplate code.
I think Prometheus is the most common monitoring tool on Kubernetes, so others may find this useful too.
One pattern with Prometheus is to make a custom `Collector` class that yields `MetricFamily` objects. We would have an aiohttp view that created such a collector; it would call all the `kopf.on.prometheus` handlers and allow them to yield metric families:
```python
import kopf
from prometheus_client.core import GaugeMetricFamily

@kopf.on.prometheus
def collect_prom_metrics(**kwargs):
    my_measurement = GaugeMetricFamily('my_measurement', 'some description', labels=["label"])
    my_measurement.add_metric(["mylabel"], 52)
    yield my_measurement
```
This technique is used if you are calculating the metric or sampling it from an external source.
The other option is used when you are instrumenting the operator itself:
```python
import kopf
from prometheus_client import Counter

MY_COUNTER = Counter(
    "resource_created", "Resource created by operator", ["labelkey1", "labelkey2"]
)

@kopf.on.create(...snip...)
def myhandler(**kwargs):
    MY_COUNTER.labels("labelvalue1", "labelvalue2").inc()
```
I'm thinking of working on this if people think it would be good to have. I hope to implement it in a way that keeps the Prometheus dependency optional.
Although it would be good to instrument kopf itself at some point, I think that is a separate feature request.
Checklist
- [x] Many users can benefit from this feature, it is not a one-time case
- [x] The proposal is related to the K8s operator framework, not to the K8s client libraries
My approach to this was going to be to add a route to the existing `aiohttp.web.Application()`; the question is whether it would be OK for that functionality to live in `probing.py`, or whether it needs breaking out somehow.
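Roughly what I have in mind, as a sketch (it assumes the operator's `aiohttp.web.Application` were reachable, which is not part of kopf's public interface today):
```python
from aiohttp import web
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

async def metrics(request: web.Request) -> web.Response:
    # Collect the default registry and serve the Prometheus text format.
    return web.Response(body=generate_latest(),
                        headers={'Content-Type': CONTENT_TYPE_LATEST})

app = web.Application()  # in the proposal: kopf's own application
app.router.add_get('/metrics', metrics)
```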
Hm. First of all, I am not familiar with Prometheus, so I have to read & learn the concepts first before making any judgements.
Nevertheless, while I understand that Prometheus is a popular (or the most popular) monitoring tool now, that can all change. Besides, there can be other opinions, and I do not like pushing my or someone else's personal preferences as unavoidable defaults for everyone (see a counter-example: Black).
Before we go into deep integration with Prometheus, with same-named decorators in the public interface, I would like to clarify a few things:
- Can the Prometheus metrics be served in a separate thread on a separate port, as in their README? What are the benefits of tight integration? Is it the usage of the probing/metrics decorators?
- Can this Kopf feature be generalised for any monitoring system? Perhaps different endpoints with different output formatters?
- Can Kopf be extended or redesigned somehow, so that it becomes extensible for these metrics and endpoints but does not contain them directly? Specifically, can Kopf be made pluggable, with the Prometheus-specific tooling implemented as another Python library with some hooks/connection points into Kopf (see pytest plugins as an example)? The authentication subsystem could also benefit from this pluggability.
Also, can you please clarify why `kopf.on.probe` is not sufficient? Is it a different output format only, or are there other challenges with using the probe-handlers?
Also, `@kopf.on.prometheus` sounds unusual. The original intention was to use verbs there (`on.create`), with a growing urge to replace them with proper English nouns representing events or actions (`on.creation`). In a work-in-progress branch I already have a case that does not fit the convention and uses a separate decorator naming convention: `@kopf.daemon` (no `.on.`). We can think of `@kopf.metric`, but the question is: do we need it at all (see above)?
All good points!
> Can the Prometheus metrics be served in a separate thread?
In general, I don't like using their threaded web server (the one-liner shown below) in a thread when aiohttp is available. This is probably a personal-preference thing, but at a technical level there are 2 main issues:
- If you want to run a collector that uses async, you can't use their server.
- You have done some good work to integrate the aiohttp server into the kopf lifecycle. I want to leverage that. Theirs (I assume) uses daemon threads, so the OS will just pull the rug from under the thread when kopf exits.
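For reference, the README approach being discussed is a one-liner (the port here is arbitrary):
```python
from prometheus_client import start_http_server

# Serve the default registry over plain HTTP in a daemon thread,
# on its own port, independent of kopf's aiohttp server.
start_http_server(9090)
```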
> Can this Kopf feature be generalised for any monitoring system?
The problem with a generic monitoring integration is that each system provides its own helpers, which you'd have to reimplement to be as nice to use. For example:
```python
from prometheus_client import Summary

REQUEST_TIME = Summary('request_processing_seconds', 'Time spent processing request')

# Decorate function with metric.
@REQUEST_TIME.time()
def process_request(t):
    """A dummy function that takes some time."""
    pass
```
This will generate 2 metrics: `request_processing_seconds_count` and `request_processing_seconds_sum`.
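To make that concrete, scraping the default registry after a single call produces the Prometheus text exposition format; a sketch of what to expect (exact lines vary by prometheus_client version):
```python
from prometheus_client import generate_latest

process_request(0)
print(generate_latest().decode())
# Among the default process metrics, the output includes lines like:
#   # HELP request_processing_seconds Time spent processing request
#   # TYPE request_processing_seconds summary
#   request_processing_seconds_count 1.0
#   request_processing_seconds_sum 3.1e-05
```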
Building an abstract version of this is probably not something that belongs in kopf, even though monitoring is very important for most operators.
> Can Kopf be extended or redesigned somehow, so that it becomes extensible for these metrics and endpoints but does not contain them directly?
There are maybe 3 changes I would like if there weren't native Prometheus integration:
- Factor the aiohttp code out of `probing.py` so it can be reused by the operator for its own services. This is mostly about the little details, like not handling signals and having it automatically shut down when kopf goes away.
- I would quite like to be able to control the routes (or at least add my own) on a single aiohttp server. It's arguably a taste thing, but I don't want to listen on multiple ports if I don't need to.
- I'm glad for `kopf.on.startup()` and `@kopf.on.cleanup()`, but one pattern that might be useful is `kopf.background_task`: something that is started on startup and automatically gets `.cancel()` called and awaited on cleanup. I take this a bit further in some of my code and restart background tasks if they hit an unhandled exception.
For example:
```python
import asyncio

import kopf
import kopf_prometheus  # hypothetical connector library
from prometheus_client import Counter

MY_COUNTER = Counter('my_requests_total', 'HTTP Failures', ['method', 'endpoint'])

async def my_collector():
    # Placeholder metric families, e.g. from prometheus_client.core.
    yield SomePrometheusMetric()
    yield SomeOtherMetric()

async def my_looping_function():
    while True:
        MY_COUNTER.labels('get', '/').inc()
        await asyncio.sleep(1)

kopf.background_task(my_looping_function)
kopf.background_task(kopf_prometheus.listen, port=8080, collectors=[my_collector])
```
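For comparison, a sketch of what `kopf.background_task` would automate, wired up manually with today's startup/cleanup handlers (handler names here are illustrative):
```python
import asyncio
import kopf

_tasks = []

@kopf.on.startup()
async def start_background_tasks(**_):
    _tasks.append(asyncio.create_task(my_looping_function()))

@kopf.on.cleanup()
async def stop_background_tasks(**_):
    for task in _tasks:
        task.cancel()
    await asyncio.gather(*_tasks, return_exceptions=True)
```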
> Are there other challenges of using the probe-handlers?
This is mostly just down to the output format, plus the richness of the prometheus_client API. You could probably serialize the Prometheus data into probe format, but then you'd need to get it back into Prometheus format. A Prometheus metric has a timestamp and multiple labels associated with it (which means a single metric may be listed multiple times), so there are some challenges there too, possibly.
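To illustrate the format gap, a minimal probe-handler (real kopf API) returns a plain value that kopf serves as JSON on the liveness endpoint, with no labels, types, or timestamps:
```python
import kopf

@kopf.on.probe(id='my_measurement')
def get_my_measurement(**kwargs):
    # Appears on the liveness endpoint as JSON, e.g. {"my_measurement": 52}.
    return 52
```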
> … integrate the aiohttp server into the kopf lifecycle. I want to leverage that.
That is exactly the point I see as problematic here: `aiohttp` is an internal implementation detail, the current way to expose the liveness via HTTP. The public interface only has the concept of liveness probes and some rudimentary metrics (domain-level concepts), not the way they are exposed to K8s (an implementation detail).
Injecting endpoints assumes either that the internal implementation detail (the `aiohttp` server) is exposed and made part of Kopf's public interface (see leaky abstractions), or that Kopf is extended with a lot of features that are more suitable for a web server (e.g. routes and endpoints).
Neither of these looks good. I prefer that the framework is not tied to aiohttp specifically when it comes to changing the way the HTTP(S) probes are served, as long as other features on their own abstraction levels (e.g. liveness probes) can still be served.
One way this could be generalised is, indeed, a daemon-handler for the whole operator. I currently have daemons implemented per-object in my own branch (with only some dirty tricks left to be resolved into clean code).
But it could easily be extended to per-operator daemons, and those daemons could be `async def` functions that run their own web servers (maybe even aiohttp in the same event loop) and are cancelled like all other async coroutines on operator shutdown.
Reproducing your example:
```python
import time

import kopf
import kopf_prometheus  # somehow, by someone

@kopf.daemon(errors=kopf.ErrorsMode.TEMPORARY, backoff=10)
async def prometheus_server(**_):
    server = kopf_prometheus.create_async(port=8080, collectors=[my_collector])
    await server.run()  # the same asyncio event loop as the operator

@kopf.daemon()
def my_looping_function(stopped, **_):
    # MY_COUNTER and my_collector as in the example above.
    while not stopped:
        MY_COUNTER.labels('get', '/').inc()
        time.sleep(1)  # asyncio-safe: runs in a thread
```
The per-operator daemons are not in that branch yet, but once per-resource daemons are there, they are easy to implement.
@Jc2k What do you think of this? Would this feature solve the problem (assuming that someone somehow makes a Kopf<->Prometheus connector library)?
PS: Maybe even a global timer (though it produces a lot of logs):
```python
import kopf

@kopf.on.timer(interval=1)
def my_timing_function(**_):
    MY_COUNTER.labels('get', '/').inc()
```
Yes, this sounds great! I would probably make such a Prometheus connector if `kopf.daemon` was available.
Separately, I do have a use for a timer as well, but thankfully not for the `interval=1` case :D