e2e-framework: implement metricschecker
We already have an rpcchecker package to verify Tetragon events from our gRPC API. The next item on my wishlist is a metricschecker we could use to verify specific Prometheus metrics during end-to-end tests. For example, a test could assert that we have a specific event count for a given pod, or that we have no occurrences of a specific error.
To do this, we would need to add a new metricschecker package to tests/e2e and write some logic to parse Prometheus metrics and compare them to expected values. Then we would expose this as a features.Func, just like we do for the rpcchecker.
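To make the intended shape concrete, here is a minimal sketch of the features.Func exposure. The metricschecker package and everything in it are hypothetical at this point; only features.Func and envconf come from the real sigs.k8s.io/e2e-framework module.

```go
package metricschecker

import (
	"context"
	"testing"

	"sigs.k8s.io/e2e-framework/pkg/envconf"
	"sigs.k8s.io/e2e-framework/pkg/features"
)

// MetricsChecker holds the metric expectations a test wants to assert.
// The type and its fields are placeholders for illustration.
type MetricsChecker struct {
	checks []func(t *testing.T)
}

// Check exposes the checker as a features.Func, mirroring how the
// rpcchecker is wired into e2e tests today.
func (mc *MetricsChecker) Check() features.Func {
	return func(ctx context.Context, t *testing.T, _ *envconf.Config) context.Context {
		for _, check := range mc.checks {
			check(t)
		}
		return ctx
	}
}
```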
Hello @willfindlay!
I am new to writing e2e tests in Go and would like to work on this issue on a "learn on the Go" basis, so it would be great if you could point me in the right direction to get started.
In the meantime I am going through the checker package present at tests/e2e/checker/rpcchecker.go
I have a few questions to get started:
1. What are the specific Prometheus metrics we need to verify during e2e tests? A simple example would work.
2. What are the requirements for the metrics checker?
3. How will we integrate it with the existing test framework?
@kkourt
Hey @prateek041, thanks for your interest in the project! Let me take a little time to come up with a more concrete list of requirements and I'll follow up here shortly.
Thank you @willfindlay! Really looking forward to contributing to the project.
@prateek041 Here's a rough answer for the above questions to get you started.
- Let's say for now we're interested in verifying that some simple error metrics are zero. A good first example would be `notify_overflowed_events` (the number of perf events we have dropped). But let's have the checker be generic so that the user can specify the metrics they care about and what ranges they are expecting (zero, non-zero, less than or greater than n, etc.).
- I'm just spitballing here, but ideally the workflow would be: 1. pull metrics per Tetragon pod; 2. aggregate them; 3. expose a builder that lets you write metrics check queries, similar to how we build eventcheckers now in the eventchecker package (see the sketch after this list); 4. run the queries in a FeatureFunc (see below).
- I would write a package similar to tests/e2e/checker that wraps our metrics checker and exposes it as a FeatureFunc that can be used in e2e tests. The best example from the checker package is here: https://github.com/cilium/tetragon/blob/2432502d2c7b4b1d0d755c824644ee3e5268947d/tests/e2e/checker/rpcchecker.go#L130-L177
Bonus points if we can identify which pod(s) fail the checks in multi-node clusters. Could be useful for debugging a failed test.
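To make point 3 a bit more concrete, here is a rough sketch of the builder surface, continuing the hypothetical MetricsChecker type sketched in the issue description. Every identifier below is made up; fetchAggregated stands in for steps 1 and 2 (scraping each pod and aggregating).

```go
package metricschecker

import "testing"

// NewMetricsChecker starts an empty fluent builder (hypothetical).
func NewMetricsChecker() *MetricsChecker { return &MetricsChecker{} }

// fetchAggregated is a placeholder for steps 1 and 2: scrape every
// Tetragon pod's metrics endpoint and sum the values for one metric.
func fetchAggregated(name string) float64 {
	return 0 // a real implementation would scrape and aggregate per pod
}

// Equals queues a check that the aggregated value of name is exactly v.
func (mc *MetricsChecker) Equals(name string, v float64) *MetricsChecker {
	mc.checks = append(mc.checks, func(t *testing.T) {
		if observed := fetchAggregated(name); observed != v {
			t.Errorf("expected %s == %v, got %v", name, v, observed)
		}
	})
	return mc
}

// LessThan queues a check that the aggregated value of name is below v.
func (mc *MetricsChecker) LessThan(name string, v float64) *MetricsChecker {
	mc.checks = append(mc.checks, func(t *testing.T) {
		if observed := fetchAggregated(name); observed >= v {
			t.Errorf("expected %s < %v, got %v", name, v, observed)
		}
	})
	return mc
}
```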
Thanks for sharing! I will start working on the issue now.
After going through the code base, here is the list of metrics I found.
- error metrics
- event cache metrics
- event metrics (contains `notify_overflowed_events` from the example above)
- kprobe metrics
- map metrics
- opcode metrics
- process exec metrics
- ringbuf metrics
- watcher metrics
These are all present here, in the metrics package.
Out of these metrics, which ones do we intend to write tests for? According to @willfindlay, most of these need to be covered, except probes.
I don't think we want to write new tests (yet). Rather, I want to add these metrics checks to the existing tests.
I am trying to understand how I should filter out the metrics that the user wants from all the metrics exposed at the /metrics path. One way to do it is with loops matching against a selectedMetrics[] slice; there is also this v1 package, which I haven't fully tested yet. There may be other approaches I haven't come across, so what is the recommended way here?
I tried to go through the rpcchecker to understand how it does this, but since I don't know much about protobuf files, I was unable to understand much. Here is the function.
Here are some additional questions:
- In my dev environment there is just one Tetragon pod, so I can fetch the metrics from :2112/metrics, but what about fetching from multiple pods? Since you mentioned "pull metrics per Tetragon pod", and there is a multiplexer for gRPC, should something similar be implemented for metrics too?
- You mentioned "expose a builder that lets you write queries"; please elaborate on that a bit more.
> I am trying to understand how I should filter out the metrics that the user wants from all the metrics exposed at the /metrics path.
There are a couple of approaches that could work here. Figuring out which one to use is part of the exercise. The API client package you linked sounds promising.
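For reference, filtering with the Prometheus text parser could look roughly like this. The expfmt package and its TextParser are real (from github.com/prometheus/common); the endpoint address and the selected list are placeholders.

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/prometheus/common/expfmt"
)

func main() {
	// Placeholder endpoint; in the e2e framework this would be the
	// port-forwarded metrics address of a Tetragon pod.
	resp, err := http.Get("http://localhost:2112/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Parse the text exposition format into metric families.
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		panic(err)
	}

	// Keep only the metrics the test asked about.
	selected := []string{"notify_overflowed_events"}
	for _, name := range selected {
		if family, ok := families[name]; ok {
			for _, m := range family.GetMetric() {
				fmt.Printf("%s = %v\n", name, m.GetCounter().GetValue())
			}
		}
	}
}
```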
> There is a multiplexer for gRPC; should something similar be implemented for metrics too?
We will need the metricschecker to validate metrics on multi-node clusters, so we will need something like the gRPC multiplexer that abstracts over multiple metrics connections (one per pod).
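A hedged sketch of what that abstraction could look like, reusing the expfmt parsing from the snippet above and keeping the pod name next to each sample so a failing check can point at the offending pod (the bonus point from earlier). All names here are assumptions.

```go
package metricschecker

import (
	"fmt"
	"net/http"

	"github.com/prometheus/common/expfmt"
)

// PodMetrics keeps per-pod samples so a failed check can report which
// pod produced the bad value (hypothetical type).
type PodMetrics struct {
	Pod    string
	Values map[string]float64 // metric name -> observed counter value
}

// scrapeOne parses one pod's /metrics endpoint into name -> value.
func scrapeOne(addr string) (map[string]float64, error) {
	resp, err := http.Get(fmt.Sprintf("http://%s/metrics", addr))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		return nil, err
	}
	values := make(map[string]float64)
	for name, family := range families {
		for _, m := range family.GetMetric() {
			values[name] += m.GetCounter().GetValue()
		}
	}
	return values, nil
}

// scrapeAll pulls metrics from every forwarded Tetragon pod endpoint,
// keyed by pod name, so checks can be evaluated per pod.
func scrapeAll(endpoints map[string]string) ([]PodMetrics, error) {
	var all []PodMetrics
	for pod, addr := range endpoints {
		values, err := scrapeOne(addr)
		if err != nil {
			return nil, fmt.Errorf("pod %s: %w", pod, err)
		}
		all = append(all, PodMetrics{Pod: pod, Values: values})
	}
	return all, nil
}
```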
> You mentioned "expose a builder that lets you write queries"; please elaborate on that a bit more.
Just an API similar to the eventchecker, something like `NewMetricsChecker().LessThanOrEqual("ringbuf_dropped_count", 0)`.
Hey @prateek041, just checking in. Any progress updates or questions from your side?
Hello @willfindlay!
I couldn't work on the issue for two days due to bad weather conditions here. I have a few questions, but I try hard to find as many answers as I can on my own and to ask 3-4 of them together so I don't take up too much of your time.
Updates:
- I can successfully filter out the metrics based on what is asked for; here is the sample code. So I have an idea of how to implement the checker now.
- Currently I am trying to wrap my head around the flow in which the tests run, so that I can attach the metricschecker to that flow.
- I believe runners.go is the file responsible for the flow of these tests? I am going through it right now.
> I can successfully filter out the metrics based on what is asked for; here is the sample code. So I have an idea of how to implement the checker now.
Great news!
> I believe runners.go is the file responsible for the flow of these tests? I am going through it right now.
Yes, that's correct. The Runner struct in that file essentially manages the flow of the tests and takes care of installing cilium/tetragon and forwarding whatever ports we need to get the gRPC checkers etc. working correctly.
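To sketch where this lands in practice (hypothetical names again; features and testenv are the standard e2e-framework pieces), an existing test would just gain one more assess step once the Runner forwards the metrics ports:

```go
// Hypothetical: attaching the metrics check to an existing feature once
// the Runner has port-forwarded each pod's metrics endpoint.
checker := metricschecker.NewMetricsChecker().
	Equals("notify_overflowed_events", 0)

feature := features.New("process exec events").
	// ... existing assess steps driving the workload and rpcchecker ...
	Assess("error metrics are clean", checker.Check()).
	Feature()

testenv.Test(t, feature)
```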
Should I raise a draft PR so that every small piece can be reviewed and discussed? @willfindlay
@prateek041 If you think you have enough concrete pieces, I'd be happy to take a look. Otherwise it's also fine to wait until you have a little more.
Are you still working on this, @prateek041?