tetragon icon indicating copy to clipboard operation
tetragon copied to clipboard

Cannot merge two enter events error

Open Taction opened this issue 3 years ago • 4 comments

I use minikube and install tetragon with helm. I run the File Access example in the readme, after applied ./crds/examples/sys_write_follow_fd_prefix.yaml, the output of command kubectl logs -n kube-system ds/tetragon -c export-stdout -f | tetragon observe --namespace default --pod xwing works as expected. But I see the following log in the tetragon container.

time="2022-06-14T06:58:09Z" level=warning msg="cannot merge two enter events: prev:{ev:0xc001124b40 returnEvent:false} curr:{ev:0xc001124be0 returnEvent:false}"
time="2022-06-14T06:58:09Z" level=warning msg="cannot merge two enter events: prev:{ev:0xc001124d20 returnEvent:false} curr:{ev:0xc001124dc0 returnEvent:false}"
$ kubectl get tracingpolicies.cilium.io -A
NAME                     AGE
sys-read-follow-prefix   21h

Taction avatar Jun 14 '22 07:06 Taction

Thanks for the report!

One known reason for this error is when events are dropped. This will happen, for example, if the tetragon agent is unable to keep pace with the produced events and consume them,

Can you please check the metrics and verify that this is the case when the above error happens? (should be curl localhost:54321/metrics)

kkourt avatar Jun 14 '22 09:06 kkourt

Thanks for your reply!

I noticed that 54321 is a grpc server(seems more related to config control), and 2112 is a metrics server. So here is the result of curl localhost:2112/metrics. Since I am unfamiliar with these metrics, can you please look at them?

b2.log

Taction avatar Jun 15 '22 02:06 Taction

Thanks for your reply!

I noticed that 54321 is a grpc server(seems more related to config control), and 2112 is a metrics server. So here is the result of curl localhost:2112/metrics.

Thanks for catching that!

Since I am unfamiliar with these metrics, can you please look at them?

b2.log

Happy to!

I was referring to the BPF ringbuff metrics, which you can find here: https://github.com/cilium/tetragon/blob/a711e8371df678a2844b1c180931224de6372c0f/pkg/metrics/ringbufmetrics/ringbufmetrics.go#L13-L27

But in the log you provided, there do not seem to be any lost events or errors:

$ grep ringbuf <b2.log
# HELP isovalent_ringbuf_perf_event_received The total number of TETRAGON ringbuf perf events received.
# TYPE isovalent_ringbuf_perf_event_received gauge
isovalent_ringbuf_perf_event_received 21872

Thinking more about this, I remember that we had some patches with @olsajiri trying to address these issues. I'll see if we can dig them out.

kkourt avatar Jun 15 '22 07:06 kkourt

Here is the result of grep ringbuf and grep errors.

curl localhost:2112/metrics | grep ringbuf
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 38019    0 38019    0     0   317k      0 --:--:-- --:--:-- --:--:--  317k
# HELP isovalent_ringbuf_perf_event_received The total number of TETRAGON ringbuf perf events received.
# TYPE isovalent_ringbuf_perf_event_received gauge
isovalent_ringbuf_perf_event_received 1.937101e+06
curl localhost:2112/metrics | grep errors
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16202    0 16202    0     0   102k      0 --:--:-- --:--:-- --:--:--  101k# HELP isovalent_errors_total The total number of TETRAGON errors. For internal use only.
# TYPE isovalent_errors_total counter
isovalent_errors_total{type="exec_missing_parent"} 641
isovalent_errors_total{type="no_parent_no_clone"} 540
isovalent_errors_total{type="process_cache_evicted"} 261446
isovalent_errors_total{type="process_cache_miss_on_get"} 1798
# HELP isovalent_event_cache The total number of TETRAGON event cache access/errors. For internal use only.

Taction avatar Jun 15 '22 09:06 Taction

I believe this issue is fixed by https://github.com/cilium/tetragon/pull/422. @Taction please re-open it if that's not the case.

kkourt avatar May 15 '23 13:05 kkourt