Protobuf marshalling error when processing traffic over gRPC stream
What happened?
This issue affects our production systems and appears consistently during load tests.
It was initially discovered when using our own gRPC agent, which consumes events from Tetragon directly, but it can easily be reproduced using tetra (a minimal sketch of such a consumer follows the reproduction steps below).
In a container that is being monitored, run:
while true; do cat /etc/pam.conf > /dev/null && awk 'BEGIN {system("whoami")}' > /dev/null && sleep 0.25 || break; done
In the Tetragon container, run:
tetra getevents --pods test-pod -o compact
This will fail after some time (~5-60 min) with the following error:
<...>
🚀 process default/test-pod-debian /usr/bin/whoami
💥 exit default/test-pod-debian /usr/bin/whoami 0
💥 exit default/test-pod-debian /bin/sh -c whoami 0
💥 exit default/test-pod-debian /usr/bin/awk "BEGIN {system("whoami")}" 0
🚀 process default/test-pod-debian /usr/bin/sleep 0.25
time="2024-12-26T14:17:58Z" level=fatal msg="Failed to receive events" error="rpc error: code = Internal desc = grpc: error while marshaling: marshaling tetragon.GetEventsResponse: size mismatch (see https://github.com/golang/protobuf/issues/1609): calculated=0, measured=134"
This reproduces even without any Tracing Policy.
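For reference, the custom agent mentioned above is essentially just a minimal consumer of the Tetragon GetEvents stream. Below is a rough sketch (not our exact agent), assuming the public Go API package github.com/cilium/tetragon/api/v1/tetragon and Tetragon's default gRPC address localhost:54321:

// main.go: minimal Tetragon event consumer (illustrative sketch only).
package main

import (
	"context"
	"log"

	"github.com/cilium/tetragon/api/v1/tetragon"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Assumes Tetragon's gRPC server is reachable on its default localhost:54321.
	conn, err := grpc.Dial("localhost:54321",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer conn.Close()

	client := tetragon.NewFineGuidanceSensorsClient(conn)
	stream, err := client.GetEvents(context.Background(), &tetragon.GetEventsRequest{})
	if err != nil {
		log.Fatalf("GetEvents: %v", err)
	}

	// Recv blocks until the next GetEventsResponse arrives; under the load
	// pattern above it eventually fails with the Internal "size mismatch" error.
	for {
		resp, err := stream.Recv()
		if err != nil {
			log.Fatalf("Failed to receive events: %v", err)
		}
		log.Printf("event: %v", resp)
	}
}

Streaming with tetra as shown above fails the same way, so the bug does not appear to be specific to our agent.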
Tetragon Version
v1.1.2
Kernel Version
5.14.0-284.30.1.el9_2.x86_64
Kubernetes Version
v1.27.6
I've seen this issue https://github.com/cilium/tetragon/issues/2875 but I'm not sure whether it's the same issue or something specific to a particular test.
Thanks for the report and the reproduction steps. Indeed, this is an issue we've bumped into regularly. I think @will-isovalent investigated it a while ago and fixed parts of it; he might have more context on it.
Thanks @inliquid! Can you reproduce it without awk or is awk needed for the issue to happen?
My guess is that there is a race when filling in the .Process section when we get out-of-order exec events. So the fact that two very closely spaced exec events (via awk) seem to be needed to reproduce this might indicate that my suspicion is correct.
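For context, the error in the report is what protobuf-go produces when (as far as I understand) a message changes between the size-calculation pass and the encode pass of marshaling, which would be consistent with the .Process race above. A minimal standalone illustration of that failure mode (using wrapperspb instead of the tetragon types, purely as a sketch) would look like the following; it is a data race and timing-dependent, easiest to confirm with go run -race:

package main

import (
	"fmt"
	"math"
	"time"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

func main() {
	msg := wrapperspb.Int64(1)

	// Writer: keeps flipping the field between values with different encoded
	// sizes (1-byte vs 10-byte varint), standing in for an event whose
	// .Process section is filled in after the response was handed to gRPC.
	go func() {
		for {
			msg.Value = 1
			msg.Value = math.MaxInt64
		}
	}()

	// Reader: marshals the same message concurrently, the way the gRPC server
	// marshals GetEventsResponse. If the value changes between the size pass
	// and the encode pass, proto.Marshal reports a size mismatch.
	deadline := time.Now().Add(5 * time.Second)
	for time.Now().Before(deadline) {
		if _, err := proto.Marshal(msg); err != nil {
			fmt.Println("reproduced:", err)
			return
		}
	}
	fmt.Println("no mismatch in this run; the race window is narrow and timing-dependent")
}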
Hi @kkourt! I have run the same test without awk a couple of times and the error no longer appears, so this is most likely connected, as you said.
Are there any plans for a fix?
I tried to reproduce this on a bare-metal (non-k8s) machine but couldn't. @inliquid, can you check whether this is also the case on your side, i.e. whether running the same command outside of a Kubernetes cluster also fails to trigger the bug?
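For reference, outside of Kubernetes the pod filter does not apply, so the equivalent check would be running the same while loop locally and streaming with:
tetra getevents -o compact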
This issue has more information, so I'll keep this one open; it seems to target the same underlying problem. Here's the one marked as a duplicate, for reference:
- https://github.com/cilium/tetragon/issues/2875