
Protobuf marshalling error when processing traffic over gRPC stream

Open inliquid opened this issue 11 months ago • 6 comments

What happened?

The issue affects our production systems and appears constantly during load tests.

This was initially discovered when using our own gRPC agent, which consumes events from Tetragon directly, but it can easily be reproduced using tetra.
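For context, here is a rough sketch of the kind of consumer we hit this with (this is not our production agent; the client/field names below are my best recollection of the v1 API and the default listen address, so they may not match your version exactly). It simply dials Tetragon's gRPC endpoint and reads the GetEvents stream until Recv() fails:

package main

import (
	"context"
	"log"

	"github.com/cilium/tetragon/api/v1/tetragon"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Default Tetragon gRPC address; adjust for your deployment.
	conn, err := grpc.NewClient("localhost:54321",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	client := tetragon.NewFineGuardianClient(conn)
	stream, err := client.GetEvents(context.Background(), &tetragon.GetEventsRequest{
		// Roughly what `tetra getevents --pods test-pod` does.
		AllowList: []*tetragon.Filter{{PodRegex: []string{"test-pod"}}},
	})
	if err != nil {
		log.Fatalf("GetEvents: %v", err)
	}

	for {
		ev, err := stream.Recv()
		if err != nil {
			// This is where the stream dies with the Internal / size mismatch error.
			log.Fatalf("Recv: %v", err)
		}
		log.Printf("event: %v", ev.GetEvent())
	}
}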

In a container that is being monitored, run:

while true; do cat /etc/pam.conf > /dev/null  && awk 'BEGIN {system("whoami")}' > /dev/null && sleep 0.25 || break; done

In the Tetragon container, run:

tetra getevents --pods test-pod -o compact

This will fail after some time (~5-60 min) with the following error:

<...>
🚀 process default/test-pod-debian /usr/bin/whoami
💥 exit    default/test-pod-debian /usr/bin/whoami  0
💥 exit    default/test-pod-debian /bin/sh -c whoami 0
💥 exit    default/test-pod-debian /usr/bin/awk  "BEGIN {system("whoami")}" 0
🚀 process default/test-pod-debian /usr/bin/sleep 0.25
time="2024-12-26T14:17:58Z" level=fatal msg="Failed to receive events" error="rpc error: code = Internal desc = grpc: error while marshaling: marshaling tetragon.GetEventsResponse: size mismatch (see https://github.com/golang/protobuf/issues/1609): calculated=0, measured=134"

This reproduces even without any Tracing Policy.

Tetragon Version

v1.1.2

Kernel Version

5.14.0-284.30.1.el9_2.x86_64

Kubernetes Version

v1.27.6

inliquid avatar Dec 26 '24 14:12 inliquid

I've seen issue https://github.com/cilium/tetragon/issues/2875, but I'm not sure whether it's the same issue or something specific to that particular test.

inliquid avatar Dec 26 '24 14:12 inliquid

Thanks for the report and the reproduction steps. It's indeed an issue we've bumped into regularly; I think @will-isovalent investigated it a while ago and fixed parts of it, so he might have more context on it.

mtardy avatar Jan 02 '25 10:01 mtardy

Thanks @inliquid! Can you reproduce it without awk or is awk needed for the issue to happen?

My guess is that there is some race happening when filling the .Process section when we get out-of-order exec events, so the fact that two very close exec events (via awk) are needed to reproduce this might indicate that my suspicion above is correct.
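To illustrate the failure mode the error message points at (a minimal, hypothetical sketch, using well-known protobuf types in place of tetragon.GetEventsResponse, and assuming the race is of this shape): proto.Marshal first computes the size of a nested message and then encodes it, so if another goroutine mutates that nested message in between, the calculated and measured sizes disagree and marshalling can fail with the same "size mismatch (see https://github.com/golang/protobuf/issues/1609)" error as above. The demo is timing-dependent, so a given run may or may not hit it (running with -race shows the underlying data race either way):

package main

import (
	"fmt"
	"strings"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/structpb"
)

func main() {
	// A nested message inside an outer message, standing in for the .Process
	// section inside a GetEventsResponse.
	inner := structpb.NewStringValue("short")
	outer := &structpb.ListValue{Values: []*structpb.Value{inner}}

	long := strings.Repeat("x", 256)

	// Mutator: stands in for enrichment code rewriting a nested field after
	// the event has already been handed to the gRPC sender.
	go func() {
		for i := 0; ; i++ {
			if i%2 == 0 {
				inner.Kind = &structpb.Value_StringValue{StringValue: long}
			} else {
				inner.Kind = &structpb.Value_StringValue{StringValue: "short"}
			}
		}
	}()

	// Marshaller: what the gRPC codec does for every streamed response.
	for i := 0; i < 2_000_000; i++ {
		if _, err := proto.Marshal(outer); err != nil {
			fmt.Printf("iteration %d: %v\n", i, err)
			return
		}
	}
	fmt.Println("no mismatch observed this run (the race is timing-dependent)")
}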

kkourt avatar Jan 06 '25 13:01 kkourt

Hi @kkourt! I have run the same test without awk a couple of times, and it seems the error disappeared, so most likely this is connected, as you said.

inliquid avatar Feb 11 '25 10:02 inliquid

Are there any plans for a fix?

inliquid avatar Mar 25 '25 16:03 inliquid

I tried to reproduce this on a bare-metal (non-k8s) machine but couldn't. @inliquid, can you check whether this is also the case on your side? That is, that running the same command does not trigger the bug in a non-k8s environment.
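Concretely, something like the following on the bare-metal host (the same workload as in the report, and tetra without the pod filter since there are no pods; adjust paths/flags as needed for your setup):

while true; do cat /etc/pam.conf > /dev/null  && awk 'BEGIN {system("whoami")}' > /dev/null && sleep 0.25 || break; done

and in another terminal:

tetra getevents -o compact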

kkourt avatar Apr 11 '25 08:04 kkourt

This issue has more information, so I'll keep this one; it seems to target essentially the same underlying issue. Here's the one marked as duplicate, for more info:

  • https://github.com/cilium/tetragon/issues/2875

mtardy avatar Jul 16 '25 15:07 mtardy