auditbeat ERROR: get status request failed:failed to get audit status reply: no reply received

Open mdnfiras opened this issue 3 years ago • 6 comments

original issue: https://github.com/elastic/beats/issues/33258

Long story short: we run auditbeat as a DaemonSet on GKE clusters with slightly different versions; some nodes run docker, others run containerd.

It runs with all the permissions it needs, and journald has already been unregistered by an initContainer so that auditbeat can receive audit events. The problem is that random auditbeat pods keep outputting this error until we restart them:

ERROR: get status request failed:failed to get audit status reply: no reply received

And if we restart a perfectly healthy auditbeat pod, it might start outputting that error too.

It doesn't, however, stop writing audit logs to Elasticsearch. We get as many audit logs from the pods that output the error as from the other pods.

I traced down the error to this block of code: https://github.com/elastic/go-libaudit/blob/6fba496da1d8846f7b00fecf719fb5aa43f0e91d/audit.go#L496-L498

Wouldn't it be okay if msgs was empty? At this point we already got through this without any error: https://github.com/elastic/go-libaudit/blob/6fba496da1d8846f7b00fecf719fb5aa43f0e91d/audit.go#L480-L494

And func (c *NetlinkClient) Receive() already has the appropriate error checks here: https://github.com/elastic/go-libaudit/blob/6fba496da1d8846f7b00fecf719fb5aa43f0e91d/netlink.go#L152-L190

Shouldn't len(msgs) == 0 be reported as a warning instead of an error?

mdnfiras avatar Oct 19 '22 10:10 mdnfiras

We could define the error returned by getReply as a warning-only sentinel, but it would be good to understand why the systems you are running demonstrate this behaviour. The only path that explains this is when *NetlinkClient.Receive keeps getting EINTR or EAGAIN from syscall.Recvfrom. Do you have any ideas why your hosts would either not be sending the messages or be seeing heavy use of interrupts? Can you determine which of these is the case?
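
To make the sentinel idea concrete, here is a minimal standalone sketch. It is not the go-libaudit code. ErrNoReply, receiveFunc, retryStats, severity and alwaysEAGAIN are all hypothetical names made up for illustration, and the retry loop is only an assumed shape of what getReply does. It counts EINTR and EAGAIN separately so the final error says which of the two cases was hit, and it lets a caller downgrade "no reply received" to a warning with errors.Is.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// ErrNoReply is a hypothetical sentinel error; it is an assumption for this
// sketch and not something go-libaudit is known to export.
var ErrNoReply = errors.New("no reply received")

// receiveFunc stands in for a non-blocking netlink read (hypothetical type).
type receiveFunc func() ([]string, error)

// retryStats records why attempts came back without a message.
type retryStats struct {
	EINTR  int
	EAGAIN int
}

// getReply retries recv up to attempts times. EINTR and EAGAIN are counted
// and retried; any other error is returned immediately. If no message arrives,
// the returned error wraps ErrNoReply and includes the counters.
// A real loop would probably sleep briefly on EAGAIN; that is omitted here.
func getReply(recv receiveFunc, attempts int) ([]string, retryStats, error) {
	var stats retryStats
	for i := 0; i < attempts; i++ {
		msgs, err := recv()
		switch {
		case err == nil && len(msgs) > 0:
			return msgs, stats, nil
		case err == nil:
			// Successful read that carried no messages: nothing to retry.
			return nil, stats, ErrNoReply
		case errors.Is(err, syscall.EINTR):
			stats.EINTR++
		case errors.Is(err, syscall.EAGAIN):
			stats.EAGAIN++
		default:
			return nil, stats, fmt.Errorf("error receiving audit reply: %w", err)
		}
	}
	return nil, stats, fmt.Errorf("%w after %d attempts (EINTR=%d, EAGAIN=%d)",
		ErrNoReply, attempts, stats.EINTR, stats.EAGAIN)
}

// severity maps an error to a log level: the sentinel becomes a warning,
// everything else stays an error.
func severity(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, ErrNoReply):
		return "warn"
	default:
		return "error"
	}
}

// alwaysEAGAIN simulates a socket that never has data ready.
func alwaysEAGAIN() ([]string, error) {
	return nil, syscall.EAGAIN
}

func main() {
	_, stats, err := getReply(alwaysEAGAIN, 3)
	fmt.Println(severity(err), stats.EAGAIN, stats.EINTR, err)
}
```

With counters like these in the log line, a run dominated by EAGAIN would suggest the kernel is not sending the reply in time, while a run dominated by EINTR would suggest the read is being interrupted by signals.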

efd6 avatar Nov 14 '22 01:11 efd6

@leweafan Can you clarify?

efd6 avatar Nov 30 '22 20:11 efd6

Thanks. Can you explain how that relates to this issue? I think I am missing something.

efd6 avatar Dec 01 '22 10:12 efd6