Idea: healing/patching missing process/container info on the fly
Motivation
One problem that comes up over and over is <NA> information for processes/containers. We believe this most often results from an earlier dropped clone/vfork/fork/etc. event that would normally create the new threadinfo struct. The drops might result from an under-resourced falco/sysdig instance, or a one-off stall in processing that causes the shared buffer between kernel and user space to overflow.
This can cascade to missing container info as well: part of sinsp_parser::parse_clone_exit involves launching a container lookup based on the thread's cgroup. If the clone is missed, the container lookup may never occur, leaving the container unresolved too.
Currently, there's no way to heal this missing information, so we're stuck with output like ... proc.name=<NA> container.image=<NA> for all events involving that thread from that point forward.
I think we can do a better job at healing this missing information on the fly, instead of just dealing with the missing threadinfo/container info for the rest of the lifetime of the process.
Feature
In sinsp_thread_manager::get_thread_ref(), there's a query_os_if_not_found param that, when set, searches /proc for a missing threadinfo. However, this lookup is fairly expensive to do on the event-processing thread, so most of the time this param is set to false.
I think we should change this so the lookup happens asynchronously, as we already do for container information. The idea: the lookup reads all the required information from /proc, produces a synthetic event analogous to PPME_CONTAINER_JSON_E, and queues that event on a queue held by the inspector. Calls to sinsp::next() check the queue, and any queued synthetic event is processed before the next driver event.
I think the synthetic event should have its own event type, so we can clearly identify when missing process information needed an asynchronous lookup. But its params could be identical to PPME_SYSCALL_CLONE_20_X, carrying all the essential information about the missing thread. That way, we can use the existing parser code essentially unchanged (other than adding the new event type), which will produce the right side effects: creating the new threadinfo and starting a container lookup if the thread happens to be in a cgroup.
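A minimal sketch of the queue handoff described above, using simplified stand-ins for the real sinsp types (synthetic_evt and pending_evt_queue are hypothetical names for illustration; the real code would use sinsp_evt and the inspector's pending-events queue):

```cpp
#include <cstdint>
#include <mutex>
#include <optional>
#include <queue>
#include <string>

// Hypothetical, simplified stand-in for a synthetic event produced
// by an async /proc lookup. The real event would carry the full set
// of clone-like params.
struct synthetic_evt {
    uint16_t type;    // e.g. a new PPME_* value for async proc lookups
    int64_t tid;      // thread the lookup resolved
    std::string comm; // process name read from /proc/<tid>/comm
};

// Thread-safe pending queue, analogous in spirit to the
// m_pending_state_evts queue mentioned later in the thread.
// The async lookup thread pushes; the event loop pops each
// queued event before fetching the next driver event.
class pending_evt_queue {
public:
    void push(synthetic_evt e) {
        std::lock_guard<std::mutex> g(m_mtx);
        m_q.push(std::move(e));
    }
    std::optional<synthetic_evt> try_pop() {
        std::lock_guard<std::mutex> g(m_mtx);
        if (m_q.empty()) {
            return std::nullopt;
        }
        synthetic_evt e = std::move(m_q.front());
        m_q.pop();
        return e;
    }
private:
    std::mutex m_mtx;
    std::queue<synthetic_evt> m_q;
};
```

In this shape, sinsp::next() would call try_pop() first and, if it returns an event, hand that to the parsers before reading from the driver.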
We would want to add some safety controls to make sure that only one lookup occurs per thread at a time, that the overall number of async lookups is capped, and that a failed lookup for a tid isn't retried over and over.
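A rough sketch of what those three controls might look like together (lookup_throttle and its methods are hypothetical names, not existing libsinsp API):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Hypothetical throttle implementing the safety controls above:
// at most one in-flight lookup per tid, a cap on concurrent
// lookups, and no retries for tids whose lookup already failed.
class lookup_throttle {
public:
    explicit lookup_throttle(size_t max_in_flight)
        : m_max_in_flight(max_in_flight) {}

    // Returns true if a lookup for this tid may start now.
    bool try_begin(int64_t tid) {
        if (m_failed.count(tid)) {
            return false; // failed lookups aren't retried
        }
        if (m_in_flight.count(tid)) {
            return false; // one lookup per tid at a time
        }
        if (m_in_flight.size() >= m_max_in_flight) {
            return false; // global cap on concurrent lookups
        }
        m_in_flight.insert(tid);
        return true;
    }

    // Called when the async lookup finishes.
    void end(int64_t tid, bool success) {
        m_in_flight.erase(tid);
        if (!success) {
            m_failed.insert(tid);
        }
    }

private:
    size_t m_max_in_flight;
    std::unordered_set<int64_t> m_in_flight;
    std::unordered_set<int64_t> m_failed;
};
```

A real implementation would also want to age out the failed set (e.g. after the tid is reused or a TTL expires), which is omitted here.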
We'd also need to handle race conditions where the process exits before the async lookup completes. But our threadinfo teardown path already has a lazy, garbage-collection-style cleanup, so I think we may be okay.
I think this would go a long way towards fixing missing info on the fly in the face of dropped events.
I think we could extend this approach to FDs as well, btw. I think we have more pain at the moment related to threadinfo/container info, so I think we could tackle those first and see if it helps. If it does, then we could do the same for FDs.
Hi Mark!
I strongly agree with this; note moreover that changes in #220 already "renamed" the m_pending_container_evts queue to a more generic m_pending_state_evts; it is used for users/groups events too.
Therefore, expanding it to new events will be logically straightforward.
And I agree that the lookup should be async; or we could forcefully resync threadinfos from /proc every N seconds.
I mean, I am not strongly against a new thread, of course, but that would work as well.
We could re-scan from /proc only the threads we are missing info for; that should be less impactful, and we could do it perhaps every 5 seconds or so.
Note, moreover, that this would also fix a weird issue with arm64 support: it seems arm64 never calls the sys_exit tracepoints for execve and clone(child). Therefore we would need a similar mechanism in place.
All in all, :+1: for this :) Thanks for coming up with the ideas and for opening the issue!
This is a great point and I think this would be a great contribution Mark 😄 . I would prefer Fede's proposal of expanding the m_pending_state_evts queue that is already in place if possible, so that we maintain a minimal number of synchronization points that can slow down the main next() thread. But this is more of an implementation detail, so I'll leave it to the implementer of this feature!
Another concern on top of my mind is how the new synthetic event, similar to PPME_CONTAINER_JSON_E, will be handled by the scap file dumper/reader. I think we have a confusing design in which the API boundaries of libsinsp and libscap get violated to write and replay container events, and it would be cool if we could make it better this time. Maybe there's no silver bullet for this, but I just wanted to raise the point.
Another concern on top of my mind is how the new synthetic event, similar to PPME_CONTAINER_JSON_E, will be handled by the scap file dumper/reader.
I think the dumper/reader is exactly why you need an event to denote this information. Imagine a live capture where events are also being written to a capture file. The live capture misses some clone and therefore misses the process info. Assuming some kind of async lookup is in place, the missing process info will eventually be fixed by the async lookup, and later live events for that process will have process info. If you do not write some kind of event to the capture that contains the previously missing info about the process, a replay of the capture file will not have the missing process info. There's no opportunity to do any kind of live lookup during replay, of course.
I strongly agree with this; note moreover that changes in https://github.com/falcosecurity/libs/pull/220 already "renamed" the m_pending_container_evts queue to a more generic m_pending_state_evts; it is used for users/groups events too.
One argument for using separate queues is that it allows for separate asynchronous lookup threads writing to different queues. With a single queue, all the writers have to synchronize writes to it. Maybe this isn't so bad, since queue writes will only occur as often as container/process lookups do. We can discuss when I have a PR to comment on :)
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle stale
/remove-lifecycle stale
Thanks @mstemm! Since I am already looking into the container engine, I can give it a try.
I have some new suspicions based on initial data from the new libsinsp state metrics. Let me observe a bit more and do some more checks. You are definitely onto something here; let's dive into it more.
/assign
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Provide feedback via https://github.com/falcosecurity/community. /close
@poiana: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Provide feedback via https://github.com/falcosecurity/community.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/remove-lifecycle rotten