tetragon icon indicating copy to clipboard operation
tetragon copied to clipboard

feat: include ancestors in process events

Open t0x01 opened this issue 1 year ago • 4 comments

Fixes 2420

Description

Reason: Option to include all ancestors of the process in process events can be very useful for observability and filtering purposes. E.g. to apply complex correlation rules later in data processing pipeline, or to filter out extra events.

Changes made:

  • Read and set option enable-process-ancestors from the config file. Turn option enable-process-ancestors off by default.
  • If option enable-process-ancestors is set, try to include ancestors (up to PID 1/PID 2) of the process beyond the immediate parent in process_exec, process_exit, process_uprobe, process_kprobe, process_lsm, process_tracepoint events in a respective protobuf message for the given process.
  • If option enable-process-ancestors is set and there was an error when trying to include process' ancestors in the protobuf message, add the event to eventcache for reprocessing.
  • When trying to reprocess events from eventcache, if option enable-process-ancestors is set and Ancestors is nil, try to include process' ancestors again.
  • Implement a new export filter that can filter over ancestor binary names using RE2 regular expressions.
  • Add a new test for the new export filter.
  • Add information about new features to documentation.

t0x01 avatar Sep 19 '24 14:09 t0x01

Deploy Preview for tetragon ready!

Name Link
Latest commit 91dfc9865259c8bb4b8a52741cc553d97bdba4e7
Latest deploy log https://app.netlify.com/sites/tetragon/deploys/67adee0fc390b300088c5bf3
Deploy Preview https://deploy-preview-2938--tetragon.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

netlify[bot] avatar Sep 19 '24 14:09 netlify[bot]

Hello.

Upd: found even more problems, converting back to draft for now.

t0x01 avatar Sep 26 '24 08:09 t0x01

I'll look shortly sorry was travelling and then catching up. Should be able to get to this today or tomorrow thanks!

jrfastab avatar Sep 30 '24 18:09 jrfastab

I think i misunderstood the purpose of both MsgProcessCleanupEventUnix and refcnt initially, so i no longer call RefInc / RefDec for ancestors. Sorry for that.

So now there seems to be no real reason to return []*ProcessInternal in GetAncestorProcessesInternal instead of []*tetragon.Process. And, if i change that, it will solve the double loop problem as well. But i'm not sure if i should actually do that, because returning []*ProcessInternal may be beneficial in the future for reasons i don't yet see.

The biggest obstacle now is that due to current implementation of process cleanup, as described in commit 45745a0, it becomes impossible to reconstruct full process ancestry in some cases. Assume the following scenario:

  clone()             id=1
    exec()            id=2 [will also cleanup id=1]
    exec()            id=3 [will also cleanup id=2]
    ...
    exec()            id=n-1 [will also cleanup id=n-2]
    exec()            id=n [will also cleanup id=n-1]
  exit()              id=n

If n > 3, then all processes with id < n-1 would have their refcnt set to 0 and as a result removed from the event cache, breaking the ancestry chain. I don't think i can resolve that without introducing any potentially breaking changes, as this PR already has quite a lot of code to review.

And there is still inconsistency in the ancestors field's value across protobuf messages in api/v1/tetragon/tetragon.proto and vendor/github.com/cilium/tetragon/api/v1/tetragon/tetragon.proto. I'm still not sure how to properly handle that.

@jrfastab, may I ask you for a code review now? Sorry for the delay.

t0x01 avatar Oct 10 '24 10:10 t0x01

Hello @olsajiri @tpapagian .

I made some changes based on your suggestions:

  • GetAncestorProcesses, GetAncestorProcessesInternal now return an ancestors slice and an error. If we encounter an error, the returned ancestors slice will contain all ancestors we were able to collect up to that moment.
  • Conditions for ancestors collection and adding events to the eventcache moved to NeededAncestors, NeededAncestorsInternal and NeededAncestors.
  • MsgExecveEventUnix and MsgCloneEventUnix events now increase reference counter of all ancestors if enable-process-ancestors option is set. MsgExitEventUnix and MsgProcessCleanupEventUnix - decrease reference counter of all ancestors if enable-process-ancestors option is set.

I also renamed some variables for the sake of consistency.

GetAncestorProcesses/GetAncestorProcessesInternal and NeededAncestors/NeededAncestorsInternal are there just because in some functions we only need *tetragon.Process ancestors, but not *process.ProcessInternal ancestors. This might not be the best approach, so, please, tell me if i need to change it.

t0x01 avatar Dec 12 '24 12:12 t0x01

I made some changes based on your suggestions:

Thanks for the update, I will review that possibly on Monday.

tpapagian avatar Dec 13 '24 08:12 tpapagian

Thanks for taking the time to add tests on the reference counting and sorry for the delay on reviewing that again.

I have run several manual stress tests to check also that reference counting is correct and everything seems fine.

One possibly last comment is that now you add ancestors to all event types. This can easily result in very large events and in a heavy loaded server this would also mean that we would need a large amount of storage to keep those.

I believe that adding ancestors only in process_exec events would be enough for most of the cases. Because we can then easily correlate all other events with the corresponding process_exec event and have all of its ancestors without the need to have duplicate information. Does this sound reasonable?

If you believe that we need ancestors in all events, then I would propose to have a separate command line flag to enable those selectively. i.e. --enable-process-exec-ancestors, --enable-process-exit-ancestors, --enable-process-kprobe-ancestors etc.

tpapagian avatar Jan 22 '25 13:01 tpapagian

One possibly last comment is that now you add ancestors to all event types. This can easily result in very large events and in a heavy loaded server this would also mean that we would need a large amount of storage to keep those.

I believe that adding ancestors only in process_exec events would be enough for most of the cases. Because we can then easily correlate all other events with the corresponding process_exec event and have all of its ancestors without the need to have duplicate information. Does this sound reasonable?

If you believe that we need ancestors in all events, then I would propose to have a separate command line flag to enable those selectively. i.e. --enable-process-exec-ancestors, --enable-process-exit-ancestors, --enable-process-kprobe-ancestors etc.

I would prefer having separate command line flags then. I mainly use tetragon with SIEM/XDR like systems, and i would like to have an option to have ancestor processes included in all mentioned events, so i can use them in correlation rules logic inside these systems. Additionally, i would like to also have ancestor processes for at least some types of these events to be able to filter out some of them with export-denylist filters.

I agree that we can populate ancestors in all other events from corresponding process_exec event. But, for example, there can be cases where i want to filter out process_exec events and only have process_kprobe events exported for some processes. So in these cases i would not have an ability to populate ancestors later in a pipeline this way. I think that having separate options should be better, as it would provide more control over what types of events we need to populate with ancestors / export / filter out in different situations.

If this is fine, i think i can adjust my code to use separate options, but i'll need some time to implement and test it properly. One more question, though: can i at least combine --enable-process-exec-ancestors and --enable-process-exit-ancestors into one option? I'll have to collect ancestors in both these types of events anyway, if any of these 2 options are enabled, since i'll have to call RefInc/RefDec for ancestors inside both of them. If ancestors would not be needed for process_exec or process_exit events, i think field-filters can be used to remove them.

t0x01 avatar Jan 23 '25 15:01 t0x01

If this is fine, i think i can adjust my code to use separate options, but i'll need some time to implement and test it properly.

Yes, I am fine with that as long as there is a use case in your side that can be beneficial.

can i at least combine --enable-process-exec-ancestors and --enable-process-exit-ancestors into one option?

Yes, I think this is fine as well. So --enable-process-ancestors would enable ancestors in both exec and exit events. And for all the other cases, we will have a separate flag.

If ancestors would not be needed for process_exec or process_exit events, i think field-filters can be used to remove them.

Correct, and this could work in other cases as well (i.e. kprobes). But I want to avoid the overhead of collecting ancestors and then dropping them. In the exit event as you said you have to collect them and do the refDec. So there is not unnecessary overhead there.

tpapagian avatar Jan 24 '25 08:01 tpapagian

Thanks for the resilience @t0x01: "started on on Sep 19, 2024". And thanks @tpapagian and @olsajiri for the support. :)

mtardy avatar Feb 14 '25 08:02 mtardy