Add kernel signals exe_ino, exe_ino_ctime, exe_ino_mtime, pidns_init_start_ts + derived filter fields
What type of PR is this?
Uncomment one (or more) `/kind <>` lines:
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
Any specific area of the project related to this PR?
Uncomment one (or more) `/area <>` lines:
/area API-version
/area build
/area CI
/area driver-kmod
/area driver-bpf
/area driver-modern-bpf
/area libscap-engine-bpf
/area libscap-engine-gvisor
/area libscap-engine-kmod
/area libscap-engine-modern-bpf
/area libscap-engine-nodriver
/area libscap-engine-noop
/area libscap-engine-source-plugin
/area libscap-engine-savefile
/area libscap-engine-udig
/area libscap
/area libpman
/area libsinsp
/area tests
/area proposals
Does this PR require a change in the driver versions?
/version driver-API-version-major
/version driver-API-version-minor
/version driver-API-version-patch
/version driver-SCHEMA-version-major
/version driver-SCHEMA-version-minor
/version driver-SCHEMA-version-patch
What this PR does / why we need it:
Dropping an implant, making the file executable, and executing the implant is one of the oldest tricks. While memory-based cyber attacks mostly circumvent touching disk, reliably detecting drift, that is, the execution of a suspicious new executable, is often considered a crucial baseline detection.
Falco's upstream rules "Container Drift Detected (chmod)" and "Container Drift Detected (open+create)" aim to detect the creation of a new executable in a container (drift). However, both rules are disabled by default, because they can be noisy in un-profiled environments and workloads. Finally, there is currently no easy or robust mechanism to correlate the file operation events those rules are based on with the events where the executable is run (`execve`).
This PR attempts to address this gap by adding enhanced kernel signals to spawned processes. While the proposed signals won't replace the need to monitor file operation events, they can help reduce the search space for tracking spawned processes where, for example, `chmod +x` was run against the executable file on disk prior to execution (this causes the inode's ctime to change, but we don't know whether it was chmod-related or a different status change operation). In addition, end users could use these fields in selected rules to augment the information available for incident response.
New derived filter fields based on the new kernel signals:

- `proc.exe_ino`: the inode number of the executable image file on disk. Can be correlated with `fd.ino`.
- `proc.exe_ino.ctime`: last status change time (ctime, epoch ns) of the executable image file on disk (`inode->ctime`). Time is changed by writing or by setting inode information, e.g. owner, group, link count, mode etc.
- `proc.exe_ino.mtime`: last modification time (mtime, epoch ns) of the executable image file on disk (`inode->mtime`). Time is changed by file modifications, e.g. by mknod, truncate, utime, write of more than zero bytes etc. For tracking changes in owner, group, link count or mode, use `proc.exe_ino.ctime` instead.
- `proc.exe_ino.ctime_duration_proc_start`: number of nanoseconds between modifying the status of the executable image and spawning a new process using the changed executable image.
- `proc.exe_ino.ctime_duration_pidns_start`: number of nanoseconds between the pid namespace start ts and the ctime of the exe file, if the pidns start predates the ctime.
- `proc.pidns_init_start_ts`: start ts (epoch ns) of the pid namespace; approximate start ts of the container if the pid is in a container, or start ts of the host if the pid is in the host namespace.
- `container.start_ts`: container start ts (epoch ns), based on `proc.pidns_init_start_ts`.
- `container.duration`: number of nanoseconds since the container start ts.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Includes cleanup, mainly making the `sched_prog_exec_4` and `execve_family_flags` fillers alike in terms of style. Refactored (no logic changes) `get_exe_writable` to avoid a few redundant `_READ()`s on the same kernel structures within the same filler (@LucaGuerra).
This PR is not yet ready. Hoping for some early feedback to make these new signals better :)
Checklist (this PR)
- [x] container.duration
- [x] proc.exe_ino.ctime_duration_pidns_start (how much time has passed since container start or host re-boot until status change of exe file on disk)
- [x] more ideas / different ideas
- https://github.com/falcosecurity/libs/issues/615
- https://github.com/falcosecurity/libs/issues/621
- [x] kmod
- [x] modern_bpf
- [x] scap procs, scap file
- [x] bump driver schema version
Checklist (future PR)
- Initial attempt at on-host anomaly detection for the container drift use case (the expectation would be to first PR a proposal doc, as this would mark a significant new feature)
Does this PR introduce a user-facing change?:
Add kernel signals exe_ino, exe_ino_ctime, exe_ino_mtime, pidns_init_start_ts + derived filter fields
This is a huge PR @incertum! A small note that must be addressed before leaving "wip" state:
- I think `scap_procs` must be updated to fetch the new info from proc, if we can, right?

Aside from this, it looks really cool, thank you!
Hello @incertum! The container drift detection seems to be something really significant, thanks for spending time on it and trying to detect this kind of behavior! Since I see in your checklist that you are open to discuss also different ideas, although this one is really cool, I want to ask you what you think about this other approach that I came up with and described here:
https://github.com/falcosecurity/libs/pull/287
The main problem of this approach is that it relies on overlayfs and so it cannot work with old kernels and container runtimes that do not use it. It also needs to be tested across a wider variety of kernels to be sure that it's working, since it was like an experiment for me. I would be happy to know what you think about it!
@FedeDP ty, let me look into scap-procs. Once feature complete, I will implement this for modern_bpf, scap file and kmod; always leave the kmod fun for the end :) Also still need to test this on more kernel versions and distros than just the one I was quickly developing on ...
@loresuso was actually lurking around that overlayfs PR a good while ago. Thanks for experimenting :heart:! In general I believe more and stronger kernel signals, just like the one you proposed, are needed; let's chat more.
What is needed to merge it? I approve; really nice work. I think this is an excellent feature that adds even more signal for the container use case, and it's OK that it doesn't work for super old kernels etc. Besides containers, I would also be interested in nailing this for bare-metal hosts.
@loresuso more signals are needed for detecting memory attacks or RCE in a more general and robust way (executables are just one aspect), one step at a time though. And I saw you also refactored `get_exe_writable` and created a similar `get_exe_inode`, lol; we can sync on how to merge this cleanup into one approach.
Also, once all the new kernel signals we can come up with at the moment are merged, wanna team up on creating strong and robust userspace logic to nail it? It would be amazing if some rules come out at the other end that can be enabled by default, aka they can work in unknown environments. I called it anomaly detection, but we can also call it advanced signal correlation etc. :upside_down_face:
Re fetching the container start time: the pid namespace creation time works too. Still monkeying around with whether this is best implemented kernel side, something like fetching the start time of pid=1 as seen from the process namespace, or the creation ts of the pid namespace the process belongs to ... would you have any thoughts on this?
Hey folks, I'd like to add my thoughts to the discussion since I originally introduced the `is_exe_writable` flag for this purpose, discussed a lot with Lorenzo about its evolution `is_exe_upper_layer`, and am very interested in basically catching suspicious executions. While it's true that attacks can be fully in memory (which would bypass any file-based rule, of course), we all know that a defense-in-depth strategy needs to consider many cases. Also, I expect the most common attacks to be indeed file based. This is a bit of a larger discussion that we may want to expand somewhere.
Regarding attack scenarios, the proposed fields would allow us to add another way to filter events to try and reduce the noise from this kind of rules. I would love for Falco to be able to have a set of rules to deal with the standard "drop + execute" case. This is what comes to mind:
- In containers you can do, depending on how your container is built, one of two things: you either use `is_exe_writable` with containers that run as a regular user but have executable files normally owned by root (this is the default if you run as user!), or `is_exe_upper_layer`, which works with containers executed as root as well 😎 This alerts for new executables at all times.
- On hosts, in my opinion, the best bet is `is_exe_writable`, and to inspect non-root users, because root on a host does way too many things 😭 Installing and updating software is common, and downloading and running software happens often during normal deployments ... So many regular actions would trigger drop+execute that this may be pretty useless :/ But in some deployments it's not expected for regular users to bring their own binaries, and that's what I would want to catch. Also, remember that true root can change the mtime and ctime of all files if it wants.
@incertum's idea, I think, is definitely clever, as it allows us to add the time dimension to the above. You can say "if a regular user is running an executable that they can modify AND it has been modified 'recently', then alert". This allows us to detect drops without drowning in noise caused by system-wide software updates and new deployments. Same goes for containers. In that case I like the stronger properties of `is_exe_upper_layer`, because you can't easily evade it from inside a container. Even if you drop a file today and schedule its execution for some other time, it will be caught.
In conclusion, I probably want all of these fields 😎 I actually wanted `is_exe_upper_layer` in 0.33.0, but there's so much content going into that release that we probably want to merge it right after, so we have time to test it and see that it doesn't break too many things (every new thing happening at process start, as you could see, is a little tricky...). Does it make sense to you? The first step, as you mentioned, could be to refactor and generalize the exe inode data collection in the kernel module and eBPF.
@LucaGuerra ❤️ 😎 as always a fantastic summary and technical assessment of what the actual problem here is. Fully agree that all these signals combined will be super valuable in addition to existing metadata fields. It's nice to see three folks having come to similar conclusions, that is, (1) it is at process startup where we need to fetch better kernel signals and (2) this old problem "drop+execute" has not yet been well addressed.
Of course, the "host" is the trickier one; that doesn't change the fact that I have been asked to fix / solve this ... so I'm thinking we won't get away without determining a pattern of past behavior of the applications that are running, and analyzing behaviors outside that pattern. There will be both data modeling challenges and software implementation challenges; the good news is that similar problems have been solved in the industry before and we can build upon that. Needless to say, let's start more basic and iterate.
How about first merging @loresuso's PR that features `is_exe_upper_layer` after the upcoming release freeze? I'll continue monkeying around a bit for the next 2 weeks and see if there are more kernel signals that could be valuable. Perhaps you stumble across something new as well 🙃 that would be cool.
After everything is merged, we collaborate on a fresh PR that just does userspace modeling? Also happy to offer deploying a prototype to production to better assess how well it may work, and to check that Falco does not deteriorate in case we introduce significant new userspace features.
> ... Also, I expect the most common attacks to be indeed file based. This is a bit of a larger discussion that we may want to expand somewhere.
Would you have ideas on what the best forum would be to expand those threat modeling discussions?
> Also, remember that true root can change mtime and ctime of all files if it wants.
Yeah you can never just have nice things in security, hence why I am a big fan of multi-signal correlations.
Thanks @incertum @LucaGuerra, this conversation is getting more and more interesting!
I strongly agree that all these signals combined together are needed to improve the detection capabilities for the drop+execute pattern. So, soon after the release, I'll try my best to get `exe_upper_layer` merged. Some help in testing it better before the merge would be really appreciated!
Also, I wanted to say that I am thrilled to team up altogether to discuss how to improve the detection capabilities of Falco with these new signals.
I also believe that we have to expand the conversation (maybe in Slack or a GitHub issue?) to other attack patterns. I think we may want to research a bit on fileless execution (especially the kind implemented with `memfd_create`, or execution from `tmpfs`) and post-container-escape behaviors (like accessing files outside overlayfs from non-mounted filesystems). I think these patterns are widespread too nowadays, and I have some ideas that I would love to share with you!
Edited: We have moved all brainstorming to https://github.com/falcosecurity/libs/issues/615 in order to keep this PR focused.
Kernel-side solution, for robustness reasons: add the pid namespace init task start ts to generically approximate the container or host start ts, and compute time deltas useful for detections, such as container duration, or the duration between the pidns start ts and the ctime of the exe file if the pidns ts predates the ctime. A general detection use case: if suspicious events happen in multiple containers of a deployment near container start, they are more likely to be "normal". The longer a container runs, the longer it is "exposed".
What questions do you have regarding the proposed approach? Would it be possible to check the soundness of this approach? That would be much appreciated. Initial experimentation showed correct ts values for various scenarios, but I will continue testing.
Another kernel-side signal I would like to look into, and possibly add to this PR: "interpreter scripts", aka text files with execute permissions (see https://man7.org/linux/man-pages/man2/execve.2.html).
For example, `chmod +x a.sh && ./a.sh` or `chmod +x a.sh && exec ./a.sh` is currently logged as `"proc.exepath":"/tmp/a.sh","proc.name":"a.sh","proc.cmdline":"a.sh ./a.sh"`, but the interpreter was configured as `#! /bin/sh`, and we wouldn't know which interpreter binary ran the script, or even that it was not a binary, without inferring from the file extension (if one is even available), and we know how fragile that is.
Please note, I am not talking about the use case where you run the interpreter and pass the script; e.g. `/bin/sh a.sh` would give `"proc.exepath":"/bin/sh","proc.name":"sh","proc.cmdline":"sh a.sh"`.
Any thoughts on above? @LucaGuerra @loresuso @FedeDP @Andreagit97
After that, this PR should be feature complete and I can start finalizing it, followed by a code optimization review.
This PR is not ready for review.
@LucaGuerra and @loresuso: in addition to VM tests, I deployed these changes, together with @loresuso's `is_exe_upper_layer` changes from https://github.com/falcosecurity/libs/pull/287, to production (eBPF only). It has been running for 2 weeks now and seems stable, with no unwanted CPU or memory usage increases.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: incertum
Once this PR has been reviewed and has the lgtm label, please assign gnosek for approval by writing `/assign @gnosek` in a comment. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.
ei @incertum could you rebase this on master?
Thanks for the review @Andreagit97 . Hope I got all the minor improvements properly implemented. In addition, updated the modern_bpf tests.
To the best of my knowledge, and based on empirical perf measures in prod, the clone/execve* families of syscalls seem super light and are not moving the needle in terms of perf overhead compared to the heavy, high-frequency syscalls. Perhaps it's fair to anticipate that adding another 5-7 params to the clone/execve* families should be no problem, and they would have the potential to really boost detection capabilities and accuracy. For example, near-term @loresuso plans to work on a "memfd+exec" execve flag. Also, in bpf we still need to fix the uid etc. lookups, as they do not account for user namespaces yet, so we may need to jump to another tail call then anyway.
Suggestion: what would you think of redefining the clone/fork/execve* schemas? That is, balance the params a bit so that we can try to get rid of a few lookups that are performed multiple times, and in addition implement pushdown optimizations, such as skipping the pidns start ts lookup kernel side when pid <-> vpid are the same; right now those are in different fillers, and there may be more such minor optimizations possible. If we were to do that, it's probably worth checking what other params are needed before changing the schema.
> To the best of my knowledge, and based on empirical perf measures in prod, the clone/execve* families of syscalls seem super light and are not moving the needle in terms of perf overhead compared to the heavy, high-frequency syscalls.
To be honest, I thought these process syscalls were high-frequency in a production system, good to know this is not your case :)
> Suggestion: what would you think of redefining the clone/fork/execve* schemas? That is, balance the params a bit so that we can try to get rid of a few lookups that are performed multiple times, and in addition implement pushdown optimizations, such as skipping the pidns start ts lookup kernel side when pid <-> vpid are the same; right now those are in different fillers, and there may be more such minor optimizations possible. If we were to do that, it's probably worth checking what other params are needed before changing the schema.
This could be an excellent idea if we are able to detect some useless parameters or some huge optimization. BTW, in this direction we already have possible work with https://github.com/falcosecurity/libs/pull/526; before that, we just have to define the scap-file format a little better so as not to break compatibility with old captures.
Hi @incertum
> Thanks for the review @Andreagit97. Hope I got all the minor improvements properly implemented. In addition, updated the modern_bpf tests.
That's great! I already gave your PR with the latest changes a run on `s390x`, and all tests passed successfully!
> To the best of my knowledge, and based on empirical perf measures in prod, the clone/execve* families of syscalls seem super light and are not moving the needle in terms of perf overhead compared to the heavy, high-frequency syscalls.

> To be honest, I thought these process syscalls were high-frequency in a production system, good to know this is not your case :)
I think it depends on the workload running on those systems and how much the systems are being utilized.
@Andreagit97 I will follow up by opening a ticket for brainstorming future optimizations ...
Also noticed I missed param 21 in sched fork in kmod, now fixed.
ei @incertum, thank you for the changes, I will take another look ASAP. In the meanwhile, there is an issue compiling the kernel module on kernel version 2.6.32, as you can see from the CI job; the problem should be here:
```
2022-11-30T11:31:25 DEBU /tmp/driver/ppm_fillers.c:1157:9: error: incompatible types when assigning to type 'long long unsigned int' from type 'struct timespec'
2022-11-30T11:31:25 DEBU   time = child_reaper->start_time;
2022-11-30T11:31:25 DEBU        ^
2022-11-30T11:31:25 DEBU make[2]: *** [scripts/Makefile.build:230: /tmp/driver/ppm_fillers.o] Error 1
2022-11-30T11:31:25 DEBU make[1]: Leaving directory '/tmp/kernel'
2022-11-30T11:31:25 DEBU make[1]: *** [Makefile:1448: _module_/tmp/driver] Error 2
```
I will try to fix broken tests in the next few days :)
> I will try to fix broken tests in the next few days :)
Thanks so much Andrea, this is very much appreciated :rocket: !!!
/milestone 0.11.0
Thanks a lot for the great work refactoring this PR @incertum and @Andreagit97! I went through the e2e flow again (I know there is some ongoing refactoring in the kmod). I only have a concern about the use of `get_host_boot_time_ns`, which I'm going to detail in the code.
Driver jobs should be fixed. I expect a failure from windows/macos, as @LucaGuerra pointed out. For the e2e fix, we need a rebase on master; if you want I can do that, but then I would become the committer of all commits :/
The last commit should fix the windows CI; just removed the `scap_get_host_boot_time_ns()` helper, since we already have `scap_get_boot_time()` :)
> The last commit should fix the windows CI; just removed the `scap_get_host_boot_time_ns()` helper, since we already have `scap_get_boot_time()` :)
lol classic, thanks for keeping the new more reliable method to get a constant boot ts :)
LGTM label has been added.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: FedeDP, incertum, LucaGuerra
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [FedeDP,LucaGuerra]
Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.