Kernel Version Testing Framework improvements
See #1191
These are some improvements that need to land for the kernel version testing framework.
Needed before "v1":
- [x] Donate https://github.com/alacuku/e2e-falco-tests to falcosecurity (tracked by https://github.com/falcosecurity/evolution/issues/282)
- [x] Copy all images used by the matrix (ie: https://github.com/falcosecurity/kernel-testing/blob/main/ansible-playbooks/group_vars/all/vars.yml#L18) under the falcosecurity dockerhub repo
- [x] add a CI on the kernel-testing repo to automatically push images (https://github.com/falcosecurity/kernel-testing/pull/70/files; NEEDS new self-hosted runners deployed to support the `kernel-testing` repo)
v2 stuff
- [ ] Switch to the `drivers_test` executable instead of scap-open to also verify the drivers' correct behavior
- [ ] Terraform for nodes deployment
- [ ] Cache ignite root somehow (ie: only rebuild the ignite root used for the VMs when changes to dockerfiles are made); this would greatly speed up tests duration
- [x] attach the matrixes markdown to the github release page for new driver releases (https://github.com/falcosecurity/libs/pull/1238)
- [ ] ~~upstream our ignite patch from https://github.com/therealbobo/ignite~~ upstream project is archived
- [ ] avoid using any weaveworks docker images as weaveworks is shutting down:
  - [ ] `weaveworks/ignite-kernel:5.14.16`
  - [ ] `weaveworks/ubuntu-kernel:5.14.16`
Future ideas
- [ ] Automatically fetch needed info (kernel images, modules and so on) from kernel-crawler
- [ ] Automatically build the input test matrix (ie: the list of images to be tested) from the weekly kernel-crawler output (eg: add 1 image per crawled distro each week, enlarging our input test matrix)
- [ ] make ignite concurrent (right now it does not support concurrent runs at all, preventing us from adding kernel tests to the PR CI)
> Copy all images used by the matrix (ie: https://github.com/falcosecurity/kernel-testing/blob/main/ansible-playbooks/group_vars/all/vars.yml#L18) under the falcosecurity dockerhub repo
The coolest thing we can do is add a CI to the kernel-testing repo that automatically pushes images to ghcr if needed after a new release. Right now it is a bit hard because we don't have access to the arm64 node used for kernel-testing (its self-hosted runner is linked to the libs repo), so we are not able to build and push arm64 images natively. And pushing 6 "big" images using QEMU would take hours and hours.
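A minimal sketch of what such a workflow could look like, assuming we eventually get an arm64 self-hosted runner attached to the kernel-testing repo; the workflow name, image path and tags below are placeholders, not the actual repo layout:

```yaml
# hypothetical .github/workflows/push-images.yml sketch, not the real kernel-testing CI
name: push-images
on:
  push:
    tags: ["*"]
jobs:
  build-arm64:
    # assumes an arm64 self-hosted runner attached to this repo;
    # today that runner is linked to the libs repo instead
    runs-on: [self-hosted, ARM64]
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          context: images/example-image   # placeholder folder
          platforms: linux/arm64
          push: true
          tags: ghcr.io/falcosecurity/kernel-testing/example-image:latest  # placeholder tag
```

Building natively on an arm64 runner is what would avoid the hours-long QEMU emulation mentioned above.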
Ideas for v3:
- What is the response to a failed test? Since the CI tests use the optimal compiler version, a failure means the distributed artifact is not working. Do we try a different compiler version (likely more relevant for the bpf drivers)? Something else?
- More for us developers and maintainers: the localhost VM tests focus more on testing different compiler versions in addition to looping through a few kernels. Historically this has been valuable to spot possible regressions, in particular in the bpf drivers. It's related to the suggestion above.
Related to the CI that pushes the images, it would be nice to cache those images on the runner for both docker and ignite. That would speed up the testing process.
I think that it would actually just work :tm: if we use the same nodes to push images and run the tests, right?
For the docker images, the answer is yes, but we need to remove the ones cached by ignite and import the new ones.
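A rough sketch of what that cleanup/re-import could look like as Ansible tasks, assuming ignite's `image rm`/`image import` subcommands and a placeholder image name:

```yaml
# hypothetical tasks; image name is a placeholder
- name: Drop the ignite-cached copy of the builder image
  ansible.builtin.command: ignite image rm ghcr.io/falcosecurity/kernel-testing/example-image:latest
  ignore_errors: true   # the image may not be cached yet on a fresh node

- name: Re-import the freshly pushed image into ignite's cache
  ansible.builtin.command: ignite image import ghcr.io/falcosecurity/kernel-testing/example-image:latest
```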
First drivers release with matrixes attached: https://github.com/falcosecurity/libs/releases/tag/5.1.0%2Bdriver
Since ignite has been archived, we either:
- keep the ignite fork from therealbobo (and donate it to falcosecurity)
- or need to switch to https://github.com/weaveworks-liquidmetal/flintlock
So, https://github.com/falcosecurity/kernel-testing/pull/70 and https://github.com/falcosecurity/kernel-testing/pull/74 were merged and we now have:
- images pushed automatically with:
maintag, and$tag,latesttags for releases. Moreover, build is also tested in PR whenimages/subfolder is modified. See github packages for the repo (since images are pushed to ghcr): https://github.com/orgs/falcosecurity/packages?repo_name=kernel-testing - the repo now also provides a composite github action that will be used on libs in place of our own test-kernels job
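For reference, consuming such a composite action from a libs workflow would look roughly like the snippet below; the action ref and input names are assumptions, the real ones are defined in the kernel-testing repo:

```yaml
# hypothetical job for the libs workflow; action ref and inputs are assumptions
test-kernels:
  runs-on: [self-hosted, ARM64]
  steps:
    - uses: actions/checkout@v4
    - name: Run the kernel-testing matrix
      uses: falcosecurity/kernel-testing@main   # placeholder ref/path
      with:
        libsversion: ${{ github.sha }}          # placeholder input name
```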
I am currently:
- adding a clang-7 ubuntu 20.04 matrix entry to retire the last CircleCI job from libs: https://app.circleci.com/pipelines/github/falcosecurity/libs/3254/workflows/bb2271b4-89ac-4a73-99b2-39b3e9b8f786/jobs/8623. It will use the exact same kernel release (ie: 5.8.0-1041-aws)
- adding a new role to run drivers tests (instead of scap-open)
Then, we will need to either fork ignite and improve it to suit our needs, switch to flintlock, or find something else. Moreover, we also rely on weaveworks/ignite-kernel:5.14.16 as the kernel image for builders; given that weaveworks is shutting down (https://news.ycombinator.com/item?id=39262650), we should probably either copy those images under falcosecurity or just use one of our own kernel images.
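As an illustration of the second option, the builders' kernel image could become a variable we own; the variable name below is hypothetical, not the actual one in group_vars:

```yaml
# hypothetical group_vars override: point builders at a falcosecurity-owned
# kernel image instead of weaveworks/ignite-kernel:5.14.16
builder_kernel_image: "ghcr.io/falcosecurity/kernel-testing/ignite-kernel:5.14.16"
```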
> Cache ignite root somehow (ie: only rebuild the ignite root used for the VMs when changes to dockerfiles are made); this would greatly speed up tests duration
The idea would be to let the kernel-testing repo access the cncf nodes; then:
- the images would be built on the cncf nodes
- main and release CI would avoid setting the `CLEANUP` env, so that `main` and `$tag` images are already cached on the nodes
- moreover, we could introduce a new playbook that creates ignite roots plus one that cleans them, and call them in the release CI, like: `ansible-playbook cleanup-roots.yml && ansible-playbook generate-roots.yml`. We first clean up the existing roots, then generate the new ones. After this, `main.yml` should avoid deleting/generating the roots each time.
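A minimal sketch of how the release CI could chain those playbooks (the playbook names are taken from the comment above, everything else is hypothetical):

```yaml
# hypothetical release-CI step: roots are regenerated only on release,
# so main.yml can skip deleting/recreating them on every run
- name: Regenerate ignite roots
  working-directory: ansible-playbooks   # placeholder path
  run: |
    ansible-playbook cleanup-roots.yml
    ansible-playbook generate-roots.yml
```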
For caching, we could try to leverage actions/cache somehow; the cache limit for github actions is 10GB, which should possibly be enough: https://github.com/actions/cache?tab=readme-ov-file#cache-limits
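A hedged sketch of what that could look like; the cached path and cache key are placeholders, and whether the ignite roots actually fit in the 10GB limit is the open question:

```yaml
# hypothetical caching step; /var/lib/firecracker is where ignite keeps its data,
# but the exact path/key to cache would need to be verified
- name: Cache ignite roots
  uses: actions/cache@v4
  with:
    path: /var/lib/firecracker
    # rebuild the cache only when the dockerfiles change
    key: ignite-roots-${{ hashFiles('images/**/Dockerfile') }}
```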
Just a quick additional note: @FedeDP I'll get back to trying to also integrate the vagrant test VM loop at the end of June, as we previously discussed, just FYI. I'll ping you to get access to the servers then.