Kernel Version Testing Framework improvements

Open FedeDP opened this issue 2 years ago • 23 comments

See #1191

These are some improvements that need to land for the kernel version testing framework.

Needed before "v1":

  • [x] Donate https://github.com/alacuku/e2e-falco-tests to falcosecurity (tracked by https://github.com/falcosecurity/evolution/issues/282)
  • [x] Copy all images used by the matrix (ie: https://github.com/falcosecurity/kernel-testing/blob/main/ansible-playbooks/group_vars/all/vars.yml#L18) under the falcosecurity dockerhub repo
  • [x] add a CI on kernel-testing repo to automatically push images (https://github.com/falcosecurity/kernel-testing/pull/70/files; NEEDS new self-hosted runners deployed to support kernel-testing repo)

v2 stuff

  • [ ] Switch to drivers_test executable instead of scap-open to also verify drivers correct behavior
  • [ ] Terraform for nodes deployment
  • [ ] Cache ignite root somehow (ie: only rebuild the ignite root used for the VMs when changes to dockerfiles are made); this would greatly speed up the tests
  • [x] attach the matrixes markdown to the github release page for new driver releases (https://github.com/falcosecurity/libs/pull/1238)
  • [ ] ~~upstream our ignite patch from https://github.com/therealbobo/ignite~~ upstream project is archived
  • [ ] avoid using any weaveworks docker images as weaveworks is shutting down:
    • [ ] weaveworks/ignite-kernel:5.14.16
    • [ ] weaveworks/ubuntu-kernel:5.14.16

Future ideas

  • [ ] Automatically fetch needed info (kernel images, modules and so on) from kernel-crawler
  • [ ] Automatically build input test matrix (ie: list of images to be tested) given weekly kernel-crawler output (ie: add eg: 1 image per each crawled distro each week, enlarging our input test matrix)
  • [ ] make ignite concurrent (right now it does not support concurrent runs at all, preventing us from adding kernel tests to PR CI)
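
The "1 image per each crawled distro each week" idea could be sketched roughly as below. This is only an illustration: the kernel-crawler output format is not shown in this issue, so the `crawled` mapping (distro name to a list of kernel releases), the `distro`/`kernelrelease` keys, and the `extend_matrix` helper are all hypothetical.

```python
def extend_matrix(matrix, crawled, per_distro=1):
    """Return a new matrix with up to `per_distro` unseen kernels per distro.

    `matrix` is a list of {"distro": ..., "kernelrelease": ...} entries and
    `crawled` maps a distro name to a list of kernel releases. Both shapes
    are assumptions, not the real kernel-crawler or kernel-testing formats.
    """
    seen = {(entry["distro"], entry["kernelrelease"]) for entry in matrix}
    result = list(matrix)  # do not mutate the caller's matrix
    for distro in sorted(crawled):
        added = 0
        for release in crawled[distro]:
            if added >= per_distro:
                break
            key = (distro, release)
            if key in seen:
                continue  # already part of the test matrix
            seen.add(key)
            result.append({"distro": distro, "kernelrelease": release})
            added += 1
    return result
```

Run weekly, a helper like this would grow the input matrix by at most one entry per distro, skipping kernels that are already tested.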

FedeDP avatar Jul 24 '23 10:07 FedeDP

Copy all images used by the matrix (ie: https://github.com/falcosecurity/kernel-testing/blob/main/ansible-playbooks/group_vars/all/vars.yml#L18) under the falcosecurity dockerhub repo

The coolest thing we can do is add a CI on the kernel-testing repo to automatically push images to ghcr if needed after a new release. Right now, it is a bit hard because we don't have any access to the arm64 node used for kernel-testing (its self-hosted runner is linked to the libs repo), thus we are not able to build and push arm64 images natively. And pushing 6 "big" images using QEMU is going to take hours and hours.

FedeDP avatar Jul 27 '23 12:07 FedeDP

Ideas for v3:

  • What is the response to a failed test? Since the CI tests use the optimal compiler version, a failure means the distributed artifact is not working. Do we try a different compiler version (likely more relevant for bpf drivers)? Something else?
  • More for us developers and maintainers: the localhost VM tests focus more on testing different compiler versions, in addition to looping through a few kernels. Historically this has been valuable for spotting possible regressions, in particular in the bpf drivers. It's related to the suggestion above.

incertum avatar Jul 27 '23 23:07 incertum

Related to the CI that pushes the images, it would be nice to cache those images on the runner for both docker and ignite. That would speed up the testing process.

alacuku avatar Aug 01 '23 07:08 alacuku

I think that it would actually just work :tm: if we use the same nodes to push images and run the tests, right?

FedeDP avatar Aug 01 '23 07:08 FedeDP

For the docker images, the answer is yes, but we need to remove the ones cached by ignite and import the new ones.

alacuku avatar Aug 01 '23 07:08 alacuku

First drivers release with matrixes attached: https://github.com/falcosecurity/libs/releases/tag/5.1.0%2Bdriver

FedeDP avatar Aug 01 '23 09:08 FedeDP

Since ignite has been archived, we:

  • either keep the ignite fork from therealbobo (and donate it to falcosecurity)
  • need to switch to https://github.com/weaveworks-liquidmetal/flintlock

FedeDP avatar Sep 06 '23 08:09 FedeDP

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

poiana avatar Dec 05 '23 09:12 poiana

/remove-lifecycle stale

FedeDP avatar Dec 05 '23 09:12 FedeDP

So, https://github.com/falcosecurity/kernel-testing/pull/70 and https://github.com/falcosecurity/kernel-testing/pull/74 were merged and we now have:

  • images pushed automatically with the main tag, and with the $tag and latest tags for releases. Moreover, the build is also tested in PRs when the images/ subfolder is modified. See github packages for the repo (since images are pushed to ghcr): https://github.com/orgs/falcosecurity/packages?repo_name=kernel-testing
  • the repo now also provides a composite github action that will be used on libs in place of our own test-kernels job
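
As a rough illustration of the tagging scheme described above (main tag for pushes to main, $tag plus latest for releases), here is a minimal sketch. The `image_tags` helper is hypothetical and is not taken from the kernel-testing workflows; it only assumes the usual `refs/heads/...` and `refs/tags/...` git ref names.

```python
def image_tags(git_ref):
    """Map a git ref to the image tags that would be pushed for it.

    Hypothetical helper: pushes to main get the "main" tag, release tags
    get "$tag" and "latest", anything else pushes nothing.
    """
    if git_ref == "refs/heads/main":
        return ["main"]
    prefix = "refs/tags/"
    if git_ref.startswith(prefix) and len(git_ref) > len(prefix):
        return [git_ref[len(prefix):], "latest"]
    return []
```

A PR ref returns an empty list, which matches the behavior described above where the build is only tested, not pushed.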

I am currently:

  • adding a clang-7 ubuntu 20.04 matrix entry to retire the last CircleCI job from libs: https://app.circleci.com/pipelines/github/falcosecurity/libs/3254/workflows/bb2271b4-89ac-4a73-99b2-39b3e9b8f786/jobs/8623. It will use the exact same kernel release (ie: 5.8.0-1041-aws)
  • adding a new role to run drivers tests (instead of scap-open)

Then, we will need to either fork ignite and improve it to suit our needs, switch to flintlock, or find something else. Moreover, we also rely on weaveworks/ignite-kernel:5.14.16 as the kernel image for builders; given that weaveworks is shutting down (https://news.ycombinator.com/item?id=39262650), we should probably either copy those images under falcosecurity or just use one of our kernel images.

FedeDP avatar Feb 07 '24 08:02 FedeDP

Cache ignite root somehow (ie: only rebuild the ignite root used for the VMs when changes to dockerfiles are made); this would greatly speed up the tests

Idea would be to let the kernel-testing repo access the cncf nodes, then:

  • the images would be built on the cncf nodes
  • main and release CI would avoid setting CLEANUP env, so that main and $tag images are already cached on the nodes
  • moreover, we could introduce a new playbook that creates ignite roots + one that cleans them, and call them in the release CI, like: ansible-playbook cleanup-roots.yml && ansible-playbook generate-roots.yml. We first clean up the existing roots, then generate the new ones. After this, main.yml should avoid deleting/generating the roots each time.
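
One possible way to decide when the roots actually need regenerating ("only rebuild when changes to dockerfiles are made") is to fingerprint the image sources and compare against a stamp left by the last build. The sketch below is an assumption, not existing kernel-testing code: the helper names, the `images_dir` layout, and the stamp file are all hypothetical, and the stamp is assumed to live outside `images_dir`.

```python
import hashlib
from pathlib import Path


def dockerfiles_digest(images_dir):
    """Hash every file under images_dir (relative path + content) into one digest."""
    digest = hashlib.sha256()
    for path in sorted(Path(images_dir).rglob("*")):
        if path.is_file():
            digest.update(path.relative_to(images_dir).as_posix().encode())
            digest.update(b"\0")
            digest.update(path.read_bytes())
    return digest.hexdigest()


def roots_need_rebuild(images_dir, stamp_file):
    """True when no stamp exists or the image sources changed since the stamp."""
    stamp = Path(stamp_file)
    current = dockerfiles_digest(images_dir)
    return not (stamp.is_file() and stamp.read_text().strip() == current)


def record_roots_built(images_dir, stamp_file):
    """Store the current digest; call this after the roots were regenerated."""
    Path(stamp_file).write_text(dockerfiles_digest(images_dir) + "\n")
```

With something like this, the cleanup and generate playbooks would only run when `roots_need_rebuild(...)` returns True, followed by `record_roots_built(...)`.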

FedeDP avatar Feb 08 '24 08:02 FedeDP

For caching, we could try to leverage actions/cache somehow; the cache limit for github actions is 10GB, which should possibly be enough: https://github.com/actions/cache?tab=readme-ov-file#cache-limits
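
Whether the roots would fit under that limit could be checked with something like the sketch below before wiring up actions/cache. The helpers are hypothetical, and "10GB" is read here as 10 GiB, which is an assumption.

```python
import os

CACHE_LIMIT_BYTES = 10 * 1024**3  # assumption: "10GB" read as 10 GiB


def directory_size(root):
    """Total size in bytes of the regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_in_cache(root, limit=CACHE_LIMIT_BYTES):
    """True when everything under root is within the given cache limit."""
    return directory_size(root) <= limit
```

Note that this measures the uncompressed size on disk, so it is only a conservative estimate of what would be stored in the cache.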

FedeDP avatar May 08 '24 09:05 FedeDP

Just a quick additional note: @FedeDP I'll get back to trying to also integrate the vagrant test VM loop at the end of June as we previously discussed, just FYI. I'll ping you to get access to the servers then.

incertum avatar May 08 '24 09:05 incertum
