
Build e2e Testing CI to validate eBPF and collection

Open prologic opened this issue 5 years ago • 15 comments

prologic avatar Feb 05 '20 22:02 prologic

@Ferroin So your kernel-check.sh script works wonderfully. Thank you :)

Now I should point out that Docker for Desktop <= 2.1 is not enough to do quick e2e testing here:

/ # curl -sS https://raw.githubusercontent.com/netdata/kernel-collector/master/kernel-check.sh | bash -
Your kernel appears to be older than 4.11. This may still work in some cases, but probably won't.
Required kernel config options not found.
/ # uname -r
4.9.184-linuxkit

Upgrading to 2.2+ comes with a linuxkit that brings Kernel 4.19+, so this should work better for quick testing. I should note, however, that I doubt the linuxkit kernels used for Docker for Desktop on macOS / Windows have the required kernel options enabled to do proper e2e testing. But I'm trying to see if we can implement very fast validation that the stuff we're building actually works and at least can be loaded.
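For illustration, the version gate that kernel-check.sh reports above can be sketched in Python. This is not the script's actual logic, only a minimal sketch assuming the check compares the major and minor fields of `uname -r` against 4.11; the function names are hypothetical.

```python
import re


def parse_kernel_release(release):
    """Return (major, minor) from a `uname -r` string such as '4.9.184-linuxkit'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_at_least(release, minimum=(4, 11)):
    """True if the release is at or above the minimum (4.11 is taken from the script's message)."""
    # Tuples compare element by element, so (4, 9) < (4, 11) as intended,
    # which a plain string comparison of "4.9" and "4.11" would get wrong.
    return parse_kernel_release(release) >= minimum


print(kernel_at_least("4.9.184-linuxkit"))  # False: the Docker for Desktop <= 2.1 kernel shown above
```

Comparing integer tuples rather than strings matters here, since "4.9" sorts after "4.11" lexically.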

prologic avatar Feb 07 '20 02:02 prologic

So, a newer Kernel; but linuxkit, as I suspected, doesn't have the required Kernel options enabled:

/ # curl -sS https://raw.githubusercontent.com/netdata/kernel-collector/master/kernel-check.sh | bash -
Required kernel config options not found.

@thiagoftsm Is it going to be possible to do any kind of quick validation that this stuff works without having to spin up a full-blown VM with all the required Kernel options we want/need? The use case here is mostly CI: quick feedback that you haven't broken basic builds and basic loading across all the things we care about.

We will still have to build complete full VM based e2e testing as well; but that will be more involved.

prologic avatar Feb 07 '20 03:02 prologic

Yes, it is possible. I can isolate the code from the collector that loads an eBPF program and unloads it. If everything goes fine, we know it is running as we expect.

thiagoftsm avatar Feb 07 '20 04:02 thiagoftsm

Yes, it is possible. I can isolate the code from the collector that loads an eBPF program and unloads it. If everything goes fine, we know it is running as we expect.

I would probably do it here in this repo -- without the agent. Just a simple program that we can run very quickly to validate that the library and ebpf programs actually function (at all). The full e2e agent testing will have to be done on a set of real VM(s) anyway. Make sense?

prologic avatar Feb 07 '20 07:02 prologic

I agree. I will only copy the code from the agent to this test.

thiagoftsm avatar Feb 07 '20 12:02 thiagoftsm

I think your test program requires a real Machine with the proper Kernel options enabled to work 🤔

prologic avatar Feb 11 '20 04:02 prologic

I also managed to get it to segfault once :)

prologic avatar Feb 11 '20 04:02 prologic

I think your test program requires a real Machine with the proper Kernel options enabled to work 🤔

The requirements were written in issue https://github.com/netdata/netdata/issues/7771: it is necessary to have a kernel compiled with the option needed to enable KPROBE, and we have to mount debugfs and tracefs. When you return we can look at this.
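As a rough sketch of how those requirements could be checked before running the test program: the functions below take the text of `/proc/mounts` and of a kernel config file and report what is missing. The function names are hypothetical, and `CONFIG_KPROBES` is an assumption about which option is meant; the full list lives in the linked issue, not here.

```python
def mounted_fs_types(mounts_text):
    """Collect filesystem types from /proc/mounts-style text (third field of each line)."""
    types = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            types.add(fields[2])
    return types


def enabled_config_options(config_text):
    """Collect options set to 'y' in kernel .config-style text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and line.endswith("=y"):
            enabled.add(line[: -len("=y")])
    return enabled


def missing_requirements(
    mounts_text,
    config_text,
    required_fs=("debugfs", "tracefs"),    # from the comment above
    required_options=("CONFIG_KPROBES",),  # assumed option name, see lead-in
):
    """Return the required filesystems and config options that were not found."""
    fs_types = mounted_fs_types(mounts_text)
    options = enabled_config_options(config_text)
    missing = [fs for fs in required_fs if fs not in fs_types]
    missing += [opt for opt in required_options if opt not in options]
    return missing
```

On a real machine the inputs would come from reading `/proc/mounts` and whichever kernel config file the distribution ships; where that file lives varies, so it is left to the caller.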

I also managed to get it to segfault once :)

I am glad that this is not the collector code, but I want to see more details about this error you got lately.

thiagoftsm avatar Feb 11 '20 11:02 thiagoftsm

The requirements were written in issue netdata/netdata#7771: it is necessary to have a kernel compiled with the option needed to enable KPROBE, and we have to mount debugfs and tracefs. When you return we can look at this.

👍 yeah I remember :) Just trying to figure out the best approach here with a good balance of effort vs. correctness.

prologic avatar Feb 11 '20 12:02 prologic

I am glad that this is not the collector code, but I want to see more details about this error you got lately.

I can repro it when I wake up :)

prologic avatar Feb 11 '20 12:02 prologic

I was just testing the libraries built in our CI, and I found that the final artifacts.zip contains a symbolic link instead of a shared library.
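One way a CI step could catch this kind of packaging mistake is to scan the archive for members stored as symbolic links. The sketch below uses Python's standard `zipfile` and `stat` modules; the function name is hypothetical, and it assumes the archive was created on a Unix-like system so that the Unix mode is present in each entry's external attributes.

```python
import stat
import zipfile


def symlink_entries(archive):
    """Return the names of zip members whose stored Unix mode marks them as symlinks.

    `archive` may be a path or a binary file-like object.
    """
    names = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            # For archives written on Unix, the upper 16 bits of
            # external_attr hold the file mode, including the type bits.
            mode = info.external_attr >> 16
            if stat.S_ISLNK(mode):
                names.append(info.filename)
    return names
```

A CI job could fail the build whenever `symlink_entries("artifacts.zip")` returns a non-empty list.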

thiagoftsm avatar Feb 11 '20 19:02 thiagoftsm

I was just testing the libraries built in our CI, and I found that the final artifacts.zip contains a symbolic link instead of a shared library.

Did you have a PR up to fix this?

prologic avatar Feb 12 '20 01:02 prologic

We need to work on this next.

prologic avatar Feb 24 '20 03:02 prologic

As labeled by @thiagoftsm already, I am halting work on this specifically, as the effort has become too large to do in one shot. I have scattered bits and pieces of "things" but nothing that would be a single PR to solve this.

I'll spend some time breaking this down further later on in the week.

prologic avatar Mar 02 '20 22:03 prologic

We still need to do this, and it adds a lot of value as prep work for doing similar things for the NetData Agent. cc @Ferroin let's discuss this between us tomorrow. My goal here would be to find the minimal viable amount of work we can do to get this done without it being an EPIC, then close it out and iterate.

prologic avatar Mar 05 '20 01:03 prologic