collect minimal kernel dumps having kexec and makedumpfile tools in the rootfs
There is currently no way to properly analyze a kernel panic. This is an RFC which introduces the kexec-tools and makedumpfile tools into the EVE rootfs in order to have a minimal kernel dump in the /persist volume for further post-mortem analysis. The kernel dump does not include userspace pages, so there are no security issues (at least none that I'm aware of). The whole procedure is the following:
- When the kvm x86_64 kernel boots (no Xen or ARM support for now), the kexec syscall is invoked at an early boot stage (before containers) from the /etc/init.d/000-kexec init.d script.
- When the kernel panics and a capture kernel is booted, the same /etc/init.d/000-kexec init.d script is invoked and prevents further container starts, so that no docker or EVE services are running around.
- When the /persist volume is mounted, the final /etc/init.d/999-kdump init.d script is invoked, which collects a minimal kernel dump.
The whole kernel debug info is archived in kernel-debug.tar, including the vmlinux file. The kernel-debug.tar archive is part of the eve-kernel container, which on each EVE release can then be pushed to Docker Hub or kept locally for further kernel debugging.
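To make the 000-kexec step concrete, here is a minimal sketch of what such a hook could look like; the kernel path, the appended parameters and the crashkernel= reservation are assumptions for illustration, not the exact contents of this PR:

```sh
#!/bin/sh
# /etc/init.d/000-kexec (sketch): runs early, before any containers.
# Assumes the running kernel was booted with a crashkernel=... reservation
# and that /boot/kernel is the image to use as the capture kernel.
if [ -e /proc/vmcore ]; then
    # We are already inside the capture kernel after a panic: load nothing,
    # later scripts (999-kdump) will collect the dump from /proc/vmcore.
    exit 0
fi
# Load the capture kernel so the kernel jumps into it on panic.
kexec -p /boot/kernel --append="irqpoll nr_cpus=1 reset_devices" || true
```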
-- Roman
@mikem-zed It would be great if you could help review my PR. Thanks.
I am not 100% positive about the approach we are taking. This seems to be a good candidate for a design proposal on the wiki, or at least a documentation proposal first?
We have the current flow: device boots, it runs. It does not know if there was a previous kernel core dump or not. That may or may not be a good thing.
We have an issue: we want to be able to debug kernel core dumps. I don't disagree that it is a "good thing" ™️ , but do we actually have this case? Or are we planning for something that might happen?
There are several implicit flows and assumptions baked into this solution:
- we want debug tools (and possibly debug kernel) built into every eve-os image
- we want to debug kernel core dumps on the device itself (as opposed to shipping it off-device)
- we want eve-os devices that boot up to stop when they hit a previous kernel coredump and launch debugging tools
These may very well be the correct approach, or they may not. It would be much stronger to have a write-up (either a design proposal on lfedge wiki or as a doc to this repo) that describes what happens when kernel crashes, how flow changes, what impact it has on size and architecture, etc.
Then this PR just becomes implementation questions.
I am not 100% positive about the approach we are taking. This seems to be a good candidate for a design proposal on the wiki, or at least a documentation proposal first?
Nope, I need to analyse/reproduce/fix a bunch of customer kernel panics where the only evidence is iLO screenshots with corrupted backtraces, so I tried to come up with a generic approach which can be discussed sooner rather than later.
We have the current flow: device boots, it runs. It does not know if there was a previous kernel core dump or not. That may or may not be a good thing.
There is a /proc/vmcore file which indicates that we are in the capture kernel (i.e. the original kernel crashed).
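For illustration, detecting the capture kernel from a script boils down to a single check (a sketch, not the exact code in the PR):

```sh
# /proc/vmcore only exists when we booted via kexec after a panic.
if [ -e /proc/vmcore ]; then
    echo "capture kernel: previous kernel crashed, vmcore is available"
else
    echo "normal boot"
fi
```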
We have an issue: we want to be able to debug kernel core dumps. I don't disagree that it is a "good thing" ™️ , but do we actually have this case? Or are we planning for something that might happen?
As mentioned earlier, I have a bunch of customer kernel panics assigned, and the only way to debug them is to stare at iLO screenshots with corrupted backtraces (and only half of each backtrace at that, the other half is missing).
There are several implicit flows and assumptions baked into this solution:
- we want debug tools (and possibly debug kernel) built into every eve-os image
In this approach I went for the minimal possible change: I added two tools, kexec and makedumpfile. No "gdb", no "crash", and the kernel debug info is not included either.
- we want to debug kernel core dumps on the device itself (as opposed to shipping it off-device)
I don't want to :) I want to have a dump and debug it on my laptop; the minimal kernel dump can be provided by the customer. This is quite enough for analysis. Other handy tools can be installed on the EVE node on demand.
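As an illustration of that off-device workflow (the invocation is a sketch; the exact file names are assumptions):

```sh
# On the developer laptop: open the customer-provided dump against the
# debug kernel. vmlinux comes from the kernel-debug.tar shipped with the
# eve-kernel container, dump.compressed is the file saved in /persist.
crash vmlinux dump.compressed
```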
- we want eve-os devices that boot up to stop when they hit a previous kernel coredump and launch debugging tools
Not clear what devices you mean here. Once we are in the capture kernel nothing else starts, and this is fine: we collect a dump, output a big message on the screen (so the customer understands what has happened) and do nothing. This is not EVE anymore, no services, nothing, just a kernel and a prompt.
These may very well be the correct approach, or they may not. It would be much stronger to have a write-up (either a design proposal on lfedge wiki or as a doc to this repo) that describes what happens when kernel crashes, how flow changes, what impact it has on size and architecture, etc.
I can update the lfedge wiki, say with a page called "Kernel dump collection", and describe everything I put here.
-- Roman
@romanp-zed and I had a good discussion about this. I will summarize here, hopefully Roman can correct any errors.
We have a problem. The basic value proposition of eve-os is, "manage edge devices like you manage cloud devices." When an eve-os device hits a kernel panic, the device just hangs. Done. This then requires someone to visit the device and take manual steps to power-cycle it, let alone try to gather debug information. This in turn violates the eve-os promise.
We need to solve for "how do we manage eve-os devices when they hit kernel panics, such that they continue to be managed like cloud devices?" That is the larger issue, which @romanp-zed should put in the wiki in a design proposal, where it should be discussed until we have a consistent approach.
That design proposal includes multiple steps and partial solutions to round out the whole thing.
One obvious first step is this PR: capture the kernel dump into /persist, so that we have the option to analyze it in the future. I think that whatever the "big design", this will be part of the solution, and we need it now, so let's go for it.
That leaves open the question of how we do this.
I would not include debugging tools in the eve-os build unless absolutely unavoidable.
In terms of boot process, @romanp-zed described it to me as follows:
Current:
1. We have a kernel panic
2. Panic message and stack trace to console
3. Device hangs
Proposed:
1. We have a kernel panic
2. Panic message and stack trace to console
3. Automatically launch capture kernel
4. Capture kernel captures the core dump and saves it into /persist
5. Device hangs
Please comment if the above is correct.
The question raised by @romanp-zed was how to control the launch process. Normally, init gets started, which launches everything else. In this proposal, we need to have an init process, but it should just do almost nothing, certainly not launch runc, containerd, and our various onboot and services containers. What is the best way to achieve that?
Did I understand correctly?
In this proposal, we need to have an init process, but it should just do almost nothing, certainly not launch runc, containerd, and our various onboot and services containers. What is the best way to achieve that?
Small comment here: we still want to run storage-init from the onboot section, as it is required to mount /persist. Or we should re-implement the logic of storage-init somehow.
That is a good question.
@romanp-zed does the capture kernel actually run any kind of init?
We have an issue: we want to be able to debug kernel core dumps. I don't disagree that it is a "good thing" ™️ , but do we actually have this case? Or are we planning for something that might happen?
From my perspective we don't need to debug core dumps on the device. But we do want to get the fact that there was a kernel core dump (as opposed to a power failure, or hardware watchdog) into the logs, and extract the kernel stack trace from the core and log that.
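For the second part, a standard makedumpfile option can pull the panic backtrace out of the core without saving a full dump (an illustration of the idea, not necessarily what this PR implements; the output path is assumed):

```sh
# Extract the crashed kernel's dmesg ring buffer (including the panic
# backtrace) from the vmcore so it can be logged.
makedumpfile --dump-dmesg /proc/vmcore /persist/kdump/dmesg.txt
```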
An EVE developer might want to look at the core dump on the device, but I don't know how much disk space it would take to include the tools in the debug or edgeview containers. (FWIW, at some point in time we should externalize those containers from the EVE image to get a smaller image and more flexibility.)
I agree, it shouldn't be on the device. But the general, "here is what will happen when we have a coredump, here is how the device will behave, here is how we will (or will not) debug kernel coredumps" should be in a wiki design proposal.
Current:
1. We have a kernel panic
2. Panic message and stack trace to console
3. Device hangs
@deitch in step 3 the device reboots. Then EVE-OS determines that something unknown happened (not a triggered reboot), so it sets the bootReason to indicate that it was a hardware watchdog or kernel panic which brought it down.
in step 3 the device reboots. Then EVE-OS determines that something unknown happened (not a triggered reboot), so it sets the bootReason to indicate that it was a hardware watchdog or kernel panic which brought it down.
I thought I remembered something like that. @romanp-zed is that different than what we discussed?
In this proposal, we need to have an init process, but it should just do almost nothing, certainly not launch runc, containerd, and our various onboot and services containers. What is the best way to achieve that?
Small comment here: we still want to run storage-init from the onboot section, as it is required to mount /persist. Or we should re-implement the logic of storage-init somehow.
@giggsoff I would not touch the onboot section or storage-init, and would stop the whole boot procedure right after onboot (as it is in this PR). I don't know of any reason why we would need to reimplement storage-init/onboot (though I have a rather shallow understanding of the whole philosophy around it).
in step 3 the device reboots. Then EVE-OS determines that something unknown happened (not a triggered reboot), so it sets the bootReason to indicate that it was a hardware watchdog or kernel panic which brought it down.
I thought I remembered something like that. @romanp-zed is that different than what we discussed?
So indeed there is a configuration line in sysctl.conf which sets a reboot after 120 seconds in case of panic. But there is one panic assigned to me (a hardware problem) at an early stage of boot (the 8th second), before sysctl is invoked or the rootfs is mounted, which leads to a "device hang". And this panic is the source of my confusion. The only way to reboot a host on an early-boot panic is to provide a "panic=120" kernel command line from grub or to modify the CONFIG_PANIC_TIMEOUT (which is set to 0 now). So the "device hang" problem remains.
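For reference, the knobs mentioned above all set the same panic reboot timeout but take effect at different points in the boot (a summary, not a change made by this PR):

```sh
# CONFIG_PANIC_TIMEOUT=120   - kernel build config: the built-in default,
#                              in effect from the very first instruction
# panic=120                  - kernel command line (grub): parsed very early,
#                              covers panics before userspace starts
# kernel.panic = 120         - sysctl.conf: only effective once the sysctl
#                              container/script has run
```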
Yesterday I had a fruitful conversation with Eric regarding this PR and the whole idea is the following:
- A new container "kdump" (which collects a dump if a capture kernel is detected) comes strictly last in the onboot section.
- Once "kdump" is invoked and the dump is collected, the machine is rebooted in 120s.
With this we a) keep the same behaviour and reboot the host, and b) don't have to do nasty tricks to stop containerd from further execution.
So the rootfs.yaml is changed so that:
```diff
 onboot:
   - name: rngd
     image: RNGD_TAG
     command: ["/sbin/rngd", "-1"]
   - name: sysctl
     image: linuxkit/sysctl:v0.5
     binds:
       - /etc/sysctl.d:/etc/sysctl.d
     capabilities:
       - CAP_SYS_ADMIN
       - CAP_NET_ADMIN
   - name: modprobe
     image: linuxkit/modprobe:v0.5
     command: ["/bin/sh", "-c", "modprobe -a nct6775 w83627hf_wdt hpwdt wlcore_sdio wl18xx br_netfilter dwc3 rk808 rk808-regulator smsc75xx cp210x nicvf tpm_tis_spi rtc_rx8010 gpio_pca953x leds_siemens_ipc127 upboard-fpga pinctrl-upboard leds-upboard xhci_tegra 2>/dev/null || :"]
   - name: storage-init
     image: STORAGE_INIT_TAG
+  - name: kdump
+    image: lfedge/eve-kdump:<some-tag>
```
Where the kdump container does the following:
- if a coredump is detected save it to persist and reboot
- if none is detected, do nothing
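A minimal sketch of what such a kdump container entrypoint could do; the output path and the makedumpfile options are illustrative assumptions, not the final implementation:

```sh
#!/bin/sh
# kdump container entrypoint (sketch): runs last in the onboot section.
if [ -e /proc/vmcore ]; then
    # We are inside the capture kernel: save a minimal, compressed dump
    # that excludes free, cache and userspace pages, then reboot.
    mkdir -p /persist/kdump
    makedumpfile -c -d 31 /proc/vmcore \
        "/persist/kdump/dump.$(date +%Y%m%d-%H%M%S)"
    reboot -f
fi
# Normal boot: nothing to detect, let the rest of the boot continue.
exit 0
```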
How will that capture it? By the time the kernel panics, you are long past any kdump having started and exited. And wouldn't you still need to run kexec to get into the capture kernel?
I had originally thought you meant this:
- Normal kernel startup
- Normal init
- Normal onboot and services
- Eventually, kernel panic
- 120s later, reboot
- kdump container sees that we are post-panic and saves the coredump, then reboots again
But I realized that doesn't work; by the time we get to the kdump, we have rebooted and lost the coredump.
Is there anything we can change about the OS composition design that would enable this? This seems like a good use case. We already have an onboot (runs the following containers sequentially via runc on startup) and services (runs the following containers in parallel via containerd) and onshutdown (runs the following containers sequentially via runc on shutdown).
Is there a reasonable mechanism for onpanic which would be able to set up, at the start, "run these on a kernel panic"? I don't know if they could be containers, in case runc or one of its linked libraries is part of the problem, or there is a bug in kernel namespaces, etc. Any good ideas for that?
"panic=120" kernel command line from grub or to modify the CONFIG_PANIC_TIMEOUT (which is set to 0 now). So the "device hang" problem remains.
Seems like we need to set that kconfig.
How will that capture it? By the time the kernel panics, you are long past any kdump having started and exited. And wouldn't you still need to run kexec to get into the capture kernel?
@deitch there is still a kexec setup in pkg/dom0-ztools/rootfs/etc/init.d/000-kexec to make sure we run the crash kernel on panic.
@eriknordmark wrote:
there is still a kexec setup in pkg/dom0-ztools/rootfs/etc/init.d/000-kexec to make sure we run the crash kernel on panic.
Yeah, I missed that in here, makes sense.
So there is a part I still don't get. What does this all have to do with the various onboot and services containers? When we kexec into the capture kernel, nothing else will be started; we aren't going through a full init process.
Without the additional onboot container we came up with (not yet in the PR), it would start the service containers, and @rouming tried to prevent that.
That was what I don't get. Our boot process is something like:
1. BIOS/UEFI
2. normal kernel
3. init
4. onboot containers
5. service containers
6. everything is fine for a while
7. kernel panic
8. kexec capture kernel
9. capture kernel saves the coredump
10. reboot
onboot and service containers were launched way earlier (steps 4 and 5). Why would they launch again? Unless step 9 launches regular init? Why would it do that?
"panic=120" kernel command line from grub or to modify the CONFIG_PANIC_TIMEOUT (which is set to 0 now). So the "device hang" problem remains.
Seems like we need to set that kconfig.
As I remember we set it using sysctl as we still have the file from alpine: https://gitlab.alpinelinux.org/alpine/aports/-/blob/3.16-stable/main/alpine-baselayout/APKBUILD#L214
"panic=120" kernel command line from grub or to modify the CONFIG_PANIC_TIMEOUT (which is set to 0 now). So the "device hang" problem remains.
Seems like we need to set that kconfig.
As I remember we set it using sysctl as we still have the file from alpine: https://gitlab.alpinelinux.org/alpine/aports/-/blob/3.16-stable/main/alpine-baselayout/APKBUILD#L214
Yes exactly, but this is not enough if you crash before sysctl is invoked. Rare, but possible.
onboot and service containers were launched way earlier (steps 4 and 5). Why would they launch again? Unless step 9 launches regular init? Why would it do that?
My understanding is that step 9 boots the same thing as in step 2, thus step 3 etc. will follow. But I'll let @rouming clarify
onboot and service containers were launched way earlier (steps 4 and 5). Why would they launch again? Unless step 9 launches regular init? Why would it do that?
My understanding is that step 9 boots the same thing as in step 2, thus step 3 etc. will follow. But I'll let @rouming clarify
Let me put the documentation link here: https://docs.kernel.org/admin-guide/kdump/kdump.html
In very simple words: the kernel is just an application; imagine that application crashes, catches a segfault (panic in kernel terms) and execs into itself, repeating the whole procedure starting from the main function (boot in kernel terms). Hope this helps.
Does it? I had thought capture kernels were specifically supposed not to do so. Can we not configure it not to? Why would we want all init to run? Isn’t the assumption that something caused the panic, so we want the bare minimum to run, just enough to save data and then get out of there?
Does it? I had thought capture kernels were specifically supposed not to do so. Can we not configure it not to? Why would we want all init to run? Isn’t the assumption that something caused the panic, so we want the bare minimum to run, just enough to save data and then get out of there?
I like that attitude of throwing questions like a machine gun :) I'll be more concise: no, we can't, because the kernel needs a userspace entry point (init) to be executed.
Haha! Not machine gun. Single shot. I fire, you fire. Back and forth. Wear your Kevlar!
What do “full” or “normal” distros do? They cannot do a full startup cycle.
What do “full” or “normal” distros do? They cannot do a full startup cycle.
Why? You can restart everything, no problems here at all. It depends on the use case. In the EVE use case we need a minimal number of userspace processes running around; that's why we need to collect a dump ASAP, just after /persist is mounted, and then safely escape by rebooting.
I think you missed my point.
The point of a capture kernel is to, well, capture and then get out. If you restart the whole thing, you are likely to trigger the crash again, or not have enabled features you need, etc. So what do they do in normal distros?