collect minimal kernel dumps having kexec and makedumpfile tools in the rootfs
There is currently no way to properly analyze a kernel panic. This is an RFC which introduces the kexec-tools and makedumpfile tools into the EVE rootfs in order to have a minimal kernel dump in the /persist volume for further post-mortem analysis. The kernel dump does not include userspace pages, so there are no security issues (at least none that I'm aware of). The whole procedure is the following:
- When the kvm x86_64 kernel boots (no Xen or ARM support for now), the kexec syscall is invoked at an early boot stage (before containers) from the /etc/init.d/000-kexec init.d script.
- When the kernel panics and a capture kernel is booted, the same /etc/init.d/000-kexec init.d script is invoked and prevents further container starts, so that no docker or EVE services are running around.
- When the /persist volume is mounted, the final /etc/init.d/999-kdump init.d script is invoked, which collects a minimal kernel dump.
The whole kernel debug info is archived in kernel-debug.tar, including the vmlinux file. The kernel-debug.tar archive is part of the eve-kernel container, which on each EVE release can then be pushed to Docker Hub or kept locally for further kernel debugging.
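To make the 000-kexec step concrete, here is a minimal sketch of what such a hook could look like; the kernel path, the appended parameters and the crashkernel= reservation are assumptions for illustration, not the exact contents of this PR:

```sh
#!/bin/sh
# /etc/init.d/000-kexec (sketch): runs early, before any containers.
# Assumes the running kernel was booted with a crashkernel=... reservation
# and that /boot/kernel is the image to use as the capture kernel.
if [ -e /proc/vmcore ]; then
    # We are already inside the capture kernel after a panic: load nothing,
    # later scripts (999-kdump) will collect the dump from /proc/vmcore.
    exit 0
fi
# Load the capture kernel so the kernel jumps into it on panic.
kexec -p /boot/kernel --append="irqpoll nr_cpus=1 reset_devices" || true
```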
-- Roman
@mikem-zed It would be great if you could help review my PR. Thanks.
I am not 100% positive about the approach we are taking. This seems to be a good candidate for a design proposal on the wiki, or at least a documentation proposal first?
We have the current flow: device boots, it runs. It does not know if there was a previous kernel core dump or not. That may or may not be a good thing.
We have an issue: we want to be able to debug kernel core dumps. I don't disagree that it is a "good thing" ™️ , but do we actually have this case? Or are we planning for something that might happen?
There are several implicit flows and assumptions baked into this solution:
- we want debug tools (and possibly debug kernel) built into every eve-os image
- we want to debug kernel core dumps on the device itself (as opposed to shipping it off-device)
- we want eve-os devices that boot up to stop when they hit a previous kernel coredump and launch debugging tools
These may very well be the correct approach, or they may not. It would be much stronger to have a write-up (either a design proposal on lfedge wiki or as a doc to this repo) that describes what happens when kernel crashes, how flow changes, what impact it has on size and architecture, etc.
Then this PR just becomes implementation questions.
I am not 100% positive about the approach we are taking. This seems to be a good candidate for a design proposal on the wiki, or at least a documentation proposal first?
Nope, I need to analyse/reproduce/fix a bunch of customer kernel panics where the only evidence is iLO screenshots with corrupted backtraces, so I tried to come up with a generic approach which can be discussed sooner rather than later.
We have the current flow: device boots, it runs. It does not know if there was a previous kernel core dump or not. That may or may not be a good thing.
There is a /proc/vmcore file which indicates that we are in the capture kernel (i.e. the original kernel crashed).
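For illustration, detecting the capture kernel from a script boils down to a single check (a sketch, not the exact code in the PR):

```sh
# /proc/vmcore only exists when we booted via kexec after a panic.
if [ -e /proc/vmcore ]; then
    echo "capture kernel: previous kernel crashed, vmcore is available"
else
    echo "normal boot"
fi
```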
We have an issue: we want to be able to debug kernel core dumps. I don't disagree that it is a "good thing" ™️ , but do we actually have this case? Or are we planning for something that might happen?
As mentioned earlier, I have a bunch of customer kernel panics assigned, and the only way to debug them is to stare at iLO screenshots with corrupted backtraces (and only half of each backtrace at that, the other half is missing).
There are several implicit flows and assumptions baked into this solution:
- we want debug tools (and possibly debug kernel) built into every eve-os image
In this approach I went for the minimal possible change: I added two tools, kexec and makedumpfile. No "gdb", no "crash", and the kernel debug info is not included either.
- we want to debug kernel core dumps on the device itself (as opposed to shipping it off-device)
I don't want to :) I want to have a dump and debug it on my laptop; the minimal kernel dump can be provided by the customer. This is quite enough for analysis. Other handy tools can be installed on the EVE node on demand.
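As an illustration of that off-device workflow (the invocation is a sketch; the exact file names are assumptions):

```sh
# On the developer laptop: open the customer-provided dump against the
# debug kernel. vmlinux comes from the kernel-debug.tar shipped with the
# eve-kernel container, dump.compressed is the file saved in /persist.
crash vmlinux dump.compressed
```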
- we want eve-os devices that boot up to stop when they hit a previous kernel coredump and launch debugging tools
Not clear what devices you mean here. Once we are in the capture kernel nothing else starts, and this is fine: we collect a dump, output a big message on the screen (so the customer understands what has happened) and do nothing. This is not EVE anymore, no services, nothing, just a kernel and a prompt.
These may very well be the correct approach, or they may not. It would be much stronger to have a write-up (either a design proposal on lfedge wiki or as a doc to this repo) that describes what happens when kernel crashes, how flow changes, what impact it has on size and architecture, etc.
I can update the lfedge wiki, say with a page called "Kernel dump collection", and describe everything I put here.
-- Roman
@romanp-zed and I had a good discussion about this. I will summarize here, hopefully Roman can correct any errors.
We have a problem. The basic value proposition of eve-os is, "manage edge devices like you manage cloud devices." When an eve-os device hits a kernel panic, the device just hangs. Done. This then requires someone to visit the device and take manual steps to power-cycle it, let alone try to gather debug information. This in turn violates the eve-os promise.
We need to solve for "how do we manage eve-os devices when they hit kernel panics, such that they continue to be managed like cloud devices?" That is the larger issue, which @romanp-zed should put in the wiki in a design proposal, where it should be discussed until we have a consistent approach.
That design proposal includes multiple steps and partial solutions to round out the whole thing.
One obvious first step is this PR: capture the kernel dump into /persist, so that we have the option to analyze it in the future. I think that whatever the "big design", this will be part of the solution, and we need it now, so let's go for it.
That leaves open the question of how we do this.
I would not include debugging tools in the eve-os build unless absolutely unavoidable.
In terms of boot process, @romanp-zed described it to me as follows:
Current:
1. We have a kernel panic
2. Panic message and stack trace to console
3. Device hangs
Proposed:
1. We have a kernel panic
2. Panic message and stack trace to console
3. Automatically launch capture kernel
4. Capture kernel captures the core dump and saves it into /persist
5. Device hangs
Please comment if the above is correct.
The question raised by @romanp-zed was how to control the launch process. Normally, init gets started, which launches everything else. In this proposal, we need to have an init process, but it should just do almost nothing, certainly not launch runc, containerd, and our various onboot and services containers. What is the best way to achieve that?
Did I understand correctly?
In this proposal, we need to have an init process, but it should just do almost nothing, certainly not launch runc, containerd, and our various onboot and services containers. What is the best way to achieve that?
Small comment here: we still want to run storage-init from the onboot section, as it is required to mount /persist. Or we should re-implement the logic of storage-init somehow.
That is a good question.
@romanp-zed does the capture kernel actually run any kind of init?
We have an issue: we want to be able to debug kernel core dumps. I don't disagree that it is a "good thing" ™️ , but do we actually have this case? Or are we planning for something that might happen?
From my perspective we don't need to debug core dumps on the device. But we do want to get the fact that there was a kernel core dump (as opposed to a power failure, or hardware watchdog) into the logs, and extract the kernel stack trace from the core and log that.
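For the second part, a standard makedumpfile option can pull the panic backtrace out of the core without saving a full dump (an illustration of the idea, not necessarily what this PR implements; the output path is assumed):

```sh
# Extract the crashed kernel's dmesg ring buffer (including the panic
# backtrace) from the vmcore so it can be logged.
makedumpfile --dump-dmesg /proc/vmcore /persist/kdump/dmesg.txt
```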
An EVE developer might want to look at the core dump on the device, but I don't know how much disk space it would take to include the tools in the debug or edgeview containers. (FWIW, at some point in time we should externalize those containers from the EVE image to get a smaller image and more flexibility.)
I agree, it shouldn't be on the device. But the general, "here is what will happen when we have a coredump, here is how the device will behave, here is how we will (or will not) debug kernel coredumps" should be in a wiki design proposal.
Current:
1. We have a kernel panic
2. Panic message and stack trace to console
3. Device hangs
@deitch in step 3 the device reboots. Then EVE-OS determines that something unknown happened (not a triggered reboot), so it sets the bootReason to indicate that it was a hardware watchdog or kernel panic which brought it down.
in step 3 the device reboots. Then EVE-OS determines that something unknown happened (not a triggered reboot), so it sets the bootReason to indicate that it was a hardware watchdog or kernel panic which brought it down.
I thought I remembered something like that. @romanp-zed is that different than what we discussed?
In this proposal, we need to have an init process, but it should just do almost nothing, certainly not launch runc, containerd, and our various onboot and services containers. What is the best way to achieve that?
Small comment here: we still want to run storage-init from the onboot section, as it is required to mount /persist. Or we should re-implement the logic of storage-init somehow.
@giggsoff I would not touch the onboot section or storage-init, and would stop the whole boot procedure right after onboot (as it is in this PR). I don't know of any reason why we would need to reimplement storage-init/onboot (though I have a rather shallow understanding of the whole philosophy around it).
in step 3 the device reboots. Then EVE-OS determines that something unknown happened (not a triggered reboot), so it sets the bootReason to indicate that it was a hardware watchdog or kernel panic which brought it down.
I thought I remembered something like that. @romanp-zed is that different than what we discussed?
So indeed there is a configuration line in sysctl.conf which sets a reboot after 120 seconds in case of panic. But there is one panic assigned to me (a hardware problem) at an early stage of boot (the 8th second), before sysctl is invoked or the rootfs is mounted, which leads to a "device hang". And this panic is the source of my confusion. The only way to reboot a host on an early-boot panic is to provide a "panic=120" kernel command line from grub or to modify the CONFIG_PANIC_TIMEOUT (which is set to 0 now). So the "device hang" problem remains.
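For reference, the knobs mentioned above all set the same panic reboot timeout but take effect at different points in the boot (a summary, not a change made by this PR):

```sh
# CONFIG_PANIC_TIMEOUT=120   - kernel build config: the built-in default,
#                              in effect from the very first instruction
# panic=120                  - kernel command line (grub): parsed very early,
#                              covers panics before userspace starts
# kernel.panic = 120         - sysctl.conf: only effective once the sysctl
#                              container/script has run
```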
Yesterday I had a fruitful conversation with Eric regarding this PR and the whole idea is the following:
- A new container "kdump" (which collects a dump if a capture kernel is detected) comes strictly last in the onboot section.
- Once "kdump" is invoked and the dump is collected, the machine is rebooted in 120s.
With this we a) keep the same behaviour and reboot the host, and b) don't have to do nasty tricks to stop containerd from further execution.
So the rootfs.yaml is changed so that:
```diff
 onboot:
   - name: rngd
     image: RNGD_TAG
     command: ["/sbin/rngd", "-1"]
   - name: sysctl
     image: linuxkit/sysctl:v0.5
     binds:
       - /etc/sysctl.d:/etc/sysctl.d
     capabilities:
       - CAP_SYS_ADMIN
       - CAP_NET_ADMIN
   - name: modprobe
     image: linuxkit/modprobe:v0.5
     command: ["/bin/sh", "-c", "modprobe -a nct6775 w83627hf_wdt hpwdt wlcore_sdio wl18xx br_netfilter dwc3 rk808 rk808-regulator smsc75xx cp210x nicvf tpm_tis_spi rtc_rx8010 gpio_pca953x leds_siemens_ipc127 upboard-fpga pinctrl-upboard leds-upboard xhci_tegra 2>/dev/null || :"]
   - name: storage-init
     image: STORAGE_INIT_TAG
+  - name: kdump
+    image: lfedge/eve-kdump:<some-tag>
```
Where the kdump container does the following:
- if a coredump is detected save it to persist and reboot
- if none is detected, do nothing
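A minimal sketch of what such a kdump container entrypoint could do; the output path and the makedumpfile options are illustrative assumptions, not the final implementation:

```sh
#!/bin/sh
# kdump container entrypoint (sketch): runs last in the onboot section.
if [ -e /proc/vmcore ]; then
    # We are inside the capture kernel: save a minimal, compressed dump
    # that excludes free, cache and userspace pages, then reboot.
    mkdir -p /persist/kdump
    makedumpfile -c -d 31 /proc/vmcore \
        "/persist/kdump/dump.$(date +%Y%m%d-%H%M%S)"
    reboot -f
fi
# Normal boot: nothing to detect, let the rest of the boot continue.
exit 0
```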
How will that capture it? By the time the kernel panics, you are long past any kdump having started and exited. And wouldn't you still need to run kexec to get into the capture kernel?
I had originally thought you meant this:
- Normal kernel startup
- Normal init
- Normal onboot and services
- Eventually, kernel panic
- 120s later, reboot
- kdump container sees that we are post-panic and saves the coredump, then reboots again
But I realized that doesn't work; by the time we get to the kdump, we have rebooted and lost the coredump.
Is there anything we can change about the OS composition design that would enable this? This seems like a good use case. We already have an onboot (runs the following containers sequentially via runc on startup) and services (runs the following containers in parallel via containerd) and onshutdown (runs the following containers sequentially via runc on shutdown).
Is there a reasonable mechanism for onpanic which would be able to set up, at the start, "run these on a kernel panic"? I don't know if they could be containers, in case runc or one of its linked libraries is part of the problem, or there is a bug in kernel namespaces, etc. Any good ideas for that?
"panic=120" kernel command line from grub or to modify the CONFIG_PANIC_TIMEOUT (which is set to 0 now). So the "device hang" problem remains.
Seems like we need to set that kconfig.
How will that capture it? By the time the kernel panics, you are long past any kdump having started and exited. And wouldn't you still need to run kexec to get into the capture kernel?
@deitch there is still a kexec setup in pkg/dom0-ztools/rootfs/etc/init.d/000-kexec to make sure we run the crash kernel on panic.
@eriknordmark wrote:
there is still a kexec setup in pkg/dom0-ztools/rootfs/etc/init.d/000-kexec to make sure we run the crash kernel on panic.
Yeah, I missed that in here, makes sense.
So there is a part I still don't get. What does this all have to do with the various onboot and services containers? When we kexec into the capture kernel, nothing else will be started; we aren't going through a full init process.
Without the additional onboot container we came up with (not yet in the PR), it would start the service containers, and @rouming tried to prevent that.
That was what I don't get. Our boot process is something like:
1. BIOS/UEFI
2. normal kernel
3. init
4. onboot containers
5. service containers
6. everything is fine for a while
7. kernel panic
8. kexec capture kernel
9. capture kernel saves the coredump
10. reboot
onboot and service containers were launched way earlier (steps 4 and 5). Why would they launch again? Unless step 9 launches regular init? Why would it do that?
"panic=120" kernel command line from grub or to modify the CONFIG_PANIC_TIMEOUT (which is set to 0 now). So the "device hang" problem remains.
Seems like we need to set that kconfig.
As I remember we set it using sysctl as we still have the file from alpine: https://gitlab.alpinelinux.org/alpine/aports/-/blob/3.16-stable/main/alpine-baselayout/APKBUILD#L214
"panic=120" kernel command line from grub or to modify the CONFIG_PANIC_TIMEOUT (which is set to 0 now). So the "device hang" problem remains.
Seems like we need to set that kconfig.
As I remember we set it using sysctl as we still have the file from alpine: https://gitlab.alpinelinux.org/alpine/aports/-/blob/3.16-stable/main/alpine-baselayout/APKBUILD#L214
Yes exactly, but this is not enough if you crash before sysctl is invoked. Rare, but possible.
onboot and service containers were launched way earlier (steps 4 and 5). Why would they launch again? Unless step 9 launches regular init? Why would it do that?
My understanding is that step 9 boots the same thing as in step 2, thus step 3 etc. will follow. But I'll let @rouming clarify
onboot and service containers were launched way earlier (steps 4 and 5). Why would they launch again? Unless step 9 launches regular init? Why would it do that?
My understanding is that step 9 boots the same thing as in step 2, thus step 3 etc. will follow. But I'll let @rouming clarify
Let me put the documentation link here: https://docs.kernel.org/admin-guide/kdump/kdump.html
In very simple words: the kernel is just an application; imagine that application crashes, catches a segfault (panic in kernel terms) and execs into itself, repeating the whole procedure starting from the main function (boot in kernel terms). Hope this helps.
Does it? I had thought capture kernels were specifically supposed not to do so. Can we not configure it not to? Why would we want all init to run? Isn’t the assumption that something caused the panic, so we want the bare minimum to run, just enough to save data and then get out of there?
Does it? I had thought capture kernels were specifically supposed not to do so. Can we not configure it not to? Why would we want all init to run? Isn’t the assumption that something caused the panic, so we want the bare minimum to run, just enough to save data and then get out of there?
I like that attitude of throwing questions like a machine gun :) I'll be more concise: no, we can't, because the kernel needs a userspace entry point (init) to be executed.
Haha! Not machine gun. Single shot. I fire, you fire. Back and forth. Wear your Kevlar!
What do “full” or “normal” distros do? They cannot do a full startup cycle.
What do “full” or “normal” distros do? They cannot do a full startup cycle.
Why? You can restart everything, no problems here at all. It depends on the use case. In the EVE use case we need a minimal number of userspace processes running around; that's why we need to collect a dump ASAP, just after /persist is mounted, and then safely escape by rebooting.
I think you missed my point.
The point of a capture kernel is to, well, capture and then get out. If you restart the whole thing, you are likely to trigger the crash again, or not have enabled features you need, etc. So what do they do in normal distros?