Random failure in Ubuntu Linux-based custom container
Bug description
To start off, this bug appears to be very similar to #380, possibly to the point of being the same thing (which is why I gave it basically the same title) - the behavior is the same, but the backtrace I'm seeing and my deployment environment are slightly different. Anyway, I too am seeing random errors complaining about an unwrap on a None value at the same file/line:
root@image-explorer:/app# /usr/local/bin/scaphandre --no-header stdout -t3 -s2
scaphandre::sensors: Sysinfo sees 8
Measurement step is: 2s
thread 'main' panicked at src/sensors/utils.rs:177:18:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
root@image-explorer:/app# RUST_BACKTRACE=full !!
RUST_BACKTRACE=full /usr/local/bin/scaphandre --no-header stdout -t3 -s2
scaphandre::sensors: Sysinfo sees 8
Measurement step is: 2s
thread 'main' panicked at src/sensors/utils.rs:177:18:
called `Option::unwrap()` on a `None` value
stack backtrace:
0: 0x57944f29dbf2 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hffecb437d922f988
1: 0x57944f2c80ac - core::fmt::write::hd9a8d7d029f9ea1a
2: 0x57944f29ab9f - std::io::Write::write_fmt::h0e1226b2b8d973fe
3: 0x57944f29d9c4 - std::sys_common::backtrace::print::he907f6ad7eee41cb
4: 0x57944f29efdb - std::panicking::default_hook::{{closure}}::h3926193b61c9ca9b
5: 0x57944f29ed33 - std::panicking::default_hook::h25ba2457dea68e65
6: 0x57944f29f47d - std::panicking::rust_panic_with_hook::h0ad14d90dcf5224f
7: 0x57944f29f319 - std::panicking::begin_panic_handler::{{closure}}::h4a1838a06f542647
8: 0x57944f29e0c6 - std::sys_common::backtrace::__rust_end_short_backtrace::h77cc4dc3567ca904
9: 0x57944f29f084 - rust_begin_unwind
10: 0x57944ec57555 - core::panicking::panic_fmt::h940d4fd01a4b4fd1
11: 0x57944ec57613 - core::panicking::panic::h8ddd58dc57c2dc00
12: 0x57944ec574f6 - core::option::unwrap_failed::hf59153bb1e2fc334
13: 0x57944ec8f49f - scaphandre::exporters::MetricGenerator::gen_self_metrics::h954b2a30e12fd3e4
14: 0x57944ec97fda - scaphandre::exporters::MetricGenerator::gen_all_metrics::h83801832725d38eb
15: 0x57944ed33497 - scaphandre::exporters::stdout::StdoutExporter::iterate::h9618ea731418915b
16: 0x57944ed332a8 - <scaphandre::exporters::stdout::StdoutExporter as scaphandre::exporters::Exporter>::run::h8c6d61ad83c2efa1
17: 0x57944ec675c1 - scaphandre::main::hf7d485085ccc2078
18: 0x57944ec79453 - std::sys_common::backtrace::__rust_begin_short_backtrace::h5ccc291e7ca8831d
19: 0x57944ec77899 - std::rt::lang_start::{{closure}}::h5997e8809ce8e164
20: 0x57944f2951a3 - std::rt::lang_start_internal::h103c42a9c4e95084
21: 0x57944ec6c385 - main
22: 0x715f65bb0d90 - <unknown>
23: 0x715f65bb0e40 - __libc_start_main
24: 0x57944ec57c01 - _start
25: 0x0 - <unknown>
In the same fashion as the other bug report, sometimes it runs, though it still spits out those two scaphandre::sensors messages, which impact my ability to parse the output as JSON with jq:
root@image-explorer:/app# /usr/local/bin/scaphandre --no-header stdout -t3 -s2
scaphandre::sensors: Sysinfo sees 8
Measurement step is: 2s
scaphandre::sensors: Not enough records for socket
Host: 0 W from
package core uncore
Top 5 consumers:
Power PID Exe
No processes found yet or filter returns no value.
------------------------------------------------------------
Host: 3.319567 W from
package core uncore
Socket0 3.312835 W | 3.213493 W 0 W
...etc...
To Reproduce
In my case I'm also building a custom Docker image, so that I can export the results to MQTT. The only real changes I'm making to the Dockerfile (which I'll attach) are the inclusion of a few more debs and a custom entrypoint script. In my environment I'm running Microk8s, so I'll attach my deployment bundle as well; I pieced it together from my own knowledge plus bits from the Helm chart.
Expected behavior
Scaphandre runs with no error output and provides normal results.
Environment
- Linux distribution version: Ubuntu 22.04.5 (container), 24.04.2 (host)
- Kernel version (output of uname -r): 6.8.0-56-generic
- Microk8s version: v1.32.3
Additional context
pod.yaml.txt (it's actually a YAML file, but apparently GitHub doesn't support that filetype?) - This is just a simple pod definition that ignores the container entrypoint so you can jump in and run commands against it.
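For reference, the relevant part of that pod definition looks roughly like this - the image name is a placeholder for my custom build, and I've trimmed labels, namespace, and the other hostPath mounts (powercap/RAPL and such). Note the host /proc mounted over the container's /proc, which turns out to matter (more on that below):

```yaml
# Rough sketch of pod.yaml.txt (trimmed, placeholder image name).
apiVersion: v1
kind: Pod
metadata:
  name: scaphandre-debug
spec:
  containers:
    - name: scaphandre
      image: registry.example.com/scaphandre-mqtt:latest   # placeholder for my custom build
      command: ["sleep", "infinity"]   # override the entrypoint so I can exec in and run things by hand
      volumeMounts:
        - name: proc
          mountPath: /proc             # host /proc mounted over the container's /proc
  volumes:
    - name: proc
      hostPath:
        path: /proc
```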
Dockerfile.txt - I forgot to mention that I also bumped the Rust version to 1.78 since building the container with 1.74 was getting complaints from cargo-platform:
error: failed to compile `cargo-chef v0.1.71`, intermediate artifacts can be found at `/tmp/cargo-installAwsm5X`.
To reuse those artifacts with a future compilation, set the environment variable `CARGO_TARGET_DIR` to that path.
Caused by:
package `cargo-platform v0.1.9` cannot be built because it requires rustc 1.78 or newer, while the currently active rustc version is 1.74.1
Try re-running cargo install with `--locked`
It occurred to me that the difference between container and host OS might be a factor, so I rebuilt the image with the 24.04 Ubuntu base, but that made no difference - behavior is the same.
So looking at that file, it would seem that it's (occasionally) failing to get its own PID, and thus evaluating to None? Or it's getting its PID but isn't able to get info on it? With that in mind, I did try bumping the sysinfo crate to the latest available - 0.34.2 - but it looks like there were a number of changes there that would require some major work in Scaphandre to use it. So I tried the latest version that would compile, which was 0.28.4, and there was no change.
OK, I think I found the trouble - mounting the host's /proc over the container's /proc seems to cause confusion, and it doesn't appear to be standard practice compared with how others are doing it (they instead mount the host's /proc at, say, /host/proc in the container). Mounting proc over proc might be doing some PID clobbering. Instead of doing the mount, one can use hostPID in the pod config, which accomplishes what we're looking for in a k8s-friendly way, and seems to stop the occasional failures!
# kubectl explain Pod.spec.hostPID
KIND: Pod
VERSION: v1
FIELD: hostPID <boolean>
DESCRIPTION:
Use the host's pid namespace. Optional: Default to false.
Once I added that to the pod spec and removed the /proc mount, the failures went away.
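For anyone who lands here with the same problem, the working spec looks roughly like this (same trimmed sketch and placeholder image as above; other mounts unchanged):

```yaml
# Same sketch as above, with the fix applied: hostPID instead of
# mounting the host's /proc over the container's /proc.
apiVersion: v1
kind: Pod
metadata:
  name: scaphandre-debug
spec:
  hostPID: true                        # share the host's PID namespace
  containers:
    - name: scaphandre
      image: registry.example.com/scaphandre-mqtt:latest   # placeholder for my custom build
      command: ["sleep", "infinity"]
      # no proc volume/volumeMount anymore; the powercap/RAPL mounts stay as they were
```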
I would like to leave this open, however, to see if I can get some feedback on the warning messages that are being output - I can just 2>/dev/null them and be on my way, but I'm curious whether there's any benefit to fixing them, or whether they're just a side-effect of running in a container:
scaphandre::sensors: Sysinfo sees 4
scaphandre::sensors: Not enough records for socket
FWIW, if I run scaphandre on the host itself, the results there seem to be in agreement with what's reported from the container on the same host, so it at least doesn't appear to be missing anything.