High CPU usage and scaphandre not working
Bug description
I am running scaphandre with the Helm chart from this repo. I am unable to get metrics, and when I look at scaphandre's CPU usage, it is very high.
In the logs, with some debug tracing enabled, I see the following panic:
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: "PoisonError { inner: .. }"', src/exporters/prometheus.rs:231:60
stack backtrace:
0: rust_begin_unwind
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:483
1: core::panicking::panic_fmt
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/core/src/panicking.rs:85
2: core::option::expect_none_failed
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/core/src/option.rs:1234
3: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
4: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_catch
5: <hyper::server::conn::upgrades::UpgradeableConnection<I,S,E> as core::future::future::Future>::poll
6: <hyper::server::conn::spawn_all::NewSvcTask<I,N,S,E,W> as core::future::future::Future>::poll
7: tokio::runtime::task::core::CoreStage<T>::poll
8: tokio::runtime::task::harness::Harness<T,S>::poll
9: std::thread::local::LocalKey<T>::with
10: tokio::runtime::thread_pool::worker::Context::run_task
11: tokio::runtime::thread_pool::worker::Context::run
12: tokio::macros::scoped_tls::ScopedKey<T>::set
13: tokio::runtime::thread_pool::worker::run
14: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
15: tokio::runtime::task::harness::Harness<T,S>::poll
16: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
I also tried running scaphandre directly with docker run on the Kubernetes host (without the --containers option) and the metrics work.
Hi @fcomte, since it works without the --containers option, I think the errors and CPU usage come from failures matching processes to pods.
For one of the scaphandre containers can you post its cgroup file? The metric has a label with the pid.
scaph_process_power_consumption_microwatts{exe="scaphandre"}
scaph_process_power_consumption_microwatts{app_kubernetes_io_managed_by="Helm", app_kubernetes_io_name="scaphandre", cmdline="/usr/local/bin/scaphandreprometheus", exe="scaphandre", instance="192.168.202.134:8080", job="kubernetes-service-endpoints", namespace="default", node="scaph-test-control-plane-vcmhv", pid="56528", service="scaphandre"}
The format of the file depends on the cgroup driver being used by the kubelet.
cat /proc/56528/cgroup
13:devices:/system.slice/containerd.service/kubepods-burstable-pod97f45548_f07e_40d4_bc2e_102c4e90e95c.slice:cri-containerd:dbe41a89c4a076ac0fc543ba336929aeb2101a932f0b40d3a6a27fe922e221c1
...
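To illustrate the matching: scaphandre has to map the pid from the metric label to a pod by reading /proc/<pid>/cgroup and extracting the pod UID from a slice name like the one above. A minimal sketch of that idea, assuming the systemd-style naming shown here (pod_uid_for_pid is a hypothetical helper, not scaphandre's actual code):

use std::fs;

// Illustration only, not scaphandre's implementation: read /proc/<pid>/cgroup
// and pull the pod UID out of a systemd-style slice name such as
// "kubepods-burstable-pod97f45548_f07e_40d4_bc2e_102c4e90e95c.slice".
fn pod_uid_for_pid(pid: u32) -> Option<String> {
    let content = fs::read_to_string(format!("/proc/{}/cgroup", pid)).ok()?;
    for line in content.lines() {
        if let Some(start) = line.find("-pod") {
            let rest = &line[start + 4..];
            if let Some(end) = rest.find(".slice") {
                // Systemd slice names encode the UID with '_' instead of '-'.
                return Some(rest[..end].replace('_', "-"));
            }
        }
    }
    None
}

fn main() {
    // 56528 is the pid from the metric label above.
    match pod_uid_for_pid(56528) {
        Some(uid) => println!("pod UID: {}", uid),
        None => println!("no pod UID found, matching would fail for this process"),
    }
}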
This may be fixed by https://github.com/hubblo-org/scaphandre/pull/146. If you want to try it out, you could edit the daemonset to use rossf7/scaphandre:kubelet-systemd-group-driver, which is the image I built to test the fix.
Hope that helps!
Thx @rossf7.
I have this format:
root@worker3:~# cat /proc/17081/cgroup
11:devices:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9add50d0_5889_4175_8acf_767d7e690c1c.slice/docker-d3ddba91db4c6518ec04834fc0597f6945033de3ea4eefe638b276ee7734318a.scope
10:cpuset:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9add50d0_5889_4175_8acf_767d7e690c1c.slice/docker-d3ddba91db4c6518ec04834fc0597f6945033de3ea4eefe638b276ee7734318a.scope
9:blkio:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9add50d0_5889_4175_8acf_767d7e690c1c.slice/docker-d3ddba91db4c6518ec04834fc0597f6945033de3ea4eefe638b276ee7734318a.scope
8:perf_event:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9add50d0_5889_4175_8acf_767d7e690c1c.slice/docker-d3ddba91db4c6518ec04834fc0597f6945033de3ea4eefe638b276ee7734318a.scope
7:memory:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9add50d0_5889_4175_8acf_767d7e690c1c.slice/docker-d3ddba91db4c6518ec04834fc0597f6945033de3ea4eefe638b276ee7734318a.scope
6:net_cls,net_prio:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9add50d0_5889_4175_8acf_767d7e690c1c.slice/docker-d3ddba91db4c6518ec04834fc0597f6945033de3ea4eefe638b276ee7734318a.scope
5:freezer:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9add50d0_5889_4175_8acf_767d7e690c1c.slice/docker-d3ddba91db4c6518ec04834fc0597f6945033de3ea4eefe638b276ee7734318a.scope
4:rdma:/
3:pids:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9add50d0_5889_4175_8acf_767d7e690c1c.slice/docker-d3ddba91db4c6518ec04834fc0597f6945033de3ea4eefe638b276ee7734318a.scope
2:cpu,cpuacct:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9add50d0_5889_4175_8acf_767d7e690c1c.slice/docker-d3ddba91db4c6518ec04834fc0597f6945033de3ea4eefe638b276ee7734318a.scope
1:name=systemd:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9add50d0_5889_4175_8acf_767d7e690c1c.slice/docker-d3ddba91db4c6518ec04834fc0597f6945033de3ea4eefe638b276ee7734318a.scope
0::/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9add50d0_5889_4175_8acf_767d7e690c1c.slice/docker-d3ddba91db4c6518ec04834fc0597f6945033de3ea4eefe638b276ee7734318a.scope
@fcomte Thanks for checking. This format is supported, so it's something else.
Are there any other errors in the logs? If you haven't already, can you enable backtraces via the helm chart?
https://github.com/hubblo-org/scaphandre/blob/8273ced3ddf9b4d2ccaeb4047be81571aa5a9d9e/helm/scaphandre/values.yaml#L19
Running with RUST_BACKTRACE=full:
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: ParseIntError { kind: Empty }', src/exporters/mod.rs:653:77
stack backtrace:
0: 0x55b27a6f0c70 - std::backtrace_rs::backtrace::libunwind::trace::h72c2fb8038f1bbee
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/../../backtrace/src/backtrace/libunwind.rs:96
1: 0x55b27a6f0c70 - std::backtrace_rs::backtrace::trace_unsynchronized::h1e3b084883f1e78c
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/../../backtrace/src/backtrace/mod.rs:66
2: 0x55b27a6f0c70 - std::sys_common::backtrace::_print_fmt::h3bf6a7ebf7f0394a
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/sys_common/backtrace.rs:79
3: 0x55b27a6f0c70 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h2e8cb764b7fe02e7
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/sys_common/backtrace.rs:58
4: 0x55b27a713c4c - core::fmt::write::h7a1184eaee6a8644
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/core/src/fmt/mod.rs:1080
5: 0x55b27a6ea442 - std::io::Write::write_fmt::haeeb374d93a67eac
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/io/mod.rs:1516
6: 0x55b27a6f311d - std::sys_common::backtrace::_print::h1d14a7f6ad632dc8
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/sys_common/backtrace.rs:61
7: 0x55b27a6f311d - std::sys_common::backtrace::print::h301abac8bb2e3e81
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/sys_common/backtrace.rs:48
8: 0x55b27a6f311d - std::panicking::default_hook::{{closure}}::hde0cb80358a6920a
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:208
9: 0x55b27a6f2dc8 - std::panicking::default_hook::h9b1a691049a0ec8f
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:227
10: 0x55b27a6f3801 - std::panicking::rust_panic_with_hook::h2bdec87b60580584
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:577
11: 0x55b27a6f33a9 - std::panicking::begin_panic_handler::{{closure}}::h101ca09d9df5db47
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:484
12: 0x55b27a6f10dc - std::sys_common::backtrace::__rust_end_short_backtrace::h3bb85654c20113ca
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/sys_common/backtrace.rs:153
13: 0x55b27a6f3369 - rust_begin_unwind
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:483
14: 0x55b27a711871 - core::panicking::panic_fmt::h48c31e1e3d550146
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/core/src/panicking.rs:85
15: 0x55b27a711693 - core::option::expect_none_failed::h6154dc750ae47ade
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/core/src/option.rs:1234
16: 0x55b27a159640 - scaphandre::exporters::MetricGenerator::gen_all_metrics::h805c069f3624c9df
17: 0x55b27a0fc86f - <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll::h1d16697c95785191
18: 0x55b27a149d96 - hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_catch::h12fe01fff2a43e64
19: 0x55b27a195f7a - <hyper::server::conn::upgrades::UpgradeableConnection<I,S,E> as core::future::future::Future>::poll::h9da983696818059e
20: 0x55b27a15ac7f - <hyper::server::conn::spawn_all::NewSvcTask<I,N,S,E,W> as core::future::future::Future>::poll::h5b1abd10a6a24814
21: 0x55b27a1bd74b - tokio::runtime::task::core::CoreStage<T>::poll::h0ca9f963925687e3
22: 0x55b27a13ee16 - tokio::runtime::task::harness::Harness<T,S>::poll::h0b2baf25945c36ec
23: 0x55b27a656917 - std::thread::local::LocalKey<T>::with::hf0ac1e870b558692
24: 0x55b27a65d62c - tokio::runtime::thread_pool::worker::Context::run_task::hfbfd859a67cef510
25: 0x55b27a65cb25 - tokio::runtime::thread_pool::worker::Context::run::hb51a5504ab306cb3
26: 0x55b27a643f03 - tokio::macros::scoped_tls::ScopedKey<T>::set::had197133ee0bd498
27: 0x55b27a65c2a6 - tokio::runtime::thread_pool::worker::run::hfc4ab2b68c73de1b
28: 0x55b27a663034 - tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut::h125350de5086792d
29: 0x55b27a651648 - tokio::runtime::task::harness::Harness<T,S>::poll::h40adbd54e02c47b9
30: 0x55b27a64f8a1 - tokio::runtime::blocking::pool::Inner::run::hd58ba66ac24b9445
31: 0x55b27a63dfce - std::sys_common::backtrace::__rust_begin_short_backtrace::ha179501c99fffb67
32: 0x55b27a64c256 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h31f72b4d157db3d2
33: 0x55b27a6f736a - <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once::hbb39a3e615f69ef9
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/alloc/src/boxed.rs:1042
34: 0x55b27a6f736a - <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once::h79630a683aed732c
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/alloc/src/boxed.rs:1042
35: 0x55b27a6f736a - std::sys::unix::thread::Thread::new::thread_start::h4afaeade0da13617
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/sys/unix/thread.rs:87
36: 0x7f654e35d590 - start_thread
37: 0x7f654e5e0223 - clone
38: 0x0 - <unknown>
Thanks @fcomte, I've created https://github.com/hubblo-org/scaphandre/pull/173 to fix the parse int error.
There is a problem fetching the pod metadata, but with this fix it will retry.
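For context, the panic above comes from calling unwrap() on a failed integer parse when the value is empty. A minimal sketch of the safer pattern (parse_pid is a hypothetical helper, not the code from the PR), assuming the value being parsed is a pid read as a string:

// Sketch of handling the failure instead of unwrapping: an empty string
// yields ParseIntError { kind: Empty }, which previously killed the worker.
fn parse_pid(raw: &str) -> Option<i32> {
    match raw.trim().parse::<i32>() {
        Ok(pid) => Some(pid),
        Err(e) => {
            eprintln!("could not parse pid from {:?}: {}", raw, e);
            None
        }
    }
}

fn main() {
    assert_eq!(parse_pid("56528"), Some(56528));
    assert_eq!(parse_pid(""), None); // the case that triggered the panic
}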
Currently each scaphandre instance lists all pods in the cluster, but it only needs the pods on the current node. I'm not certain this is the reason pods can't be listed, but it would be more efficient to filter for just the current node.
@bpetit I'd like to work on this enhancement. I'll let you know how I get on.
Here is the PR to only list pods for the current node.
https://github.com/hubblo-org/scaphandre/pull/174
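For reference, the usual way to restrict the list call to one node is a field selector on spec.nodeName. A minimal sketch using the kube crate (which may differ from the client scaphandre actually uses), assuming the node name is injected into a NODE_NAME environment variable via the downward API:

use k8s_openapi::api::core::v1::Pod;
use kube::{api::{Api, ListParams}, Client};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::try_default().await?;
    // Assumed to be set through the downward API (fieldRef: spec.nodeName).
    let node_name = std::env::var("NODE_NAME")?;
    // Field selector so the API server only returns pods scheduled on this node,
    // instead of every pod in the cluster.
    let params = ListParams::default().fields(&format!("spec.nodeName={}", node_name));
    let pods: Api<Pod> = Api::all(client);
    for pod in pods.list(&params).await? {
        println!("{:?}", pod.metadata.name);
    }
    Ok(())
}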