High CPU usage and scaphandre not working
Bug description
I am running scaphandre with the Helm chart from this repo. I am unable to get metrics, and when I look at scaphandre's CPU usage, it is very high.
In the logs, with some debug tracing enabled, I see the following panic:
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: "PoisonError { inner: .. }"', src/exporters/prometheus.rs:231:60
stack backtrace:
0: rust_begin_unwind
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:483
1: core::panicking::panic_fmt
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/core/src/panicking.rs:85
2: core::option::expect_none_failed
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/core/src/option.rs:1234
3: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
4: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_catch
5: <hyper::server::conn::upgrades::UpgradeableConnection<I,S,E> as core::future::future::Future>::poll
6: <hyper::server::conn::spawn_all::NewSvcTask<I,N,S,E,W> as core::future::future::Future>::poll
7: tokio::runtime::task::core::CoreStage<T>::poll
8: tokio::runtime::task::harness::Harness<T,S>::poll
9: std::thread::local::LocalKey<T>::with
10: tokio::runtime::thread_pool::worker::Context::run_task
11: tokio::runtime::thread_pool::worker::Context::run
12: tokio::macros::scoped_tls::ScopedKey<T>::set
13: tokio::runtime::thread_pool::worker::run
14: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
15: tokio::runtime::task::harness::Harness<T,S>::poll
16: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
I also tried running scaphandre directly with docker run on the Kubernetes host (without the --containers option) and the metrics work.
Hi @fcomte, since it works without the --containers option, I think the errors and CPU usage come from failures matching processes to pods.
For one of the scaphandre containers can you post its cgroup file? The metric has a label with the pid.
scaph_process_power_consumption_microwatts{exe="scaphandre"}
scaph_process_power_consumption_microwatts{app_kubernetes_io_managed_by="Helm", app_kubernetes_io_name="scaphandre", cmdline="/usr/local/bin/scaphandreprometheus", exe="scaphandre", instance="192.168.202.134:8080", job="kubernetes-service-endpoints", namespace="default", node="scaph-test-control-plane-vcmhv", pid="56528", service="scaphandre"}
The format of the file depends on the cgroup driver being used by the kubelet.
cat /proc/56528/cgroup
13:devices:/system.slice/containerd.service/kubepods-burstable-pod97f45548_f07e_40d4_bc2e_102c4e90e95c.slice:cri-containerd:dbe41a89c4a076ac0fc543ba336929aeb2101a932f0b40d3a6a27fe922e221c1
...
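To illustrate the matching: scaphandre has to map the pid from the metric label to a pod by reading /proc/<pid>/cgroup and extracting the pod UID from a slice name like the one above. A minimal sketch of that idea, assuming the systemd-style naming shown here (pod_uid_for_pid is a hypothetical helper, not scaphandre's actual code):

use std::fs;

// Illustration only, not scaphandre's implementation: read /proc/<pid>/cgroup
// and pull the pod UID out of a systemd-style slice name such as
// "kubepods-burstable-pod97f45548_f07e_40d4_bc2e_102c4e90e95c.slice".
fn pod_uid_for_pid(pid: u32) -> Option<String> {
    let content = fs::read_to_string(format!("/proc/{}/cgroup", pid)).ok()?;
    for line in content.lines() {
        if let Some(start) = line.find("-pod") {
            let rest = &line[start + 4..];
            if let Some(end) = rest.find(".slice") {
                // Systemd slice names encode the UID with '_' instead of '-'.
                return Some(rest[..end].replace('_', "-"));
            }
        }
    }
    None
}

fn main() {
    // 56528 is the pid from the metric label above.
    match pod_uid_for_pid(56528) {
        Some(uid) => println!("pod UID: {}", uid),
        None => println!("no pod UID found, matching would fail for this process"),
    }
}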
This may be fixed by https://github.com/hubblo-org/scaphandre/pull/146. If you want to try it out, you could edit the daemonset to use rossf7/scaphandre:kubelet-systemd-group-driver, which is the image I built to test the fix.
Hope that helps!
Thx @rossf7.
I have this format:
root@worker3:~# cat /proc/17081/cgroup
11:devices:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9add50d0_5889_4175_8acf_767d7e690c1c.slice/docker-d3ddba91db4c6518ec04834fc0597f6945033de3ea4eefe638b276ee7734318a.scope
10:cpuset:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9add50d0_5889_4175_8acf_767d7e690c1c.slice/docker-d3ddba91db4c6518ec04834fc0597f6945033de3ea4eefe638b276ee7734318a.scope
9:blkio:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9add50d0_5889_4175_8acf_767d7e690c1c.slice/docker-d3ddba91db4c6518ec04834fc0597f6945033de3ea4eefe638b276ee7734318a.scope
8:perf_event:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9add50d0_5889_4175_8acf_767d7e690c1c.slice/docker-d3ddba91db4c6518ec04834fc0597f6945033de3ea4eefe638b276ee7734318a.scope
7:memory:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9add50d0_5889_4175_8acf_767d7e690c1c.slice/docker-d3ddba91db4c6518ec04834fc0597f6945033de3ea4eefe638b276ee7734318a.scope
6:net_cls,net_prio:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9add50d0_5889_4175_8acf_767d7e690c1c.slice/docker-d3ddba91db4c6518ec04834fc0597f6945033de3ea4eefe638b276ee7734318a.scope
5:freezer:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9add50d0_5889_4175_8acf_767d7e690c1c.slice/docker-d3ddba91db4c6518ec04834fc0597f6945033de3ea4eefe638b276ee7734318a.scope
4:rdma:/
3:pids:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9add50d0_5889_4175_8acf_767d7e690c1c.slice/docker-d3ddba91db4c6518ec04834fc0597f6945033de3ea4eefe638b276ee7734318a.scope
2:cpu,cpuacct:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9add50d0_5889_4175_8acf_767d7e690c1c.slice/docker-d3ddba91db4c6518ec04834fc0597f6945033de3ea4eefe638b276ee7734318a.scope
1:name=systemd:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9add50d0_5889_4175_8acf_767d7e690c1c.slice/docker-d3ddba91db4c6518ec04834fc0597f6945033de3ea4eefe638b276ee7734318a.scope
0::/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9add50d0_5889_4175_8acf_767d7e690c1c.slice/docker-d3ddba91db4c6518ec04834fc0597f6945033de3ea4eefe638b276ee7734318a.scope
@fcomte Thanks for checking. This format is supported, so it's something else.
Are there any other errors in the logs? If you haven't already, can you enable backtraces via the helm chart?
https://github.com/hubblo-org/scaphandre/blob/8273ced3ddf9b4d2ccaeb4047be81571aa5a9d9e/helm/scaphandre/values.yaml#L19
Running with RUST_BACKTRACE=full:
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: ParseIntError { kind: Empty }', src/exporters/mod.rs:653:77
stack backtrace:
0: 0x55b27a6f0c70 - std::backtrace_rs::backtrace::libunwind::trace::h72c2fb8038f1bbee
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/../../backtrace/src/backtrace/libunwind.rs:96
1: 0x55b27a6f0c70 - std::backtrace_rs::backtrace::trace_unsynchronized::h1e3b084883f1e78c
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/../../backtrace/src/backtrace/mod.rs:66
2: 0x55b27a6f0c70 - std::sys_common::backtrace::_print_fmt::h3bf6a7ebf7f0394a
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/sys_common/backtrace.rs:79
3: 0x55b27a6f0c70 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h2e8cb764b7fe02e7
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/sys_common/backtrace.rs:58
4: 0x55b27a713c4c - core::fmt::write::h7a1184eaee6a8644
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/core/src/fmt/mod.rs:1080
5: 0x55b27a6ea442 - std::io::Write::write_fmt::haeeb374d93a67eac
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/io/mod.rs:1516
6: 0x55b27a6f311d - std::sys_common::backtrace::_print::h1d14a7f6ad632dc8
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/sys_common/backtrace.rs:61
7: 0x55b27a6f311d - std::sys_common::backtrace::print::h301abac8bb2e3e81
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/sys_common/backtrace.rs:48
8: 0x55b27a6f311d - std::panicking::default_hook::{{closure}}::hde0cb80358a6920a
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:208
9: 0x55b27a6f2dc8 - std::panicking::default_hook::h9b1a691049a0ec8f
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:227
10: 0x55b27a6f3801 - std::panicking::rust_panic_with_hook::h2bdec87b60580584
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:577
11: 0x55b27a6f33a9 - std::panicking::begin_panic_handler::{{closure}}::h101ca09d9df5db47
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:484
12: 0x55b27a6f10dc - std::sys_common::backtrace::__rust_end_short_backtrace::h3bb85654c20113ca
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/sys_common/backtrace.rs:153
13: 0x55b27a6f3369 - rust_begin_unwind
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:483
14: 0x55b27a711871 - core::panicking::panic_fmt::h48c31e1e3d550146
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/core/src/panicking.rs:85
15: 0x55b27a711693 - core::option::expect_none_failed::h6154dc750ae47ade
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/core/src/option.rs:1234
16: 0x55b27a159640 - scaphandre::exporters::MetricGenerator::gen_all_metrics::h805c069f3624c9df
17: 0x55b27a0fc86f - <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll::h1d16697c95785191
18: 0x55b27a149d96 - hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_catch::h12fe01fff2a43e64
19: 0x55b27a195f7a - <hyper::server::conn::upgrades::UpgradeableConnection<I,S,E> as core::future::future::Future>::poll::h9da983696818059e
20: 0x55b27a15ac7f - <hyper::server::conn::spawn_all::NewSvcTask<I,N,S,E,W> as core::future::future::Future>::poll::h5b1abd10a6a24814
21: 0x55b27a1bd74b - tokio::runtime::task::core::CoreStage<T>::poll::h0ca9f963925687e3
22: 0x55b27a13ee16 - tokio::runtime::task::harness::Harness<T,S>::poll::h0b2baf25945c36ec
23: 0x55b27a656917 - std::thread::local::LocalKey<T>::with::hf0ac1e870b558692
24: 0x55b27a65d62c - tokio::runtime::thread_pool::worker::Context::run_task::hfbfd859a67cef510
25: 0x55b27a65cb25 - tokio::runtime::thread_pool::worker::Context::run::hb51a5504ab306cb3
26: 0x55b27a643f03 - tokio::macros::scoped_tls::ScopedKey<T>::set::had197133ee0bd498
27: 0x55b27a65c2a6 - tokio::runtime::thread_pool::worker::run::hfc4ab2b68c73de1b
28: 0x55b27a663034 - tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut::h125350de5086792d
29: 0x55b27a651648 - tokio::runtime::task::harness::Harness<T,S>::poll::h40adbd54e02c47b9
30: 0x55b27a64f8a1 - tokio::runtime::blocking::pool::Inner::run::hd58ba66ac24b9445
31: 0x55b27a63dfce - std::sys_common::backtrace::__rust_begin_short_backtrace::ha179501c99fffb67
32: 0x55b27a64c256 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h31f72b4d157db3d2
33: 0x55b27a6f736a - <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once::hbb39a3e615f69ef9
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/alloc/src/boxed.rs:1042
34: 0x55b27a6f736a - <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once::h79630a683aed732c
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/alloc/src/boxed.rs:1042
35: 0x55b27a6f736a - std::sys::unix::thread::Thread::new::thread_start::h4afaeade0da13617
at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/sys/unix/thread.rs:87
36: 0x7f654e35d590 - start_thread
37: 0x7f654e5e0223 - clone
38: 0x0 - <unknown>
Thanks @fcomte, I've created https://github.com/hubblo-org/scaphandre/pull/173 to fix the parse int error.
There is a problem fetching the pod metadata, but with this fix it will retry.
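For context, the panic above comes from calling unwrap() on a failed integer parse when the value is empty. A minimal sketch of the safer pattern (parse_pid is a hypothetical helper, not the code from the PR), assuming the value being parsed is a pid read as a string:

// Sketch of handling the failure instead of unwrapping: an empty string
// yields ParseIntError { kind: Empty }, which previously killed the worker.
fn parse_pid(raw: &str) -> Option<i32> {
    match raw.trim().parse::<i32>() {
        Ok(pid) => Some(pid),
        Err(e) => {
            eprintln!("could not parse pid from {:?}: {}", raw, e);
            None
        }
    }
}

fn main() {
    assert_eq!(parse_pid("56528"), Some(56528));
    assert_eq!(parse_pid(""), None); // the case that triggered the panic
}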
Currently each scaphandre instance lists all pods in the cluster, but it only needs the pods on the current node. I'm not certain this is the reason pods can't be listed, but it would be more efficient to filter for just the current node.
@bpetit I'd like to work on this enhancement. I'll let you know how I get on.
Here is the PR to only list pods for the current node.
https://github.com/hubblo-org/scaphandre/pull/174
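For reference, the usual way to restrict the list call to one node is a field selector on spec.nodeName. A minimal sketch using the kube crate (which may differ from the client scaphandre actually uses), assuming the node name is injected into a NODE_NAME environment variable via the downward API:

use k8s_openapi::api::core::v1::Pod;
use kube::{api::{Api, ListParams}, Client};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::try_default().await?;
    // Assumed to be set through the downward API (fieldRef: spec.nodeName).
    let node_name = std::env::var("NODE_NAME")?;
    // Field selector so the API server only returns pods scheduled on this node,
    // instead of every pod in the cluster.
    let params = ListParams::default().fields(&format!("spec.nodeName={}", node_name));
    let pods: Api<Pod> = Api::all(client);
    for pod in pods.list(&params).await? {
        println!("{:?}", pod.metadata.name);
    }
    Ok(())
}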