smartctl_exporter
smartctl_exporter does not notice if previous smartctl did not finish (hangs)
I ran into an incident recently where smartctl hung due to a bad disk. smartctl_exporter continued to spawn new smartctl processes to monitor the disk even though the previous run did not finish. smartctl_exporter should detect if the previous process finished before starting a new process. I don't know how many processes were spawned before I was aware of the situation. I'm guessing several hundred.
Here are the processes. I noticed that the PPID of the stuck smartctl processes is 1, not the PID of smartctl_exporter.
$ ps -ef | grep smartctl
root 343 1 0 20:33 ? 00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root 369 1 0 20:33 ? 00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root 374 1 0 20:33 ? 00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root 687 1 0 20:35 ? 00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root 702 1 0 20:35 ? 00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
...
root 5483 1 0 21:16 ? 00:00:00 /usr/local/bin/smartctl_exporter
root 5513 5483 0 21:16 ? 00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root 5607 5260 0 21:17 pts/2 00:00:00 smartctl -a /dev/sdc
root 6014 4750 0 21:20 pts/1 00:00:00 grep --color=auto smartctl
root 29674 1 0 20:12 ? 00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root 29675 1 0 20:12 ? 00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root 29682 1 0 20:12 ? 00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root 29684 1 0 20:13 ? 00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
...
root 32744 1 0 20:32 ? 00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root 32746 1 0 20:32 ? 00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root 32751 1 0 20:32 ? 00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
It wasn't until I tried to stop the smartctl_exporter service that systemd tried to clean up the processes. Unfortunately, systemd could not kill the processes either.
Jul 04 21:14:05 sns5 kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jul 04 21:14:05 sns5 kernel: ata5.00: configured for UDMA/33
Jul 04 21:14:39 sns5 smartctl_exporter[1721]: [Warning] S.M.A.R.T. output reading error: exit status 4
Jul 04 21:14:39 sns5 smartctl_exporter[1721]: [Warning] Some SMART or other ATA command to the disk failed, or there was a checksum error in a SMART data structure
Jul 04 21:14:43 sns5 systemd[1]: Stopping smartctl exporter service...
Jul 04 21:15:08 sns5 kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jul 04 21:15:08 sns5 kernel: ata5.00: configured for UDMA/33
Jul 04 21:15:28 sns5 systemd-logind[984]: New session 12131 of user suseuser.
Jul 04 21:15:28 sns5 systemd[1]: Started Session 12131 of user suseuser.
Jul 04 21:15:28 sns5 sshd[5042]: pam_unix(sshd:session): session opened for user suseuser by (uid=0)
Jul 04 21:15:43 sns5 sudo[5259]: suseuser : TTY=pts/2 ; PWD=/root ; USER=root ; COMMAND=/bin/bash
Jul 04 21:15:43 sns5 sudo[5259]: pam_unix(sudo-i:session): session opened for user root by suseuser(uid=5000)
Jul 04 21:16:11 sns5 kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jul 04 21:16:11 sns5 kernel: ata5.00: configured for UDMA/33
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: State 'final-sigterm' timed out. Killing.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 29674 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 29675 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 29682 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 29684 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 29688 (smartctl) with signal SIGKILL.
...
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 4741 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 4962 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Failed with result 'timeout'.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 29674 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 29675 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 29682 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 29684 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 29688 (smartctl) remains running after unit stopped.
...
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 4542 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 4547 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 4741 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 4962 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: Stopped smartctl exporter service.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Found left-over process 29674 (smartctl) in control group while starting unit. Ignoring.
Jul 04 21:16:14 sns5 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Found left-over process 29675 (smartctl) in control group while starting unit. Ignoring.
Jul 04 21:16:14 sns5 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Found left-over process 29682 (smartctl) in control group while starting unit. Ignoring.
Jul 04 21:16:14 sns5 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Found left-over process 29684 (smartctl) in control group while starting unit. Ignoring.
Jul 04 21:16:14 sns5 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Found left-over process 29688 (smartctl) in control group while starting unit. Ignoring.
What would I like to see happen?
It would be nice if smartctl_exporter checked whether the previous smartctl process for a device had exited before spawning a new one. Nothing can be done about smartctl hanging on a bad disk, but we can prevent smartctl_exporter from making things worse.
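To make the request concrete, here is a minimal sketch of a per-device "scan in flight" guard in Go. It is not the exporter's actual code: `inflightGuard`, `tryAcquire`, `release`, and `scanDevice` are hypothetical names, and the `scan` callback stands in for whatever runs `smartctl --json --xall <device>`. The check is deliberately non-blocking, because waiting on a hung smartctl would only move the hang into the exporter.

```go
package main

import (
	"fmt"
	"sync"
)

// inflightGuard (hypothetical) tracks which devices currently have a scan running.
type inflightGuard struct {
	mu   sync.Mutex
	busy map[string]bool
}

func newInflightGuard() *inflightGuard {
	return &inflightGuard{busy: make(map[string]bool)}
}

// tryAcquire marks device as busy and returns true, or returns false
// without blocking if an earlier scan of that device has not finished.
func (g *inflightGuard) tryAcquire(device string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.busy[device] {
		return false
	}
	g.busy[device] = true
	return true
}

// release clears the busy mark once a scan has returned.
func (g *inflightGuard) release(device string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.busy, device)
}

// scanDevice starts scan in the background only if no earlier scan of the
// same device is still running. It reports whether a new scan was started.
func (g *inflightGuard) scanDevice(device string, scan func(string)) bool {
	if !g.tryAcquire(device) {
		return false // previous smartctl has not exited; do not spawn another
	}
	go func() {
		defer g.release(device)
		scan(device)
	}()
	return true
}

func main() {
	g := newInflightGuard()
	hang := make(chan struct{}) // simulates a smartctl run that never returns

	started := g.scanDevice("/dev/sdc", func(string) { <-hang })
	skipped := !g.scanDevice("/dev/sdc", func(string) {})
	fmt.Println("first scan started:", started, "second scan skipped:", skipped)

	close(hang)
}
```

With a guard like this, a hung disk costs at most one stuck smartctl process per device instead of one per scrape.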
It would also be nice to blacklist a device once it has hung smartctl more than some threshold number of times, and to expose a metric for it, for example setting smartctl_device_blacklist to 1.
This would prevent issues like the one @NiceGuyIT described, let the exporter keep monitoring healthy disks, and let the bad disk be handled through the blacklist metric.
FWIW, this feels like a pretty critical bug to me...