smartctl_exporter does not notice if previous smartctl did not finish (hangs)

Open NiceGuyIT opened this issue 3 years ago • 2 comments

I ran into an incident recently where smartctl hung due to a bad disk. smartctl_exporter continued to spawn new smartctl processes to monitor the disk even though the previous run did not finish. smartctl_exporter should detect if the previous process finished before starting a new process. I don't know how many processes were spawned before I was aware of the situation. I'm guessing several hundred.

Here are the processes. I noticed that the PPID of the stuck ones is 1, not the PID of smartctl_exporter.

$ ps -ef | grep smartctl
root       343     1  0 20:33 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root       369     1  0 20:33 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root       374     1  0 20:33 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root       687     1  0 20:35 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root       702     1  0 20:35 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
...
root      5483     1  0 21:16 ?        00:00:00 /usr/local/bin/smartctl_exporter
root      5513  5483  0 21:16 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root      5607  5260  0 21:17 pts/2    00:00:00 smartctl -a /dev/sdc
root      6014  4750  0 21:20 pts/1    00:00:00 grep --color=auto smartctl
root     29674     1  0 20:12 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root     29675     1  0 20:12 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root     29682     1  0 20:12 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root     29684     1  0 20:13 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
...
root     32744     1  0 20:32 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root     32746     1  0 20:32 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root     32751     1  0 20:32 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc

It wasn't until I tried to stop the smartctl_exporter service that systemd tried to clean up the processes. Unfortunately, systemd could not kill the processes either.

Jul 04 21:14:05 sns5 kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jul 04 21:14:05 sns5 kernel: ata5.00: configured for UDMA/33
Jul 04 21:14:39 sns5 smartctl_exporter[1721]: [Warning] S.M.A.R.T. output reading error: exit status 4
Jul 04 21:14:39 sns5 smartctl_exporter[1721]: [Warning] Some SMART or other ATA command to the disk failed, or there was a checksum error in a SMART data structure
Jul 04 21:14:43 sns5 systemd[1]: Stopping smartctl exporter service...
Jul 04 21:15:08 sns5 kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jul 04 21:15:08 sns5 kernel: ata5.00: configured for UDMA/33
Jul 04 21:15:28 sns5 systemd-logind[984]: New session 12131 of user suseuser.
Jul 04 21:15:28 sns5 systemd[1]: Started Session 12131 of user suseuser.
Jul 04 21:15:28 sns5 sshd[5042]: pam_unix(sshd:session): session opened for user suseuser by (uid=0)
Jul 04 21:15:43 sns5 sudo[5259]: suseuser : TTY=pts/2 ; PWD=/root ; USER=root ; COMMAND=/bin/bash
Jul 04 21:15:43 sns5 sudo[5259]: pam_unix(sudo-i:session): session opened for user root by suseuser(uid=5000)
Jul 04 21:16:11 sns5 kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jul 04 21:16:11 sns5 kernel: ata5.00: configured for UDMA/33
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: State 'final-sigterm' timed out. Killing.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 29674 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 29675 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 29682 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 29684 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 29688 (smartctl) with signal SIGKILL.
...
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 4741 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 4962 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Failed with result 'timeout'.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 29674 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 29675 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 29682 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 29684 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 29688 (smartctl) remains running after unit stopped.
...
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 4542 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 4547 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 4741 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 4962 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: Stopped smartctl exporter service.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Found left-over process 29674 (smartctl) in control group while starting unit. Ignoring.
Jul 04 21:16:14 sns5 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Found left-over process 29675 (smartctl) in control group while starting unit. Ignoring.
Jul 04 21:16:14 sns5 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Found left-over process 29682 (smartctl) in control group while starting unit. Ignoring.
Jul 04 21:16:14 sns5 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Found left-over process 29684 (smartctl) in control group while starting unit. Ignoring.
Jul 04 21:16:14 sns5 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Found left-over process 29688 (smartctl) in control group while starting unit. Ignoring.

What would I like to see happen?

It would be nice if smartctl_exporter checked whether the previous smartctl process had exited before spawning a new one. Nothing can be done about smartctl hanging on a bad disk, but we can prevent smartctl_exporter from making things worse.

NiceGuyIT, Aug 08 '22 16:08

It would also be nice to blacklist a device once it has hung smartctl more than some threshold number of times, and to set a metric, for example smartctl_device_blacklist, to 1. This would prevent issues like the one @NiceGuyIT described, let the exporter continue to monitor the healthy disks, and allow the bad disk to be handled via the blacklist metric.
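A rough sketch of how such a threshold could work, under stated assumptions: the type `hangTracker`, its methods, and the `device` label are hypothetical, and the metric line is printed by hand in the Prometheus text format rather than through the client library the exporter would use. Only the metric name `smartctl_device_blacklist` comes from the suggestion above.

```go
package main

import (
	"fmt"
	"sync"
)

// hangTracker counts consecutive hung runs per device and reports a device
// as blacklisted once the count reaches the threshold. Hypothetical type.
type hangTracker struct {
	mu        sync.Mutex
	threshold int
	hangs     map[string]int
}

func newHangTracker(threshold int) *hangTracker {
	return &hangTracker{threshold: threshold, hangs: make(map[string]int)}
}

// RecordHang notes that a smartctl run for the device did not finish.
func (t *hangTracker) RecordHang(device string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.hangs[device]++
}

// RecordSuccess resets the counter after a run that finished normally.
func (t *hangTracker) RecordSuccess(device string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.hangs, device)
}

// Blocked reports whether the device has reached the hang threshold.
func (t *hangTracker) Blocked(device string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.hangs[device] >= t.threshold
}

// MetricValue returns 1 for a blacklisted device and 0 otherwise.
func (t *hangTracker) MetricValue(device string) float64 {
	if t.Blocked(device) {
		return 1
	}
	return 0
}

func main() {
	tracker := newHangTracker(3)

	// Three hung runs in a row on /dev/sdc reach the threshold.
	for i := 0; i < 3; i++ {
		tracker.RecordHang("/dev/sdc")
	}
	tracker.RecordSuccess("/dev/sda")

	for _, dev := range []string{"/dev/sda", "/dev/sdc"} {
		if tracker.Blocked(dev) {
			// A real collector would skip spawning smartctl here.
		}
		fmt.Printf("smartctl_device_blacklist{device=%q} %g\n", dev, tracker.MetricValue(dev))
	}
}
```

Because a blacklisted device is no longer polled in this sketch, its counter never resets on its own; clearing it would need a restart or some explicit retry policy, which is a design choice left open here.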

k0ste, Aug 08 '22 18:08

FWIW, this feels like a pretty critical bug to me...

jantman, Mar 22 '23 09:03