check_smart
Check for ATA errors
The idea is to inspect the ATA device error log, which is also part of the smartctl -a /dev/sda output. If the log contains any errors, some operations are already failing, so a non-zero error count is escalated to a WARNING.
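The check described above can be sketched roughly as follows. This is a minimal Python illustration, not the plugin's actual Perl code: the function names and the state mapping are assumptions, and only the "ATA Error Count" line format is taken from the smartctl output shown further down.

```python
import re

# Matches the summary line printed by "smartctl -l error", e.g.
# "ATA Error Count: 16 (device log contains only the most recent five errors)".
ATA_ERROR_COUNT_RE = re.compile(r"^ATA Error Count:\s*(\d+)", re.MULTILINE)


def ata_error_count(smartctl_output: str) -> int:
    """Return the logged ATA error count, or 0 if the line is absent."""
    match = ATA_ERROR_COUNT_RE.search(smartctl_output)
    return int(match.group(1)) if match else 0


def ata_error_state(count: int) -> str:
    """Hypothetical mapping: any logged error escalates to WARNING."""
    return "WARNING" if count > 0 else "OK"


sample = (
    "SMART Error Log Version: 1\n"
    "ATA Error Count: 16 (device log contains only the most recent five errors)\n"
)
print(ata_error_state(ata_error_count(sample)))
```

Treating a missing count line as zero is a design choice for this sketch; a real check might prefer to report UNKNOWN when the log cannot be read at all.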
The kernel log might contain errors like this:
kernel: ata3.00: configured for UDMA/133
kernel: sd 2:0:0:0: [sdb] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: sd 2:0:0:0: [sdb] tag#1 Sense Key : Medium Error [current]
kernel: sd 2:0:0:0: [sdb] tag#1 Add. Sense: Unrecovered read error - auto reallocate failed
kernel: sd 2:0:0:0: [sdb] tag#1 CDB: Read(10) 28 00 11 40 5b c8 00 00 08 00
kernel: blk_update_request: I/O error, dev sdb, sector 289430472
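To correlate kernel messages like these with a specific disk, the failing sectors can be collected per device. The sketch below is only an illustration with a hypothetical helper name; it keys on the "I/O error, dev ..., sector ..." part of the line, on the assumption that the prefix in front of it may differ between kernel versions.

```python
import re
from typing import Dict, List

# Matches block-layer I/O error lines such as
# "kernel: blk_update_request: I/O error, dev sdb, sector 289430472".
IO_ERROR_RE = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def failed_sectors(log_text: str) -> Dict[str, List[int]]:
    """Group failing sector numbers by block device name."""
    result: Dict[str, List[int]] = {}
    for device, sector in IO_ERROR_RE.findall(log_text):
        result.setdefault(device, []).append(int(sector))
    return result


log = "kernel: blk_update_request: I/O error, dev sdb, sector 289430472\n"
print(failed_sectors(log))
```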
From the smartctl man page:
-l TYPE, --log=TYPE
    Prints various device logs. The valid arguments to this option are:

    error - [ATA] prints the Summary SMART error log. SMART disks maintain a log of the most recent five non-trivial errors. For each of these errors, the disk power-on lifetime at which the error occurred is recorded, as is the device status (idle, standby, etc) at the time of the error. For some common types of errors, the Error Register (ER) and Status Register (SR) values are decoded and printed as text. The meanings of these are:

        ABRT:   Command ABoRTed
        AMNF:   Address Mark Not Found
        CCTO:   Command Completion Timed Out
        EOM:    End Of Media
        ICRC:   Interface Cyclic Redundancy Code (CRC) error
        IDNF:   IDentity Not Found
        ILI:    (packet command-set specific)
        MC:     Media Changed
        MCR:    Media Change Request
        NM:     No Media
        obs:    obsolete
        TK0NF:  TracK 0 Not Found
        UNC:    UNCorrectable Error in Data
        WP:     Media is Write Protected

    In addition, up to the last five commands that preceded the error are listed, along with a timestamp measured from the start of the corresponding power cycle. This is displayed in the form Dd+HH:MM:SS.msec where D is the number of days, HH is hours, MM is minutes, SS is seconds and msec is milliseconds. [Note: this time stamp wraps after 2^32 milliseconds, or 49 days 17 hours 2 minutes and 47.296 seconds.] The key ATA disk registers are also recorded in the log.

    The final column of the error log is a text-string description of the ATA command defined by the Command Register (CR) and Feature Register (FR) values. Commands that are obsolete in the most current spec are listed like this: READ LONG (w/ retry) [OBS-4], indicating that the command became obsolete with or in the ATA-4 specification. Similarly, the notation [RET-N] is used to indicate that a command was retired in the ATA-N specification. Some commands are not defined in any version of the ATA specification but are in common use nonetheless; these are marked [NS], meaning non-standard.
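The timestamp format and the wrap-around mentioned in the man page excerpt can be checked with a little arithmetic. The helper below is a hypothetical sketch, not smartctl code: it formats a millisecond counter in the Dd+HH:MM:SS.msec form and wraps at 2^32 ms. Dropping the day prefix when it is zero is an assumption based on how the timestamps appear in the sample output in this document.

```python
WRAP_MS = 2**32  # the counter wraps after 2^32 milliseconds


def format_powered_up_time(msec: int) -> str:
    """Format milliseconds since power-on as Dd+HH:MM:SS.msec."""
    msec %= WRAP_MS
    seconds, ms = divmod(msec, 1000)
    minutes, s = divmod(seconds, 60)
    hours, m = divmod(minutes, 60)
    days, h = divmod(hours, 24)
    # Assumption: omit the day prefix when it is zero.
    prefix = f"{days}d+" if days else ""
    return f"{prefix}{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


# Last value before the wrap: one millisecond short of
# 49 days 17 hours 2 minutes 47.296 seconds.
print(format_powered_up_time(WRAP_MS - 1))  # 49d+17:02:47.295
print(format_powered_up_time(54_662_789))   # 15:11:02.789
```

Because of the wrap, two timestamps in the log can only be compared reliably when they belong to the same power cycle and lie less than about 49.7 days apart.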
The actual output looks like this:
smartctl /dev/sdb -l error
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.10.7] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 16 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 16 occurred at disk power-on lifetime: 9377 hours (390 days + 17 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 c0 b0 92 94 40 Error: UNC at LBA = 0x009492b0 = 9736880
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 c0 b0 92 94 40 18 15:11:02.789 READ FPDMA QUEUED
60 08 b8 c8 92 94 40 17 15:11:02.789 READ FPDMA QUEUED
60 08 b0 90 92 94 40 16 15:11:02.789 READ FPDMA QUEUED
60 08 a8 60 13 94 40 15 15:11:02.789 READ FPDMA QUEUED
60 08 a0 f0 13 94 40 14 15:11:02.789 READ FPDMA QUEUED
Error 15 occurred at disk power-on lifetime: 9377 hours (390 days + 17 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 c0 50 90 94 40 Error: UNC at LBA = 0x00949050 = 9736272
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 c0 50 90 94 40 18 15:11:02.778 READ FPDMA QUEUED
61 00 b8 00 10 94 40 17 15:11:02.778 WRITE FPDMA QUEUED
61 e8 b0 40 15 74 40 16 15:11:02.778 WRITE FPDMA QUEUED
60 08 a8 20 94 94 40 15 15:11:02.778 READ FPDMA QUEUED
61 40 a0 00 10 74 40 14 15:11:02.778 WRITE FPDMA QUEUED
Error 14 occurred at disk power-on lifetime: 9377 hours (390 days + 17 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 d0 40 9b 94 40 Error: UNC at LBA = 0x00949b40 = 9739072
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 d0 40 9b 94 40 1a 15:11:02.777 READ FPDMA QUEUED
60 08 c8 30 9b 94 40 19 15:11:02.777 READ FPDMA QUEUED
60 08 c0 18 9b 94 40 18 15:11:02.777 READ FPDMA QUEUED
60 08 b8 08 9b 94 40 17 15:11:02.777 READ FPDMA QUEUED
60 08 b0 88 9b 94 40 16 15:11:02.777 READ FPDMA QUEUED
Error 13 occurred at disk power-on lifetime: 9377 hours (390 days + 17 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 38 a0 97 94 40 Error: UNC at LBA = 0x009497a0 = 9738144
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 38 a0 97 94 40 07 15:11:02.773 READ FPDMA QUEUED
60 08 30 b0 97 94 40 06 15:11:02.773 READ FPDMA QUEUED
60 08 28 d8 97 94 40 05 15:11:02.773 READ FPDMA QUEUED
61 08 18 e0 30 d0 40 03 15:11:02.773 WRITE FPDMA QUEUED
47 00 01 13 00 00 40 1e 15:11:02.773 READ LOG DMA EXT
Error 12 occurred at disk power-on lifetime: 9377 hours (390 days + 17 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 f0 70 97 94 40 Error: UNC at LBA = 0x00949770 = 9738096
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 f0 70 97 94 40 1e 15:11:02.771 READ FPDMA QUEUED
60 08 e8 d8 95 94 40 1d 15:11:02.771 READ FPDMA QUEUED
60 08 e0 a0 96 94 40 1c 15:11:02.771 READ FPDMA QUEUED
60 08 d8 c0 96 94 40 1b 15:11:02.771 READ FPDMA QUEUED
60 08 d0 c0 95 94 40 1a 15:11:02.771 READ FPDMA QUEUED
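The LBA that smartctl prints next to each error can be reproduced from the register dump. The sketch below assumes classic 28-bit LBA addressing, where SN, CL and CH carry the low 24 bits and the low nibble of DH carries the top 4 bits; the function name is made up for illustration, and larger addresses would not fit in these registers.

```python
def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine a 28-bit LBA from SN, CL, CH and the low nibble of DH."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values from the "Error 16" entry: SN=b0 CL=92 CH=94 DH=40.
lba = lba28_from_registers(0xB0, 0x92, 0x94, 0x40)
print(hex(lba), lba)  # 0x9492b0 9736880
```

The same calculation on the "Error 15" entry (SN=50 CL=90 CH=94) gives 9736272, which matches the value printed in the log.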
check_smart output with this modification:
./check_smart.pl -i ata -d /dev/sdb
CRITICAL: Drive SAMSUNG MZ7LM480HCHP-00003 S/N S1YJNXAG900574: ATA Error Count: 16, Reallocated_Sector_Ct is non-zero (331), Runtime_Bad_Block is non-zero (331), Uncorrectable_Error_Cnt is non-zero (16), |Reallocated_Sector_Ct=331;;;; Power_On_Hours=75546;;;; Power_Cycle_Count=48;;;; Wear_Leveling_Count=6210;;;; Used_Rsvd_Blk_Cnt_Tot=331;;;; Unused_Rsvd_Blk_Cnt_Tot=2129;;;; Program_Fail_Cnt_Total=0;;;; Erase_Fail_Count_Total=0;;;; Runtime_Bad_Block=331;;;; End-to-End_Error=0;;;; Uncorrectable_Error_Cnt=16;;;; Airflow_Temperature_Cel=37;;;; ECC_Error_Rate=16;;;; Current_Pending_Sector=0;;;; CRC_Error_Count=0;;;; Exception_Mode_Status=0;;;; POR_Recovery_Count=37;;;; Total_LBAs_Written=4595275641341;;;; Total_LBAs_Read=50608004051;;;; SATA_Downshift_Ct=1;;;; Thermal_Throttle_St=0;;;; Timed_Workld_Media_Wear=65535;;;; Timed_Workld_RdWr_Ratio=65535;;;; Timed_Workld_Timer=65535;;;; NAND_Writes=6628479574168;;;; ata_errors=16;;;;
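The new ata_errors value is exposed as performance data after the "|" separator, so it can be graphed or post-processed. As a rough sketch (the helper name is hypothetical, and labels containing spaces or quotes are not handled), the label=value;warn;crit;min;max items can be split like this:

```python
from typing import Dict


def parse_perfdata(plugin_output: str) -> Dict[str, str]:
    """Split 'text | label=value;warn;crit;min;max ...' into {label: value}."""
    _, _, perfdata = plugin_output.partition("|")
    values: Dict[str, str] = {}
    for item in perfdata.split():
        label, _, rest = item.partition("=")
        value = rest.split(";", 1)[0]  # keep only the value field
        if label and value:
            values[label] = value
    return values


line = (
    "CRITICAL: Drive SAMSUNG MZ7LM480HCHP-00003: ATA Error Count: 16, "
    "|Reallocated_Sector_Ct=331;;;; Uncorrectable_Error_Cnt=16;;;; ata_errors=16;;;;"
)
print(parse_perfdata(line)["ata_errors"])  # 16
```

Values are kept as strings here because performance data may carry units or decimals in general; convert with int() or float() where the metric is known to be numeric.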