
Check for ATA errors

Open deric opened this issue 5 months ago • 11 comments

The idea is to inspect the ATA device error log, which is also part of the smartctl -a /dev/sda output. If the log contains errors, some operations on the device have already failed, so a non-zero ATA error count would be escalated to a WARNING.
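
A minimal sketch of the proposed escalation, written in Python for brevity (check_smart itself is Perl, and this is not its implementation). The function name `ata_error_status` is hypothetical, and treating a missing `ATA Error Count:` line as zero errors is an assumption; only the line format is taken from smartctl output.

```python
import re

# Nagios-style plugin exit codes.
OK, WARNING = 0, 1

def ata_error_status(smartctl_output: str) -> tuple[int, int]:
    """Return (exit_code, error_count) for `smartctl -l error` output."""
    match = re.search(r"^ATA Error Count:\s*(\d+)", smartctl_output, re.MULTILINE)
    # Assumption: if the line is absent, the error log is empty.
    count = int(match.group(1)) if match else 0
    return (WARNING if count > 0 else OK), count

sample = (
    "SMART Error Log Version: 1\n"
    "ATA Error Count: 16 (device log contains only the most recent five errors)\n"
)
print(ata_error_status(sample))
```

Any non-zero count maps to WARNING here, matching the proposal; a threshold option would be a separate design decision.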

The kernel log might contain errors like this:

kernel: ata3.00: configured for UDMA/133
kernel: sd 2:0:0:0: [sdb] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: sd 2:0:0:0: [sdb] tag#1 Sense Key : Medium Error [current] 
kernel: sd 2:0:0:0: [sdb] tag#1 Add. Sense: Unrecovered read error - auto reallocate failed
kernel: sd 2:0:0:0: [sdb] tag#1 CDB: Read(10) 28 00 11 40 5b c8 00 00 08 00
kernel: blk_update_request: I/O error, dev sdb, sector 289430472
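
As a cross-check on the log excerpt above, the failing sector can be recovered from the CDB line. This sketch assumes the standard READ(10) layout (opcode in byte 0, big-endian LBA in bytes 2-5, transfer length in bytes 7-8) and a 512-byte logical block size, so that the LBA equals the kernel's sector number. The function name `read10_lba` is hypothetical.

```python
def read10_lba(cdb_hex: str) -> tuple[int, int]:
    """Decode (lba, transfer_length) from a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is ignored
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")
    length = int.from_bytes(cdb[7:9], "big")
    return lba, length

# CDB copied from the kernel log line above.
lba, length = read10_lba("28 00 11 40 5b c8 00 00 08 00")
print(lba, length)  # 289430472 8
```

The decoded LBA matches the sector reported by blk_update_request, so both lines describe the same failed read of 8 blocks.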

From the smartctl man page:

   -l TYPE, --log=TYPE
          Prints various device logs.  The valid arguments to this option are:

          error - [ATA] prints the Summary SMART error log.  SMART disks maintain a log of the most recent five non-trivial errors.  For each of these errors, the disk power-on lifetime
          at  which  the error occurred is recorded, as is the device status (idle, standby, etc) at the time of the error.  For some common types of errors, the Error Register (ER) and
          Status Register (SR) values are decoded and printed as text.  The meanings of these are:
             ABRT:  Command ABoRTed
             AMNF:  Address Mark Not Found
             CCTO:  Command Completion Timed Out
             EOM:   End Of Media
             ICRC:  Interface Cyclic Redundancy Code (CRC) error
             IDNF:  IDentity Not Found
             ILI:   (packet command-set specific)
             MC:    Media Changed
             MCR:   Media Change Request
             NM:    No Media
             obs:   obsolete
             TK0NF: TracK 0 Not Found
             UNC:   UNCorrectable Error in Data
             WP:    Media is Write Protected
          In addition, up to the last five commands that preceded the error are listed, along with a timestamp measured from the start of the corresponding power cycle.  This is
          displayed in the form Dd+HH:MM:SS.msec where D is the number of days, HH is hours, MM is minutes, SS is seconds and msec is milliseconds.  [Note: this time stamp wraps after 2^32
          milliseconds, or 49 days 17 hours 2 minutes and 47.296 seconds.]  The key ATA disk registers are also recorded in the log.  The final column of the error log is a  text-string
          description  of  the  ATA  command  defined by the Command Register (CR) and Feature Register (FR) values.  Commands that are obsolete in the most current spec are listed like
          this: READ LONG (w/ retry) [OBS-4], indicating that the command became obsolete with or in the ATA-4 specification.  Similarly, the notation [RET-N] is used to indicate that a
          command  was  retired  in  the  ATA-N specification.  Some commands are not defined in any version of the ATA specification but are in common use nonetheless; these are marked
          [NS], meaning non-standard.
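
The timestamp arithmetic in the man page excerpt can be checked with a short sketch. The helper names `split_ms` and `stamp_to_ms` are hypothetical, and the assumption that the `Dd+` prefix is omitted when the day count is zero is based only on the sample output in this issue.

```python
import re

WRAP_MS = 2**32  # the timestamp counter wraps after this many milliseconds

def split_ms(total_ms: int) -> tuple[int, int, int, int, int]:
    """Split milliseconds into (days, hours, minutes, seconds, msec)."""
    days, rem = divmod(total_ms, 86_400_000)
    hours, rem = divmod(rem, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, msec = divmod(rem, 1000)
    return days, hours, minutes, seconds, msec

# Assumption: the "Dd+" prefix is optional, as in "15:11:02.789".
STAMP = re.compile(r"(?:(\d+)d\+)?(\d{2}):(\d{2}):(\d{2})\.(\d{3})")

def stamp_to_ms(text: str) -> int:
    """Convert a Dd+HH:MM:SS.msec timestamp to milliseconds since power-on."""
    match = STAMP.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised timestamp: {text!r}")
    d, h, m, s, ms = (int(g or 0) for g in match.groups())
    return (((d * 24 + h) * 60 + m) * 60 + s) * 1000 + ms

print(split_ms(WRAP_MS))            # (49, 17, 2, 47, 296)
print(stamp_to_ms("15:11:02.789"))  # 54662789
```

The first result reproduces the 49 days 17 hours 2 minutes 47.296 seconds quoted in the man page.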

The actual output looks like this:

smartctl /dev/sdb -l error
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.10.7] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 16 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 16 occurred at disk power-on lifetime: 9377 hours (390 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 c0 b0 92 94 40  Error: UNC at LBA = 0x009492b0 = 9736880

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 c0 b0 92 94 40 18      15:11:02.789  READ FPDMA QUEUED
  60 08 b8 c8 92 94 40 17      15:11:02.789  READ FPDMA QUEUED
  60 08 b0 90 92 94 40 16      15:11:02.789  READ FPDMA QUEUED
  60 08 a8 60 13 94 40 15      15:11:02.789  READ FPDMA QUEUED
  60 08 a0 f0 13 94 40 14      15:11:02.789  READ FPDMA QUEUED

Error 15 occurred at disk power-on lifetime: 9377 hours (390 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 c0 50 90 94 40  Error: UNC at LBA = 0x00949050 = 9736272

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 c0 50 90 94 40 18      15:11:02.778  READ FPDMA QUEUED
  61 00 b8 00 10 94 40 17      15:11:02.778  WRITE FPDMA QUEUED
  61 e8 b0 40 15 74 40 16      15:11:02.778  WRITE FPDMA QUEUED
  60 08 a8 20 94 94 40 15      15:11:02.778  READ FPDMA QUEUED
  61 40 a0 00 10 74 40 14      15:11:02.778  WRITE FPDMA QUEUED

Error 14 occurred at disk power-on lifetime: 9377 hours (390 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 d0 40 9b 94 40  Error: UNC at LBA = 0x00949b40 = 9739072

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 d0 40 9b 94 40 1a      15:11:02.777  READ FPDMA QUEUED
  60 08 c8 30 9b 94 40 19      15:11:02.777  READ FPDMA QUEUED
  60 08 c0 18 9b 94 40 18      15:11:02.777  READ FPDMA QUEUED
  60 08 b8 08 9b 94 40 17      15:11:02.777  READ FPDMA QUEUED
  60 08 b0 88 9b 94 40 16      15:11:02.777  READ FPDMA QUEUED

Error 13 occurred at disk power-on lifetime: 9377 hours (390 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 38 a0 97 94 40  Error: UNC at LBA = 0x009497a0 = 9738144

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 38 a0 97 94 40 07      15:11:02.773  READ FPDMA QUEUED
  60 08 30 b0 97 94 40 06      15:11:02.773  READ FPDMA QUEUED
  60 08 28 d8 97 94 40 05      15:11:02.773  READ FPDMA QUEUED
  61 08 18 e0 30 d0 40 03      15:11:02.773  WRITE FPDMA QUEUED
  47 00 01 13 00 00 40 1e      15:11:02.773  READ LOG DMA EXT

Error 12 occurred at disk power-on lifetime: 9377 hours (390 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 f0 70 97 94 40  Error: UNC at LBA = 0x00949770 = 9738096

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 f0 70 97 94 40 1e      15:11:02.771  READ FPDMA QUEUED
  60 08 e8 d8 95 94 40 1d      15:11:02.771  READ FPDMA QUEUED
  60 08 e0 a0 96 94 40 1c      15:11:02.771  READ FPDMA QUEUED
  60 08 d8 c0 96 94 40 1b      15:11:02.771  READ FPDMA QUEUED
  60 08 d0 c0 95 94 40 1a      15:11:02.771  READ FPDMA QUEUED
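
To illustrate how the entries above could be consumed, here is a rough Python sketch (not the plugin's Perl code). The names `parse_errors` and `lba28` are hypothetical. The regular expression covers only the simple "Error: UNC at LBA = ..." form seen in this output, and rebuilding the LBA from the low nibble of DH plus CH, CL and SN is an assumption that agrees with the values printed here.

```python
import re

HEADER = re.compile(r"Error (\d+) occurred at disk power-on lifetime: (\d+) hours")
REGISTERS = re.compile(
    r"((?:[0-9a-f]{2} ){7})\s*Error: (\w+) at LBA = 0x([0-9a-f]+) = (\d+)"
)

def lba28(sn: int, cl: int, ch: int, dh: int) -> int:
    """Rebuild a 28-bit LBA from the SN, CL, CH and DH registers."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn

def parse_errors(text: str) -> list[dict]:
    """Collect one record per 'Error N occurred ...' entry."""
    entries = []
    for header in HEADER.finditer(text):
        regs = REGISTERS.search(text, header.end())
        if regs is None:
            continue
        er, st, sc, sn, cl, ch, dh = bytes.fromhex(regs.group(1))
        entries.append({
            "number": int(header.group(1)),
            "lifetime": divmod(int(header.group(2)), 24),  # (days, hours)
            "kind": regs.group(2),
            "lba": int(regs.group(4)),
            "lba_from_registers": lba28(sn, cl, ch, dh),
        })
    return entries

sample = """Error 16 occurred at disk power-on lifetime: 9377 hours (390 days + 17 hours)
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 c0 b0 92 94 40  Error: UNC at LBA = 0x009492b0 = 9736880
"""
for entry in parse_errors(sample):
    print(entry)
```

For error 16 the LBA rebuilt from the registers equals the printed 9736880, and 9377 hours splits into 390 days and 17 hours, as smartctl reports.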

check_smart output with this modification:

./check_smart.pl -i ata -d /dev/sdb
CRITICAL: Drive  SAMSUNG MZ7LM480HCHP-00003 S/N S1YJNXAG900574:  ATA Error Count: 16, Reallocated_Sector_Ct is non-zero (331), Runtime_Bad_Block is non-zero (331), Uncorrectable_Error_Cnt is non-zero (16), |Reallocated_Sector_Ct=331;;;; Power_On_Hours=75546;;;; Power_Cycle_Count=48;;;; Wear_Leveling_Count=6210;;;; Used_Rsvd_Blk_Cnt_Tot=331;;;; Unused_Rsvd_Blk_Cnt_Tot=2129;;;; Program_Fail_Cnt_Total=0;;;; Erase_Fail_Count_Total=0;;;; Runtime_Bad_Block=331;;;; End-to-End_Error=0;;;; Uncorrectable_Error_Cnt=16;;;; Airflow_Temperature_Cel=37;;;; ECC_Error_Rate=16;;;; Current_Pending_Sector=0;;;; CRC_Error_Count=0;;;; Exception_Mode_Status=0;;;; POR_Recovery_Count=37;;;; Total_LBAs_Written=4595275641341;;;; Total_LBAs_Read=50608004051;;;; SATA_Downshift_Ct=1;;;; Thermal_Throttle_St=0;;;; Timed_Workld_Media_Wear=65535;;;; Timed_Workld_RdWr_Ratio=65535;;;; Timed_Workld_Timer=65535;;;; NAND_Writes=6628479574168;;;; ata_errors=16;;;;
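
The new ata_errors metric can be read back from the performance data after the `|` separator. This Python sketch uses a hypothetical helper `parse_perfdata` and assumes unquoted labels without spaces and numeric values, which holds for the line above but not for every Nagios plugin.

```python
def parse_perfdata(plugin_output: str) -> dict:
    """Map each perfdata label to its numeric value (thresholds are ignored)."""
    _, _, perf = plugin_output.partition("|")
    metrics = {}
    for item in perf.split():
        label, _, rest = item.partition("=")
        value = rest.split(";", 1)[0]
        try:
            metrics[label] = int(value)
        except ValueError:
            metrics[label] = float(value)
    return metrics

# Shortened copy of the plugin output above.
line = (
    "CRITICAL: Drive  SAMSUNG MZ7LM480HCHP-00003 S/N S1YJNXAG900574:  "
    "ATA Error Count: 16, |Reallocated_Sector_Ct=331;;;; ata_errors=16;;;;"
)
print(parse_perfdata(line))  # {'Reallocated_Sector_Ct': 331, 'ata_errors': 16}
```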

deric Sep 11 '24 18:09