check_smart: Intel SSD wearout not reported when almost dead

Similar to #73. The disk is failing now but is not reported as critical. The SMART data:

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1668
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       2
175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       2617 (2 65535)
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error_Count  0x0033   100   100   090    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Drive_Temperature       0x0022   071   063   000    Old_age   Always       -       29 (Min/Max 19/38)
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       2
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       29
197 Pending_Sector_Count    0x0012   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       7005511
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       8396
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       1
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       100130
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   092   092   000    Old_age   Always       -       0
234 Thermal_Throttle_Status 0x0032   100   100   000    Old_age   Always       -       0/0
235 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       2617 (2 65535)
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       7005511
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       71050
243 NAND_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       8255662

The disk info

=== START OF INFORMATION SECTION ===
Model Family:     Intel S4510/S4610/S4500/S4600 Series SSDs
Device Model:     INTEL SSDSC2KB240G8
Serial Number:    :)
LU WWN Device Id: 5 5cd2e4 151dfac3f
Firmware Version: XCV10110
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Nov 16 14:54:42 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Output for plugin

./check_smart.pl -l -i auto -g '/dev/sd*[a-z]'
OK: [/dev/sda] - Device is clean --- [/dev/sdb] - Device is clean|
./check_smart.pl -v
check_smart.pl v6.13.0

Nov 16 '22 14:11 pschonmann

How do you see that the drive is failing now? Any indicators, failures, logs, etc?

As you correctly mentioned, this is the same problem as the linked issue #73. check_smart currently can only read and interpret the "raw values". In this case, the plugin would need to read the "normalized values" which can either be an increasing or decreasing counter (this makes it even more tricky):

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
[...]
233 Media_Wearout_Indicator 0x0032   092   092   000    Old_age   Always       -       0

Nov 16 '22 14:11 Napsty

Same disks in RAID1, both at 1% lifetime remaining, and the system is very slow. Writes are at about 40M and the load average is about 80 on a 6-core machine (waiting for IOPS). After the disks were replaced, everything works fine.

Nov 16 '22 14:11 pschonmann

Where do you see 1% lifetime in the SMART table?

Nov 16 '22 14:11 Napsty

Sorry, I posted the wrong SMART output. Here are the values from the failing disks:

SDA - 233 Media_Wearout_Indicator 0x0032 001 001 000 Old_age Always - 0
SDB - 233 Media_Wearout_Indicator 0x0032 001 001 000 Old_age Always - 0

Nov 16 '22 14:11 pschonmann

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
233 Media_Wearout_Indicator 0x0032 001 001 000 Old_age Always - 0

So value 001 means 1% remaining? Is this one the replacement drive and has 92% remaining?

233 Media_Wearout_Indicator 0x0032   092   092   000    Old_age   Always       -       0

Nov 16 '22 14:11 Napsty

Yes, this attribute is from the failing disk:

233 Media_Wearout_Indicator 0x0032 001 001 000 Old_age Always - 0

And this one is from the replacement disk, same model:

233 Media_Wearout_Indicator 0x0032 092 092 000 Old_age Always - 0

The number decreases from 100; it is the percent remaining. More info: https://serverfault.com/questions/641558/media-wearout-indicator-at-043-reason-to-be-worried

Nov 16 '22 15:11 pschonmann

As the raw value remains 0, this is kinda tricky and cannot be easily integrated into the existing (raw) checks. We would have to add a new check with its own option (e.g. --ssd-wearout) which looks up the normalized value. I don't see myself having time in the next weeks though. Code contributions are welcome :D
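To illustrate what such a check could look like, here is a minimal sketch in Python (not the plugin's Perl, and not taken from check_smart itself). It assumes the normalized VALUE column is the percentage of life remaining, as discussed above for attribute 233. The names `parse_attribute_line` and `check_wearout` and the default thresholds of 10 and 5 are hypothetical.

```python
from typing import NamedTuple, Tuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized value (VALUE column)
    worst: int
    thresh: int
    raw: str     # raw value, kept as text because it may contain spaces


def parse_attribute_line(line: str) -> SmartAttribute:
    """Parse one row of the 'smartctl -A' attribute table."""
    # Split into at most 10 fields so a raw value such as
    # "29 (Min/Max 19/38)" stays together in the last field.
    fields = line.split(None, 9)
    if len(fields) < 10:
        raise ValueError(f"not a SMART attribute row: {line!r}")
    return SmartAttribute(
        attr_id=int(fields[0]),
        name=fields[1],
        value=int(fields[3]),
        worst=int(fields[4]),
        thresh=int(fields[5]),
        raw=fields[9],
    )


def check_wearout(attr: SmartAttribute, warn: int = 10, crit: int = 5) -> Tuple[int, str]:
    """Return a Nagios-style (exit code, message) pair.

    Assumption: attr.value counts down from 100 and means percent remaining.
    The warn/crit defaults are illustrative only.
    """
    if attr.value <= crit:
        return 2, f"CRITICAL: {attr.name} at {attr.value}% remaining"
    if attr.value <= warn:
        return 1, f"WARNING: {attr.name} at {attr.value}% remaining"
    return 0, f"OK: {attr.name} at {attr.value}% remaining"


if __name__ == "__main__":
    failing = "233 Media_Wearout_Indicator 0x0032   001   001   000    Old_age   Always       -       0"
    print(check_wearout(parse_attribute_line(failing)))
```

With the two rows from this thread, the failing disk (normalized value 001) would come out as critical and the replacement disk (092) as OK, even though the raw value is 0 in both cases.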

Nov 16 '22 15:11 Napsty

I'm absolutely fine with that. When it happens, it happens.

Nov 16 '22 16:11 pschonmann

I scanned all our servers; these are the attributes that can report the wear level in percent:

177 Wear_Leveling_Count
233 Media_Wearout_Indicator
231 SSD_Life_Left
202 Percent_Lifetime_Remain
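A rough Python sketch of how these four attributes could be picked out of `smartctl -A` output follows. It assumes that for all four the normalized VALUE column is the percent remaining; that matches attribute 233 in this thread, but it may not hold for every vendor and should be verified per drive model. `WEAR_ATTRIBUTES` and `find_wear_values` are hypothetical names.

```python
import re
from typing import Dict

# Attribute IDs listed above. Assumption: normalized VALUE = percent remaining.
WEAR_ATTRIBUTES = {
    177: "Wear_Leveling_Count",
    202: "Percent_Lifetime_Remain",
    231: "SSD_Life_Left",
    233: "Media_Wearout_Indicator",
}

# Matches the start of an attribute row: ID, NAME, FLAG, VALUE, WORST, THRESH.
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s")


def find_wear_values(smartctl_output: str) -> Dict[int, int]:
    """Map each wear-related attribute ID found to its normalized value."""
    found = {}
    for line in smartctl_output.splitlines():
        match = ROW.match(line)
        if match and int(match.group(1)) in WEAR_ATTRIBUTES:
            found[int(match.group(1))] = int(match.group(3))
    return found


if __name__ == "__main__":
    sample = """\
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1668
233 Media_Wearout_Indicator 0x0032   092   092   000    Old_age   Always       -       0
"""
    for attr_id, value in find_wear_values(sample).items():
        print(f"{attr_id} {WEAR_ATTRIBUTES[attr_id]}: {value}")
```

Rows for other attributes and non-table lines are skipped, so the full `smartctl -A` text can be passed in unchanged.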

Nov 29 '22 12:11 pschonmann
