check_smart
Intel SSD wearout not reported when almost dead
Similar to #73. The disk is failing now but is not reported as critical. The SMART output is:
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 1668
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 2
170 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
174 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 2
175 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 2617 (2 65535)
183 SATA_Downshift_Count 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error_Count 0x0033 100 100 090 Pre-fail Always - 0
187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age Always - 0
190 Drive_Temperature 0x0022 071 063 000 Old_age Always - 29 (Min/Max 19/38)
192 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 2
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 29
197 Pending_Sector_Count 0x0012 100 100 000 Old_age Always - 0
199 CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 7005511
226 Workld_Media_Wear_Indic 0x0032 100 100 000 Old_age Always - 8396
227 Workld_Host_Reads_Perc 0x0032 100 100 000 Old_age Always - 1
228 Workload_Minutes 0x0032 100 100 000 Old_age Always - 100130
232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
233 Media_Wearout_Indicator 0x0032 092 092 000 Old_age Always - 0
234 Thermal_Throttle_Status 0x0032 100 100 000 Old_age Always - 0/0
235 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 2617 (2 65535)
241 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 7005511
242 Host_Reads_32MiB 0x0032 100 100 000 Old_age Always - 71050
243 NAND_Writes_32MiB 0x0032 100 100 000 Old_age Always - 8255662
The disk info
=== START OF INFORMATION SECTION ===
Model Family: Intel S4510/S4610/S4500/S4600 Series SSDs
Device Model: INTEL SSDSC2KB240G8
Serial Number: :)
LU WWN Device Id: 5 5cd2e4 151dfac3f
Firmware Version: XCV10110
User Capacity: 240,057,409,536 bytes [240 GB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Nov 16 14:54:42 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Output for plugin
./check_smart.pl -l -i auto -g '/dev/sd*[a-z]'
OK: [/dev/sda] - Device is clean --- [/dev/sdb] - Device is clean|
./check_smart.pl -v
check_smart.pl v6.13.0
How do you see that the drive is failing now? Any indicators, failures, logs, etc?
As you correctly mentioned, this is the same problem as the linked issue #73. check_smart currently can only read and interpret the "raw values". In this case, the plugin would need to read the "normalized values" which can either be an increasing or decreasing counter (this makes it even more tricky):
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
[...]
233 Media_Wearout_Indicator 0x0032 092 092 000 Old_age Always - 0
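To illustrate the difference between the raw and the normalized value, here is a minimal Python sketch (not the plugin's actual Perl code) that splits one row of the attribute table into its columns. The function name `parse_attribute_line` is hypothetical, and the sketch assumes the ten-column layout shown in the header line above.

```python
def parse_attribute_line(line):
    """Split one row of the `smartctl -A` attribute table into named fields.

    Assumes the ten-column layout from the header:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    # RAW_VALUE may contain spaces, e.g. "29 (Min/Max 19/38)", so split at
    # most 9 times and keep the remainder as a single field.
    fields = line.split(None, 9)
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),   # normalized VALUE column
        "worst": int(fields[4]),   # normalized WORST column
        "thresh": int(fields[5]),  # vendor threshold
        "raw": fields[9],          # RAW_VALUE column, kept as text
    }


row = parse_attribute_line(
    "233 Media_Wearout_Indicator 0x0032 001 001 000 Old_age Always - 0"
)
print(row["value"], row["raw"])  # prints "1 0"
```

For attribute 233 on this drive the raw value stays at 0, so only the `value` field carries the wear information.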
Both disks are the same model in a RAID1, both are at 1% lifetime, and the system is very slow: writes of about 40M and a load average of about 80 on a 6-core machine (waiting for IOPS). After the disks were replaced, everything works fine.
Where do you see 1% lifetime in the SMART table?
Sorry, I posted the wrong SMART output. These are the values from the failing disks:
SDA - 233 Media_Wearout_Indicator 0x0032 001 001 000 Old_age Always - 0
SDB - 233 Media_Wearout_Indicator 0x0032 001 001 000 Old_age Always - 0
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
233 Media_Wearout_Indicator 0x0032 001 001 000 Old_age Always - 0
So value 001 means 1% remaining? Is this one the replacement drive and has 92% remaining?
233 Media_Wearout_Indicator 0x0032 092 092 000 Old_age Always - 0
Yes, the attribute 233 Media_Wearout_Indicator 0x0032 001 001 000 Old_age Always - 0 is from the failing disk, and this one is from the replacement disk, same model: 233 Media_Wearout_Indicator 0x0032 092 092 000 Old_age Always - 0
The number decreases from 100; it is the percentage remaining. More info: https://serverfault.com/questions/641558/media-wearout-indicator-at-043-reason-to-be-worried
As the raw value remains 0, this is kinda tricky and cannot be easily integrated into the existing (raw) checks. We would have to add a new check with its own option (e.g. --ssd-wearout) which looks up the normalized value.
I don't see myself having time in the next weeks though. Code contributions are welcome :D
I'm absolutely fine with that. When it happens, it happens.
I scanned all our servers; these are the attributes that can be reported as wear level in percent:
177 Wear_Leveling_Count
233 Media_Wearout_Indicator
231 SSD_Life_Left
202 Percent_Lifetime_Remain
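A wear check based on the attributes listed above could look roughly like the following Python sketch. It is not part of check_smart: the names `WEAR_ATTRIBUTES` and `check_wearout` and the 20/10 percent thresholds are hypothetical. It also assumes that the normalized VALUE of each of these attributes counts down from 100 as "percent remaining", which should be verified per vendor before relying on it.

```python
# Attribute IDs taken from the list above. Treating their normalized VALUE
# as "percent remaining" is an assumption that may not hold for every vendor.
WEAR_ATTRIBUTES = {
    177: "Wear_Leveling_Count",
    233: "Media_Wearout_Indicator",
    231: "SSD_Life_Left",
    202: "Percent_Lifetime_Remain",
}


def check_wearout(smartctl_output, warn=20, crit=10):
    """Return (status, message) for the first wear attribute found.

    `smartctl_output` is the text printed by `smartctl -A <device>`.
    The warn/crit defaults are illustrative, not recommended values.
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least ten columns and start with a numeric ID;
        # this skips the header line and any surrounding text.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if int(fields[0]) not in WEAR_ATTRIBUTES or not fields[3].isdigit():
            continue
        remaining = int(fields[3])  # normalized VALUE column
        if remaining <= crit:
            status = "CRITICAL"
        elif remaining <= warn:
            status = "WARNING"
        else:
            status = "OK"
        return status, f"{fields[1]} at {remaining}% remaining"
    return "UNKNOWN", "no wear attribute found"


failing = """ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
233 Media_Wearout_Indicator 0x0032 001 001 000 Old_age Always - 0"""
print(check_wearout(failing))
```

With the row from the failing disk this reports a critical state, while the replacement disk at 092 stays OK.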