
Add debug output to understand calculation of failure probability (FP)

Open saladpanda opened this issue 1 year ago • 2 comments

Please don't actually merge this. I just wanted to publish it here for others who run into the same confusion I did.

When running snapraid smart on my machine, I was surprised to see a failure probability of 84%, even though all 5 SMART values mentioned in the SnapRAID FAQ suggested a perfectly healthy drive. To understand where this high value came from, I added some debug output along the code path for the calculation and saw that it was based on Load_Cycle_Count (193), which is mentioned neither in the FAQ nor in the Backblaze blog posts.

With this patch the output looks like this:

SnapRAID SMART report:

   Temp  Power   Error   FP Size
      C OnDays   Count        TB  Serial          Device    Disk
 -----------------------------------------------------------------------
     41    569       0
|value=0||value/step=0||value=0||result=0.031633| calculated AFR for SMART value 5: 0.031633 (3.113823%)
|value=0||value/step=0||value=0||result=0.047450|calculated AFR for SMART value 187: 0.047450 (4.634185%)
|value=235466||value/step=362||value=255||result=1.822567|calculated AFR for SMART value 193: 1.822567 (83.838958%)
|value=0||value/step=0||value=0||result=0.034067|calculated AFR for SMART value 197: 0.034067 (3.349293%)
|value=0||value/step=0||value=0||result=0.036500|calculated AFR for SMART value 198: 0.036500 (3.584191%)
                        84%  2.0  Z4Z4JCWX        /dev/sda  disk1
     32   2719       0
|value=0||value/step=0||value=0||result=0.031633| calculated AFR for SMART value 5: 0.031633 (3.113823%)
|value=5093425||value/step=7848||value=255||result=1.822567|calculated AFR for SMART value 193: 1.822567 (83.838958%)
|value=0||value/step=0||value=0||result=0.034067|calculated AFR for SMART value 197: 0.034067 (3.349293%)
|value=0||value/step=0||value=0||result=0.036500|calculated AFR for SMART value 198: 0.036500 (3.584191%)
                        84%  0.5  71UTC68RT       /dev/sdd  disk2
     44     39       0
|value=0||value/step=0||value=0||result=0.031633| calculated AFR for SMART value 5: 0.031633 (3.113823%)
|value=0||value/step=0||value=0||result=0.047450|calculated AFR for SMART value 187: 0.047450 (4.634185%)
|value=162||value/step=0||value=0||result=0.000000|calculated AFR for SMART value 193: 0.000000 (0.000000%)
|value=0||value/step=0||value=0||result=0.034067|calculated AFR for SMART value 197: 0.034067 (3.349293%)
|value=0||value/step=0||value=0||result=0.036500|calculated AFR for SMART value 198: 0.036500 (3.584191%)
                         5%  6.0  ZF200GE8        /dev/sdb  parity
      -      -       -  n/a    -  -               /dev/sdh  -
      -      -       -  n/a    -  -               /dev/sdg  -
      -      -       -  n/a    -  -               /dev/sde  -
      -      -       -  n/a    -  -               /dev/sdf  -
     30    606       0  SSD  0.2  Y5IB61BCKNSX    /dev/sdc  -
     31     36       0
|value=0||value/step=0||value=0||result=0.031633| calculated AFR for SMART value 5: 0.031633 (3.113823%)
|value=0||value/step=0||value=0||result=0.034067|calculated AFR for SMART value 197: 0.034067 (3.349293%)
|value=0||value/step=0||value=0||result=0.036500|calculated AFR for SMART value 198: 0.036500 (3.584191%)
                         4%  1.0  S2RXJ9FCB07612  /dev/sdi  -

The FP column is the estimated probability (in percentage) that the disk
is going to fail in the next year.

Probability that at least one disk is going to fail in the next year is 98%.
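
The numbers in the debug output can be reproduced with a few lines of Python. This is only a sketch inferred from the printed values, not a copy of SnapRAID's code: the percentages match 1 - exp(-AFR), each disk's FP matches the largest per-attribute AFR, and the two value/step results for attribute 193 are consistent with an integer step of 649 capped at index 255. The function names, the step of 649, and leaving the SSD out of the combined figure are all assumptions made for illustration.

```python
import math

def bucket(raw_value, step, max_index=255):
    """Integer-divide a raw SMART value by a step and cap the resulting index."""
    return min(raw_value // step, max_index)

def afr_to_probability(afr):
    """Chance of at least one failure within a year for an annual failure rate."""
    return 1.0 - math.exp(-afr)

def disk_fp(afrs):
    """Per-disk FP, assuming the largest per-attribute AFR is the one reported."""
    return afr_to_probability(max(afrs))

def array_fp(disk_probabilities):
    """Chance that at least one disk fails, assuming the disks fail independently."""
    survive = 1.0
    for p in disk_probabilities:
        survive *= 1.0 - p
    return 1.0 - survive

# AFR values copied from the "result=" fields in the report above.
disk1 = disk_fp([0.031633, 0.047450, 1.822567, 0.034067, 0.036500])
disk2 = disk_fp([0.031633, 1.822567, 0.034067, 0.036500])
parity = disk_fp([0.031633, 0.047450, 0.000000, 0.034067, 0.036500])
sdi = disk_fp([0.031633, 0.034067, 0.036500])

# Step 649 is inferred from the output, not taken from the source code.
print(bucket(235466, 649), bucket(5093425, 649), bucket(162, 649))
print(round(100 * disk1), round(100 * disk2), round(100 * parity), round(100 * sdi))
print(round(100 * array_fp([disk1, disk2, parity, sdi])))
```

Because the index is capped at 255, a Load_Cycle_Count of 235,466 and one of 5,093,425 land in the same bucket, which would explain why disk1 and disk2 both report exactly 84%.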

saladpanda, Oct 13 '23 15:10

To compile on Ubuntu:

apt install build-essential autoconf
autoreconf -i
./configure
make

# run the binary:
./snapraid

saladpanda, Oct 13 '23 16:10

Hi @saladpanda

If I read the output correctly, the Load Cycle Count of your disk is 235,466, which is indeed a high value. In the data I analyzed, this appears to be an indicator of potential failure. However, it's important to note that this doesn't necessarily mean your disk will fail, as each case is unique. It's a good idea to check your hard drive's specifications to see what it's rated for, just to be sure. Hard drives are typically rated for around 600,000 cycles.

See for example this discussion:

https://superuser.com/questions/840851/how-much-load-cycle-count-can-my-hard-drive-hypotethically-sustain
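
To put the count in relation to a rating, a rough back-of-the-envelope estimate may help. The sketch below assumes the load cycles accumulated at a constant rate over the power-on days shown in the report, and it uses the 600,000 figure mentioned above as a stand-in for the drive's rating; the actual rating should come from the drive's datasheet. The function names are made up for this example.

```python
def cycles_per_day(load_cycles, power_on_days):
    """Average load cycles per power-on day, assuming a constant rate."""
    return load_cycles / power_on_days

def days_until_rating(load_cycles, power_on_days, rated_cycles=600_000):
    """Power-on days left before the assumed rating is reached (0 if already past)."""
    if load_cycles >= rated_cycles:
        return 0.0
    remaining = rated_cycles - load_cycles
    return remaining / cycles_per_day(load_cycles, power_on_days)

# disk1 from the report: 235,466 load cycles over 569 power-on days.
print(cycles_per_day(235466, 569))
print(days_until_rating(235466, 569))

# disk2 from the report: 5,093,425 load cycles over 2,719 power-on days.
print(days_until_rating(5093425, 2719))
```

With these inputs, disk1 averages a bit over 400 load cycles per power-on day, and disk2 is already well past the assumed 600,000 figure while still running, which supports the point that a high count alone does not mean the disk will fail.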

amadvance, Oct 20 '23 08:10