truenas-spindown-timer
Errors during execution on TrueNAS SCALE / S.M.A.R.T. wakes drives
Hello,
I tried to use the script on the latest version of TrueNAS Scale, but without success. The disks do not react and I get errors when running:
Do you have any idea of the cause?
Thank you!
Hi,
please try to use only the drive names (e.g. sda) instead of the full path (e.g. /dev/sda) with the -i option:
/root/spindown.sh -t 10 -p 5 -i sdd -i sda
You can also have a look at the examples in the README file. Hope this helps :)
Indeed, I made a mistake in the syntax. I didn't read the README correctly; the omission of /dev/ seemed weird to me. Sorry, and thanks for the answer!
After checking, I still can't get my drives to spin down correctly on this version of TrueNAS SCALE. It seems that reading the SMART attributes (temperature?) every 5 minutes resets the counter on unused drives.
Here is a picture of one of the tests:
- SDA is my boot-pool (with the system dataset on it), SMART disabled because it is a virtual drive. It is accessed every 5 seconds by TrueNAS.
- SDB & SDC are two different unused hard drives (just with empty datasets), SMART on.
- SDD is an SSD permanently used by external services, SMART on.
No matter what APM value I set (disabled, 1, 127, 128), or whether I enable/disable the SMART system service (with the Never, Standby, or Sleep power mode option), the SDB and SDC drives never spin down.
The counter is reset every 300 seconds, which coincides exactly with the periodic SMART reading.
I also wonder why SDA is reported as "spun down" when it is constantly in use.
Details of disk usage
For more detail, here is the graph of disk usage. The period shown is not quite the right one because I forgot to take the capture at the right time, but my NAS runs like that 24/7:
You reported that "No matter [...] whether I enable/disable the SMART system service [...] drives never spindown."
If I understand it correctly, the drives do not spin down even if you disable S.M.A.R.T. Therefore, S.M.A.R.T. doesn't seem to be the problem here.
Might there be any other services running that cause periodic disk reads? Possible candidates are:
- Running VMs/Jails: https://github.com/ngandrass/truenas-spindown-timer/issues/13
- k3s service: https://www.truenas.com/community/threads/prevent-frequent-reads-from-waking-up-hdds.93176/page-3#post-721803
- You could try to disable the k3s service by issuing:
systemctl disable --now k3s.service
By "SMART service" I meant the system service, which is triggered every 30 minutes (original parameter). I played with each of its parameters without success. At the moment it is completely disabled.
On the drive side, the goal is to leave SMART enabled so that at least a manual SMART check can still be run. It is this option that, in my opinion, generates the data flow seen on the graphs.
I will try disabling it; I'm convinced the disks will then spin down. When I was using the TrueNAS built-in spindown timer, it worked once I had disabled that.
No VMs are running on my system, no jails either (TrueNAS SCALE), and no applications; the dataset for them has not even been created. K3s is not running. When activated, K3s causes heavy CPU usage at idle on my system, but that must not be related to this problem.
With "by-drive" SMART deactivated for SDB :
Thanks for your detailed problem description! :+1:
Your settings for the S.M.A.R.T. service as well as the drives' individual settings look perfectly fine to me. Side note: you are not alone with the K3s problem. I found many people complaining about it consuming a huge amount of idle CPU...
One idea that comes to my mind: reading the S.M.A.R.T. values should NOT cause disk I/O. However, self-tests will. Can you please post a screenshot of your scheduled S.M.A.R.T. tests? You can find them under Tasks > S.M.A.R.T. Tests.
- All explicitly specified tests will prevent the drives from sleeping.
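If you want to verify whether a plain attribute read wakes a drive on your system, a quick manual check could look like this (sdb is just an example device; this is only a diagnostic idea, not part of the script):
hdparm -y /dev/sdb      # force the drive into standby for the test
hdparm -C /dev/sdb      # should now report: drive state is: standby
smartctl -A /dev/sdb    # read the S.M.A.R.T. attributes once
hdparm -C /dev/sdb      # if this reports active/idle again, the attribute read woke the drive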
Thanks for confirming my settings! I think the same; I don't really understand the I/O generated by the tests either.
The original TrueNAS spindown timer works on some disks, even with this permanent I/O caused by SMART tests (or temperature reads). But the problem is that it depends on the disk model, because some interpret it as usage.
I have also seen some feedback about K3s; it's still a pity, especially since I haven't found any solution, even a temporary one. On my side, it's about 15-20% of permanent usage on an 80W TDP CPU. The electricity bill could end up higher than with the 0-1% TrueNAS normally consumes at idle.
Regarding scheduled tests, I don't think we have control over what happens every 5 minutes. In the "Data Protection" tab I have the task set up, which is just one simple SMART test per week, so it must not be related:
An idea came to me. Before I found your script, I modified another script I found on the forums to manage the auto-shutdown of my backup server.
I was looking to accomplish this:
- Monday 9:55am: WakeOnLan of my "TrueNAS BACKUP SERVER"
- Monday 10:00am : boot done, SMART tests are executed + scrub pools
- Monday 10:05am : start backup from my main TrueNAS to "TrueNAS BACKUP SERVER"
- Monday, later : Backup done
- Monday, later + 15 minutes of IDLE : Auto-shutdown of the "TrueNAS BACKUP SERVER"
Despite SMART tests being enabled on 4 disks in 2 pools, and despite the SMART activity every 5 minutes, this server shuts down perfectly after 15 minutes of idle.
When I discovered your script I retired mine, but since yours didn't work I put my script back in place without thinking much about it.
I now understand that the logic differs between your script and the one I adapted for my use (my bash skills are limited, so I don't understand everything in your script).
Yours uses iostat per disk, while mine uses zpool iostat on the pools directly. If that is what is happening, then it is "logical" that my shutdown script works where yours did not (in my case, on the TrueNAS SCALE beta), because the pools show zero activity when no one is performing a data transfer.
For information, here is the script:
#!/bin/bash
# Minimum idle time before shutdown (in seconds)
delay=900
# List of the TrueNAS ZFS pools
pool1=HDD2TO
pool2=HDD500
pool3=boot-pool
# Variable initialization
noscrubs=0
idle=0
while true; do
    # Sample the pool traffic in parallel. Results are stored in files so that all pools are measured at the same time.
    zpool iostat "$pool1" "$delay" 2 | tail -n +5 > zpool1 &
    zpool iostat "$pool2" "$delay" 2 | tail -n +5 > zpool2 &
    #zpool iostat "$pool3" "$delay" 2 | tail -n +5 > bootpool &  # boot-pool is written to permanently. For testing only.
    # Wait for the results ("delay" seconds of waiting)
    wait
    idle1=$( grep -Ec "0 *0 *0 *0" zpool1 )
    idle2=$( grep -Ec "0 *0 *0 *0" zpool2 )
    # If all pools are idle (except boot-pool), set the global idle flag to 1.
    if [ "$idle1" = "1" ] && [ "$idle2" = "1" ]; then
        idle=1
    else
        idle=0
    fi
    # Check the pool status to make sure no scrub is in progress.
    scrub1=$( zpool status "$pool1" | grep -Ec "scrub in progress" )
    scrub2=$( zpool status "$pool2" | grep -Ec "scrub in progress" )
    scrub3=$( zpool status "$pool3" | grep -Ec "scrub in progress" )
    if [ "$scrub1" = "0" ] && [ "$scrub2" = "0" ] && [ "$scrub3" = "0" ]; then
        noscrubs=1
    else
        noscrubs=0
    fi
    #echo $scrub1 $scrub2 $scrub3 $idle1 $idle2  # debug: print the variables
    if [ "$noscrubs" = "1" ] && [ "$idle" = "1" ]; then
        shutdown now & exit
    fi
done
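To clarify how the idle detection above works (the pool name and numbers are purely illustrative):
zpool iostat HDD2TO 900 2 | tail -n +5
# For a single pool, zpool iostat prints three header lines, one cumulative
# sample and, after 900 seconds, one sample covering just that interval.
# tail -n +5 therefore keeps only the interval line, e.g.:
#   HDD2TO       512G  1.31T      0      0      0      0
# grep -Ec "0 *0 *0 *0" then returns 1 when the read/write operation and
# bandwidth columns of that line are all zero, i.e. the pool was idle.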
Ok, your scheduled tasks also look absolutely fine.
Differences in how drives handle the S.M.A.R.T. reads may be an issue though. None of the systems I have looked at so far showed such behavior. This, however, does not mean that it is impossible. Disk controllers are quite a pain when it comes to the uniformity of their S.M.A.R.T. interfaces...
I was unaware of the possibility of detecting I/O at the pool level using zpool iostat. It should be possible to differentiate between a "ZFS pool" and a "disk drive" operation mode within the script. This would take some work and proper testing though.
Taking this idea a little further: can you think of any case in which a drive is NOT part of a zfs pool but needs to be spun down anyway? If not, we could theoretically fully replace disk-level I/O detection with the pool-based approach. However, I'm not very familiar with ZFS¹ and not aware of which operations are factored in by zpool iostat and which are not... At this point I'm a little worried about overlooking cases that might lead to missing important I/O, e.g. during resilvering, scrubs, or other low-level operations.
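As a rough illustration of the kind of guard I have in mind (a hypothetical helper, not something that exists in the script yet):
# Hypothetical guard: report whether low-level maintenance is running on a pool,
# so a pool-based idle check could skip spindown during scrubs or resilvers.
pool_in_maintenance() {
    zpool status "$1" | grep -Eq "(scrub|resilver) in progress"
}
# Example (pool name is a placeholder): pool_in_maintenance tank && echo "skip spindown for tank"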
Using ONLY the pool-based approach should also allow us to replace the differentiation between TrueNAS CORE and SCALE (camcontrol vs hdparm) as a positive side effect :+1:
¹: I'm using TrueNAS only within my basic 8-drive home setup and don't have much experience with more sophisticated setups.
One case that comes to my mind: you might want to keep spare uninitialized drives and would like to spin them down until they are added to a pool (e.g. as a replacement for a failed drive). This might be useful for a remote setup that you cannot access regularly.
I'm not sure how many people would do something like this though...
I'm seeing a lot of the same I/O issues as @jojolll on the 22.12.0 SCALE release. I didn't see the same issues on CORE (with the exact same drives).
I wonder if this is some kind of bug in SCALE.
Also having the same issues as @jojolll and @sebirdman on 22.12.0 SCALE.
Edit: my mistake, I realized I have a few K3s applications pointed at my spinning-rust pool; I'm sure that's the issue.
@jojolll @sebirdman @gpatkinson Do you still experience the described problems?
If so, I'd try to introduce an option to switch between disk (iostat) and pool (zpool iostat) operation modes. This, however, would require quite some adjustments to the script. I could take it as an opportunity to more or less refactor the whole script and get rid of some things that have bothered me since day one, like the quite unintuitive CLI...
Hey everyone,
I created a version that is able to work with both plain device identifiers (i.e. on a per-disk basis) and zpool names (i.e. on a zfs pool basis). You can find it in the feature/zpool-iostat-mode branch.
I was able to successfully test it on my own TrueNAS CORE setup, but I'm missing further tests, especially on SCALE. Therefore I would like to ask you to test whether this version works for you and fixes the problems described above.
As the README still has to be updated, here is a quick summary of the changes:
- A new CLI argument was introduced to switch between disk and zpool operation mode: -u MODE
- When no operation mode is explicitly given, the script works in disk mode. This completely ignores zfs pools and works as before.
- When the operation mode is set to zpool by supplying -u zpool, the script operates on a per-zpool basis. I/O is monitored for the pool as a whole and disks are only spun down if the complete pool was idle for the given number of seconds. ZFS pools are either detected automatically or can be supplied manually (see the help text for -i and -m).
- Drives are referenced by GPTID in zpools (at least on CORE :pray:. This could become a problem on SCALE or other systems since I'm not sure how reliably GPTIDs are used...). Therefore the script creates a mapping from GPTIDs to device identifiers using glabel, to be able to spin down the drives of a pool using camcontrol (CORE) or hdparm (SCALE); see the sketch below.
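To illustrate the idea behind the GPTID mapping, here is a simplified sketch of the approach (not the exact code from the branch):
# Simplified sketch (CORE/FreeBSD): map gptid labels to their parent devices.
# Assumes the usual glabel status column layout: Name  Status  Components
declare -A GPTID_TO_DEV
while read -r name _status component; do
    case "$name" in
        gptid/*) GPTID_TO_DEV["${name#gptid/}"]="${component%p[0-9]*}" ;;  # e.g. ada0p2 -> ada0
    esac
done < <(glabel status | tail -n +2)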
Here is an example of translating a disk-based script invocation to a zpool based one:
- disk: ./spindown-timer.sh -v -t 3600 -p 600 -i da0 -i ada7
- zpool: ./spindown-timer.sh -v -t 3600 -p 600 -u zpool -i freenas-boot -i ssd
Keep in mind that you can always test the script in dry run mode (-d) to circumvent a large number of unwanted spindowns during testing :)
Thanks in advance for your help with testing! :+1:
Hi! It's been a while since I last thought about it; I had finally just let my disks run 24/7...
On this version, it seems that TrueNAS SCALE does not have the prerequisites installed. An error appears in the logs, and no spindown happens. I think it does not detect the physical disks (in zpool mode):
root@truenas[~]# ./v2.sh -v -t 30 -p 10 -u zpool -i boot-pool -i SSD
[2023-02-18 16:19:13] Running HDD Spindown Timer version 2.1.0
[2023-02-18 16:19:13] Operation mode: zpool
[2023-02-18 16:19:13] Using disk parameter tool: hdparm
./v2.sh: line 168: glabel: command not found
[2023-02-18 16:19:13] Detected GPTID to disk identifier mappings:
[2023-02-18 16:19:13] Ignoring zfs pool: boot-pool
[2023-02-18 16:19:13] Ignoring zfs pool: SSD
[2023-02-18 16:19:13] Detected zfs pool: HDDNAS
[2023-02-18 16:19:13] Detecting disks in pool: HDDNAS
[2023-02-18 16:19:13] Monitoring drives with a timeout of 30 seconds:
[2023-02-18 16:19:13] I/O check sample period: 10 sec
[2023-02-18 16:19:13] Drive timeouts:
[2023-02-18 16:19:23] Drive timeouts:
[2023-02-18 16:19:33] Drive timeouts:
[2023-02-18 16:19:43] Drive timeouts:
[2023-02-18 16:19:53] Drive timeouts:
[2023-02-18 16:20:03] Drive timeouts:
[2023-02-18 16:20:13] Drive timeouts:
[2023-02-18 16:20:23] Drive timeouts:
The disk-based mode seems to detect the drives correctly, despite the same error about the missing glabel:
root@truenas[~]# ./v2.sh -v -t 30 -p 10 -i sda2 -i sda1 -d
[2023-02-18 16:24:49] Running HDD Spindown Timer version 2.1.0
[2023-02-18 16:24:49] Performing a dry run...
[2023-02-18 16:24:49] Operation mode: disk
[2023-02-18 16:24:49] Using disk parameter tool: hdparm
./v2.sh: line 168: glabel: command not found
[2023-02-18 16:24:49] Detected GPTID to disk identifier mappings:
[2023-02-18 16:24:49] Detected drive sda as ATA device
[2023-02-18 16:24:49] Detected drive sdb as ATA device
[2023-02-18 16:24:49] Detected drive sdc as ATA device
[2023-02-18 16:24:49] Monitoring drives with a timeout of 30 seconds: sda sdb sdc
[2023-02-18 16:24:49] I/O check sample period: 10 sec
[2023-02-18 16:24:49] Drive timeouts: [sda]=30 [sdb]=30 [sdc]=30
[2023-02-18 16:25:00] Drive timeouts: [sda]=20 [sdb]=20 [sdc]=30
[2023-02-18 16:25:10] Drive timeouts: [sda]=10 [sdb]=10 [sdc]=30
SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0a 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[2023-02-18 16:25:20] Drive is already spun down: sda
[2023-02-18 16:25:20] Would spin down idle drive: sdb. No spindown was performed (dry run).
[2023-02-18 16:25:20] Drive timeouts: [sda]=30 [sdb]=30 [sdc]=20
[2023-02-18 16:25:30] Drive timeouts: [sda]=20 [sdb]=20 [sdc]=10
[2023-02-18 16:25:40] Would spin down idle drive: sdc. No spindown was performed (dry run).
[2023-02-18 16:25:40] Drive timeouts: [sda]=10 [sdb]=10 [sdc]=30
EDIT: The SG_IO: bad/missing sense data, sb[]: output does not seem like a normal message. However, it does not seem to block the script.
EDIT 2: My TrueNAS SCALE version is TrueNAS SCALE Bluefin [release] 22.12.0.
Thanks for the quick test! Could you provide the output of zpool list -v so that I can determine which device identifiers are used when glabel is absent? Thanks!
In theory we could skip the whole GPTID detection in disk operation mode since it does not look up those values anyway.
Regarding the SG_IO message: this seems to originate from spindown_timer.sh#L406. Could you try to execute the following command for your drives sda, sdb, and sdc: hdparm -C /dev/sdX ?
Ok, it seems that GPTID is used on BSD (CORE) only. SCALE seems to identify the disks inside the pool by partuuid.
I updated the script accordingly. With commit 1d1bed03b77704abb2a0285db5a9dfc22a8669a4 the zpool operation mode should (or is it more appropriate to say "could" :pray:) work on SCALE. Please try again :)
EDIT: If this updated version works for you, the zpool list -v output requested above is no longer required.
Thanks for your reply, the new version works in both modes 🥇
The error message actually comes from the SDA disk, the boot-pool. It's a virtual disk in my case (TrueNAS is virtualized under Proxmox). I think hdparm can't read its status correctly because it's a virtual disk. However, the script seems to ignore the error as it is. In any case, there's no reason to ask the script to include this disk; once it is excluded, the error no longer appears.
root@truenas[~]# hdparm -C /dev/sda
/dev/sda:
SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0a 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
drive state is: standby
root@truenas[~]# hdparm -C /dev/sdb
/dev/sdb:
drive state is: active/idle
root@truenas[~]# hdparm -C /dev/sdc
/dev/sdc:
drive state is: active/idle
root@truenas[~]# hdparm -C /dev/sdd
Awesome! I'll wait for the test results from @gpatkinson, update the README, and release a new version if no further errors arise. Thanks for testing :tada:
Regarding the virtual disk: yeah, hdparm will fail for virtual disks and cannot spin them down.
Wow, great work @ngandrass ! Mine appears to work as well. In this first test, sdb has per-disk SMART off and sde has it on. The script does not care either way:
And here's the non-dry-run, per-disk SMART now enabled on both disks of the pool:
Just for clarity:
- TrueNAS SCALE Bluefin 22.12.0 on bare metal
- no VMs running
- k3s and all containers stopped manually
This script and the new zpool iostat functionality appear to be working. I will be testing soon how k3s and running containers affect the script. @jojolll -- have you tested anything with applications / containers at this point? (Not running on the pool to be spun down, obviously; I seemed to be having issues with just having k3s running on the system.)
Awesome! Nice to hear that everything works well for you :+1: Thanks for testing!
I updated the README and created #18 to prepare for release. If no show stoppers arise I could release the new version later today. Any objections?
On my side, SMART is a problem on my newer disk (a Seagate IronWolf 4TB). If I leave the option activated in the disk menu, even though the script works well, the disk spins up after at most 5 minutes of spindown (when TrueNAS reads the SMART data). On another disk I don't have the problem, so it seems to be related purely to the disk's behavior.
I have not tested with K3s, but there is no reason why there should be any data exchange on disks not used by the services :) When I enabled it on my SSD, I never heard my HDDs working, which means there was no I/O.
I'll see if I can find a way to keep SMART from waking up my disk, and if not I'll leave SMART disabled and only run a manual test every now and then :(
I configured my S.M.A.R.T. daemon to execute tests every Sunday night. It wakes the drives only for this task and leaves them sleeping for the rest of the time. At least on CORE... There seem to be a lot of behavioral differences between CORE and SCALE :disappointed:
Just to make sure: Note that having an active shell session inside a directory that resides on a disk can cause I/O and therefore reset the spindown timer.
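If you want to double-check that nothing else is touching a pool, something like the following can help on SCALE, assuming fuser from psmisc is available (the mountpoint is just an example):
fuser -vm /mnt/HDDNAS   # list processes that currently have files open below the pool's mountpoint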
In my case, the timer is never reset. There is no I/O on the pool, even during a SMART read. I think it is the temperature reading on the disk, every 5 minutes, that causes this.
I will try to add options to SMART.
@ngandrass @jojolll One thing I hadn't noticed until this morning: using the TrueNAS GUI to read a disk's settings (without even hitting the 'Edit' button) pulls the disk out of standby/spindown. I had one disk of my mirrored pool spun down this morning and the other active. As soon as I used the GUI to look at the settings of the spun-down drive, it became active again. I guess this makes sense, as it needs to read these settings from the drive itself; I just hadn't considered it, and this could confound testing in the future, FYI.
I've noticed that too, but in my case it's something else, since even after leaving everything alone for 30 minutes, if the script puts my HDD to sleep it will only stay asleep for a short time.
This is rather interesting. I tried to reproduce this behavior but wasn't able to. Even reading the S.M.A.R.T. data of a drive via smartctl -a /dev/ada0 did not wake it.
I'm primarily using Western Digital Red drives on TrueNAS CORE. IMHO it's possible that this behavior depends both on whether CORE or SCALE is used and on how the controller of the respective drive handles S.M.A.R.T. reads...