bees
Systemd service is forced to kill beesd when executing stop
Does the program handle SIGTERM gracefully?
20:24:08 gentoo systemd[1]: [email protected]: State 'stop-sigterm' timed out. Killing.
20:24:08 gentoo systemd[1]: [email protected]: Killing process 29713 (beesd) with signal SIGKILL.
20:24:08 gentoo systemd[1]: [email protected]: Killing process 29741 (bees) with signal SIGKILL.
20:24:08 gentoo systemd[1]: [email protected]: Main process exited, code=killed, status=9/KILL
20:24:10 gentoo systemd[1]: [email protected]: Failed with result 'timeout'.
It's safe to kill bees the hard way at any time... It may just forget what it was doing during the last 15 minutes and rescan that file data the next time you start bees. But it won't harm your data.
BTW: There are commits in the master branch that should enable it to exit gracefully on SIGTERM. I haven't verified that yet. Which version do you use?
'Gracefully' depends on what you think is 'graceful'...
When bees (master branch) receives SIGTERM it tries to complete the current ioctl calls, save dedupe scan/crawl progress, save the in-memory hash table to disk, and exit. On hosts with spinning disks, big hash tables, many worker threads, big RAM, big vm.dirty_ratio, busy host filesystems, or combinations of the above, the SIGTERM shutdown can take over 20 minutes. On a 32GB machine with 9 threads and a 12TB spinning-disk filesystem it's usually more like 2-5 minutes. On an 8-core 16GB machine with 1TB of NVMe storage and a 1GB hash table it's no more than 10 seconds. If the filesystem is idle then bees will already have completed all the necessary disk writes in the background, so SIGTERM can be handled in a few milliseconds.
I suspect you will need to adapt [email protected] with a timeout to match your system configuration and workload (i.e. make the timeout longer until bees doesn't spuriously time out any more).
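For illustration, a longer stop timeout could be set with a systemd drop-in like the one below. This is only a sketch: `TimeoutStopSec=` is a standard systemd service directive, but the `30min` value is an assumption picked to sit above the "over 20 minutes" worst case described above, and the drop-in has to be created for whatever unit name your bees instance actually runs under (for example via `systemctl edit <your-bees-unit>`).

```ini
# Drop-in override for the bees service unit.
# Assumption: 30min is a placeholder; raise or lower it until
# the 'stop-sigterm' timeout no longer fires on your system.
[Service]
TimeoutStopSec=30min
```

After editing, `systemctl daemon-reload` makes systemd pick up the new value.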
If you do send SIGKILL to bees it will just resume from the last point saved when bees starts up again, and repeat the scan of some of the files that were previously scanned between the last save point and the SIGKILL. Progress is saved every 15 minutes and the hash table is slowly updated over a period of about 30 minutes per GB to avoid slamming the disks with the full hash table update rate. When bees gets SIGTERM it tries to force all pending disk updates to happen at once, which can flood the host RAM with dirty pages and trigger IO throttling in Linux VFS, among other undesirable outcomes.
All the on-disk bees file formats are designed explicitly to survive host kernel crashes and power failures. SIGKILL is significantly easier than those. If you prefer to have your latency at startup instead of shutdown, you can skip SIGTERM and proceed directly to SIGKILL.
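If you go that route under systemd, one possible way to do it is to change the signal systemd sends on stop. The fragment below is a sketch, not something shipped with bees: `KillSignal=` is a standard systemd directive, and the drop-in is assumed to be attached to your own bees unit.

```ini
# Drop-in override for the bees service unit.
# Assumption: you accept a partial rescan at the next startup
# in exchange for an immediate stop.
[Service]
KillSignal=SIGKILL
```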
If SIGKILL times out too (i.e. the bees process, pid 29713 in the above example, survives the SIGKILL) then you might be experiencing a kernel bug. The LOGICAL_INO ioctl does not have bounded running time and cannot be interrupted by signals--you just have to wait until kernel threads executing that ioctl finish their work, notice the signal, and exit. Some fixes for assorted deadlock cases were pushed to stable kernels in the last month (not every btrfs kernel deadlock is related to bees, some just happen on any heavy IO workload).
I noticed that beesd dies instantly when it is sent SIGUSR1. Perhaps some way to handle other signals would be useful?
Sure, just add them to the list in block_term_signal...
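The general technique behind a function like that is to block a set of signals so they stay pending instead of triggering their default action, and then collect them deliberately. The sketch below illustrates the idea in Python rather than in bees's own C++; the names `HANDLED_SIGNALS`, `block_signals` and `wait_for_signal` are hypothetical, and the real contents of block_term_signal may differ.

```python
import signal
import threading

# Hypothetical signal list for illustration only. Adding SIGUSR1 here is
# what keeps it from killing the process with its default action.
HANDLED_SIGNALS = {signal.SIGTERM, signal.SIGINT, signal.SIGUSR1}


def block_signals(signals=HANDLED_SIGNALS):
    """Block the given signals in the calling thread; return the old mask."""
    return signal.pthread_sigmask(signal.SIG_BLOCK, signals)


def wait_for_signal(signals=HANDLED_SIGNALS):
    """Sleep until one of the blocked signals is pending, then return it."""
    return signal.sigwait(signals)


if __name__ == "__main__":
    block_signals()
    # Send SIGUSR1 to this thread. Because it is blocked, it stays pending
    # instead of terminating the process.
    signal.pthread_kill(threading.get_ident(), signal.SIGUSR1)
    received = wait_for_signal()
    print("received", signal.Signals(received).name)
```

A real daemon would block the signals before starting its worker threads, so that every thread inherits the mask, and would start its orderly shutdown once `wait_for_signal` returns.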
What is SIGUSR1 useful for?
It was a mistake to send SIGUSR1. I wanted to send SIGTERM to beesd but, well, I pressed the wrong key.