
argument to allow bees to run in non-daemon mode

Open dim-geo opened this issue 6 years ago • 26 comments

Hello,

please implement an argument in bees to allow bees to be executed in non-daemon mode. When the crawlers are idle, save the state & terminate bees. This would allow lower memory usage, and bees could be executed as a periodic task (daily, weekly). #36 https://github.com/Zygo/bees/issues/36#issuecomment-446276923

dim-geo avatar Jan 13 '19 20:01 dim-geo

If you run bees directly without the beesd wrapper, it already does that: the core daemon doesn't fork into daemon mode. But it won't exit when idle--that's not implemented yet. And that doesn't really depend on daemon vs. foreground mode anyway.

kakra avatar Jan 13 '19 20:01 kakra

It's possible to check all the crawlers to see if they're all idle (deferred), and exit if that is true. The complicating factor is "all the crawlers":

  • during startup it's the empty set or an incomplete set
  • crawlers don't all finish at the same time
  • crawlers don't all start at the same time: new subvols appear over time, and we restart old ones
  • bees changes the filesystem itself, which can give crawlers work to do in later passes

If we don't solve those issues, bees might never meet the requirements of "idle", or bees might exit prematurely because it hit the "idle" condition when there was still lots of work to do. In particular, if there is continuous activity on the filesystem (e.g. bees writing logs there), bees might never exit, which would seem to defeat the purpose of the feature.

Maybe we can simplify the exit condition: record the current filesystem transid at startup, and stop bees when every crawler's min_transid is equal to or above that transid. That condition doesn't care about new subvols, individual crawler lifetimes, changing data, or interactions with bees itself--it just stops bees when all crawlers have caught up with the filesystem as it was at the moment bees started up.
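That exit condition could be sketched in shell as follows. This is only an illustration: it assumes beescrawl.dat stores crawler state as space-separated key/value pairs including a min_transid key, which may differ from the actual file format, and the function names are made up.

```shell
# Hypothetical helper: lowest min_transid across all crawlers,
# assuming lines like "root 257 ... min_transid 100 max_transid 200".
min_crawler_transid() {
	awk '{ for (i = 1; i < NF; i++)
	           if ($i == "min_transid") print $(i + 1) }' "$1" |
		sort -n | head -n 1
}

# True once every crawler has caught up with the transid recorded
# at bees startup time.
all_caught_up() {
	local crawl_file=$1 start_transid=$2 lowest
	lowest=$(min_crawler_transid "$crawl_file")
	[ -n "$lowest" ] && [ "$lowest" -ge "$start_transid" ]
}
```

An external watcher could then poll `all_caught_up /path/to/beescrawl.dat "$transid_at_startup"` and stop bees when it returns true.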

This wouldn't be a complete dedupe--bees relies on pass N+1 to clean up temporary copies done in pass N. The user could just run bees in single-pass mode multiple times to work around that--usually only two or three passes are needed, and the amount of data affected drops very rapidly on each pass.

Zygo avatar Jan 14 '19 16:01 Zygo

@Zygo So I'd imagine something like starting bees with a --one-shot option, it determines the current transid of the filesystem, and crawls the backlog until this transid. Is that your idea?

But I think there should be another option, something like --max-walltime=8h so it would stop after 8 hours no matter whether it reached the transid goal.

kakra avatar Jan 14 '19 17:01 kakra

Walltime doesn't need bees to implement:

  perl -e 'alarm(8*3600); exec @ARGV' bees --one-shot /your/fs

(well, that would work if bees handled SIGALRM the same way as SIGTERM)

We might want bees to exit after freeing some number of GB, but that can be done from outside bees too (have a process watch df and kill bees when it crosses the desired threshold).
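The "watch df and kill bees" idea could be sketched like this. The function names, the free-space goal, and the poll interval are all illustrative, not part of bees:

```shell
# Available space on a filesystem, in KiB (POSIX df output, column 4).
avail_kib() {
	df -P -k "$1" | awk 'NR == 2 { print $4 }'
}

# Kill bees once the filesystem has gained goal_kib of free space,
# or give up when the bees process is gone.
watch_and_stop() {
	local mnt=$1 pid=$2 goal_kib=$3 start
	start=$(avail_kib "$mnt")
	while kill -0 "$pid" 2>/dev/null; do
		if [ $(( $(avail_kib "$mnt") - start )) -ge "$goal_kib" ]; then
			kill "$pid"
			break
		fi
		sleep 60
	done
}
```

Usage might look like `watch_and_stop /your/fs "$bees_pid" $((10 * 1024 * 1024)) &` to stop after roughly 10 GiB has been freed.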

Maybe --one-pass is a better option name?

Zygo avatar Jan 14 '19 19:01 Zygo

@Zygo Well, I used one-shot because it's what's used already in systemd (Type=oneshot) and Gentoo's package manager (emerge --oneshot). Probably it's used by other software, too. But as I just discovered it's actually "oneshot" not "one-shot". The short option for this in emerge is -1.

Basically, in systemd it means "this is a script, run once, then consider the service done". In emerge it means "just install the package(s) without adding them to the world set, so they will be auto-removed the next time nothing depends on them any more".

kakra avatar Jan 14 '19 20:01 kakra

(well, that would work if bees handled SIGALRM the same way as SIGTERM)

Actually I don't read this as "walltime doesn't need bees to implement" ;-)

kakra avatar Jan 14 '19 20:01 kakra

OK, I accept that rationale for --oneshot / -1.

My point was that you could just set up a cron job or external timer to make sure bees doesn't take too long to run (or that you didn't give it enough time to catch up). I'm trying to avoid creeping feature requests like "now that we have --max-time, could it take an absolute time/date argument in the form "next tuesday at 7", but in German?" ;)

If it's something bees can do better because it's bees (like knowing whether it has achieved a crawl-related completion goal) then bees should do it. If it's something any other program on the system can know (like whether 8 hours have passed since some process started) then any other program should do it.

Zygo avatar Jan 14 '19 20:01 Zygo

Yes, I think the job manager (whatever it is, cron/systemd/...) should provide infrastructure to signal the process to quit after some time. But we should document an example for this including the proper signal to send.

I think systemd has TimeoutStartSec (oneshot services) and RuntimeMaxSec (forking services) for this but I'm not sure what signal is sent (man 5 systemd.service).
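For reference, a hypothetical unit fragment for a time-boxed run; the beesd path, instance argument, and 8h limit are examples, not project defaults. systemd's normal stop procedure sends SIGTERM (then SIGKILL after TimeoutStopSec), which bees already handles gracefully:

```ini
[Service]
Type=simple
ExecStart=/usr/sbin/beesd %i
# Terminate the service after 8 hours of runtime.
RuntimeMaxSec=8h
# Allow some time for bees to save state before SIGKILL.
TimeoutStopSec=180
```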

Using the alarm(3600); exec ... sounds like a great idea at first glance, but the man page points out that it may not play well together with sleep(), as sleep may reset the alarm timer. I'm not sure if bees uses sleep(), but it may become a caveat for future code changes. So what's a good alternative? (man 2 alarm)

I don't think cron has such infrastructure by default, at least vixie-cron doesn't. Not sure about other job managers.

BTW: I can totally follow and understand the feature creep argument. ;-)

kakra avatar Jan 14 '19 21:01 kakra

The perl man page talks about sleep on general Unix platforms, not sleep on Linux. On old crappy Unix, sleep is a library function which sets up a SIGALRM with alarm(2) and then pauses, waiting for a signal to interrupt it. Only one sleep call can be active per process, making it unsuitable for Unix platforms that implement threads. That warning in the Perl manpage is older than most Unix platforms that have threads (including early-90's Linux), to give you an idea how old that warning is.

sleep on Linux does not use SIGALRM, so it can be safely combined with code that does. Most of the time bees sleeps with pthread_cond_timedwait (so the threads can coordinate SIGTERM response), and pthread_cond_timedwait is not allowed to interact with signals. There may be a few nanosleeps still around. Modern libc translates sleep to the equivalent nanosleep system call.

I was thinking of having one cron job that starts bees at 22:00, and another one that runs killall bees at 06:00 (or does the equivalent systemd service starts/stops). cron (or systemd timers) can do that.

It's a little more messy if you need to track bees instances by pid, and you're not using cgroups so you can't just say "kill everything in the cgroup." Starting it with perl -e ... gets around that issue, or using a program like timeout which deals with the disappearing-pid issue properly.
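The cron variants described above might look like this in a system crontab; the paths, times, and UUID placeholder are illustrative:

```
# Start a time-boxed run at 22:00; timeout(1) sends SIGTERM after 8h
# and reaps the child itself, avoiding the disappearing-pid issue.
0 22 * * *  root  timeout 8h /usr/sbin/beesd <filesystem-uuid>

# Alternatively: start at 22:00, stop whatever is still running at 06:00.
# 0 22 * * *  root  /usr/sbin/beesd <filesystem-uuid>
# 0 6  * * *  root  killall -TERM bees
```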

Zygo avatar Jan 14 '19 22:01 Zygo

OK, to summarize the feature request:

  • add --oneshot (-1) option to run bees until all crawlers are past the current (measured at bees startup time) filesystem transid, then exit
  • add SIGALRM to the list of signals that terminate bees gracefully, equivalent to SIGTERM and SIGINT

Zygo avatar Jan 14 '19 22:01 Zygo

Yeah, I think that's mostly it except:

  • document an example

@dim-geo Would you agree?

It probably also closes some other requests in the github issues list.

kakra avatar Jan 14 '19 22:01 kakra

Hello, my thoughts:

  1. For the option name, --oneshot / -1 is fine. Gentoo user here!

  2. For SIGALRM: I don't have a strong opinion on this, but I don't see much point... If I want to terminate bees after some time, I can do it anyway by sending a SIGTERM after the time has passed.

  3. For the exit condition, I have a minor objection. I would like bees to crawl all subvolumes up to the "limit transid". The reason is this: as an end user I would like bees to dedupe new snapshots/subvolumes that appeared while it was not running. If it doesn't do that, it will only crawl 'old/existing' subvolumes. Personally I would like to always execute bees with --oneshot and expect it to dedupe almost the same data as if --oneshot was not given. If --oneshot excludes new subvolumes, it will never scan some data unless it's called without --oneshot.

Maybe we can simplify the exit condition: record the current filesystem transid at startup, and stop bees when every crawler's min_transid is equal to or above that transid. That condition doesn't care about new subvols, individual crawler lifetimes, changing data, or interactions with bees itself--it just stops bees when all crawlers have caught up with the filesystem as it was at the moment bees started up.

dim-geo avatar Jan 14 '19 23:01 dim-geo

@dim-geo I think "doesn't care about new subvols..." doesn't mean it ignores them. It just means it doesn't matter which of those things produced the new transids: bees would just crawl the whole fs up to the transid that was current at bees startup time.

kakra avatar Jan 14 '19 23:01 kakra

My proposal so far is to stop bees when transid_min() returns a value that is equal to or larger than what transid_max() returned when bees started up. So everything bees does runs normally, except that bees terminates itself when the last subvol crawler crosses the transid threshold.

SIGTERM handling preserves the crawl state, so everything will resume exactly at the point it stopped on the next bees run. Any new data (including new subvols) that appeared while bees was not running will be picked up when the next run starts.

Termination might not occur when bees is idle--it will actually do at least one complete filesystem scan. Any partially-scanned subvols will be completed before the scan that counts as "one pass" begins. Any subvols that are in mid-scan when the last subvol crosses the threshold will be interrupted (provided they did at least one scan first). Any subvol crawlers that complete their scan will restart immediately if there is more data.

One possible variation on this is that we suppress the crawl-restart and/or new-subvol-detection behaviors while --oneshot is active (or maybe have an additional flag to control this separately). In this variation, new subvols would only be added during startup (i.e. the next bees startup when we read the beescrawl.dat). Crawlers that reach the target transid would drop out, freeing more resources to finish the oneshot faster.

Zygo avatar Jan 14 '19 23:01 Zygo

I think the latter variation is what most people who use this feature want. Tho, the former variation would achieve better throughput over time and keep the backlog smaller... But after all, this feature was requested to free up resources as early as possible (which may haunt you on the next run, tho).

Any opinions which would be preferred? @dim-geo

kakra avatar Jan 14 '19 23:01 kakra

For me --oneshot should crawl at least the new subvolumes it found when it started. For subvolumes appearing during execution, they can be ignored. In reality I think it's better to ignore them because they might be temporary. But again if you think they should be parsed, I don't object :)

dim-geo avatar Jan 14 '19 23:01 dim-geo

Ah, the word "temporary" is a good point. :-)

kakra avatar Jan 15 '19 00:01 kakra

Yeah, some mechanism (that is less crude than the hammer that is --workaround-btrfs-send) for selecting which subvols to ignore might be nice.

I think the goal for many people is to minimize ram_usage * time, either run all the time with a little RAM, or run in very short bursts with lots of RAM (assuming the amount of work done is equal). See also #95 which is aimed at directly reducing the ram-usage-over-time product in a different way.

This is opposed to the original bees design goal, which was to maximize free_space * time without considering variable RAM usage. The assumption was that all the RAM and IO you could ever throw at bees wouldn't be enough, and you'd need to be deduping new data immediately in order to avoid running out of space, so there would never be a point where you could use less than the full amounts of everything. This is still true of large filesystems on big file servers, but little desktops and laptops are idle 90% of the time, and the smaller systems will always outnumber the bigger ones.

The tradeoffs required for the different goals are unexplored. We'll have to try a few ideas (mostly on/off switches for things bees already does) and see what sticks.

Zygo avatar Jan 15 '19 03:01 Zygo

While I'm running with only 75% of my usual amount of RAM (due to a bit-flip error in one memory module), I discovered that RAM is a very precious resource when running btrfs. The system often comes to a complete halt for seconds now, probably blocking on the writeback threads. So I can see why this is a requested feature: running bees only during maintenance windows...

But let's talk about assumptions: I thought your conclusion from running various benchmarks and tests was that throwing much smaller hash table sizes at bees can increase its performance and dedupe hit ratio (tho it's still unknown why the latter happens). So is that assumption perhaps outdated?

But I think it comes down to two options: systems running bees 24x7 because you have the resources and want to free up storage asap (but those systems are probably busy more or less all the time), and systems that can be idle said 90% of the time, mostly during night hours or off-desk hours--those systems would use the --oneshot switch and start bees only at the beginning of a maintenance window.

So --oneshot would be a good first candidate to switch different behavior defaults of bees without adding a lot of additional micro switches.

kakra avatar Jan 16 '19 03:01 kakra

There are some anomalies with the way hash table organization (total size of table and number of bucket cells per hash) affects dedupe rate. Once you reach a certain size, making the hash table bigger doesn't help any more, and sometimes even makes the dedupe rate slightly worse just before the curve goes totally flat. As far as I can tell the hash table implementation is OK, but bees doesn't (yet) build a full backref map and it may be losing opportunities to dedupe based on which subset of the map it read. If you have a hash table that is very large you hit a maximum dedupe rate, and a hash table that is very small has a poor dedupe rate, but there's a few surprising points in between.

I think we are violently agreeing on --oneshot now. There are several related issues that might come up but I'll wait for someone else to notice them. ;)

Zygo avatar Jan 16 '19 05:01 Zygo

While I'm running with only 75% of my usual amount of RAM (due to a bit-flip error in one memory module), I discovered that RAM is a very precious resource when running btrfs. The system often comes to a complete halt for seconds now, probably blocking on the writeback threads.

I run into that issue on fairly large RAM machines with heavy write workloads. Oddly, I find that reducing the amount of available RAM helps with latency, e.g. put all the rsync processes into a cgroup, then set that cgroup's memory.limit_in_bytes to 1GB or 5% of RAM, whichever is smaller. This throttles the creation of dirty pages and limits the total amount of data to write on each commit, so one rsync process doesn't take down the whole system. Even bees seems to benefit (or the whole rest of the system seems not to suffer) from being limited to only a GB or two more than its hash table size.

All the limit really does is keep one or two processes from flooding the entire system, so that btrfs can use the rest of the RAM for...whatever btrfs needs gigabytes of RAM for. If you're really low on memory, then pretending you have less RAM doesn't help so much.
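The "1GB or 5% of RAM, whichever is smaller" limit described above could be set up roughly like this for cgroup v1. The cgroup name and helper names are made up, and setup_cgroup needs root:

```shell
# min(1 GiB, 5% of total RAM), given MemTotal in kB as the argument.
limit_bytes() {
	local gib=$((1024 * 1024 * 1024))
	local five_pct=$(($1 * 1024 / 20))
	if [ "$five_pct" -lt "$gib" ]; then
		echo "$five_pct"
	else
		echo "$gib"
	fi
}

# Create a memory cgroup for rsync and apply the limit (cgroup v1).
setup_cgroup() {
	local total_kb
	total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
	mkdir -p /sys/fs/cgroup/memory/rsync
	limit_bytes "$total_kb" \
		> /sys/fs/cgroup/memory/rsync/memory.limit_in_bytes
}
```

The rsync processes would then be started with their pids written into the cgroup's tasks file (or via systemd's MemoryMax= on cgroup v2 systems).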

Zygo avatar Jan 16 '19 15:01 Zygo

Oddly, I find that reducing the amount of available RAM helps with latency, e.g. put all the rsync processes into a cgroup, then set that cgroup's memory.limit_in_bytes to 1GB or 5% of RAM, whichever is smaller.

This is exactly what I'm doing, I already locked the browsers into cgroups: https://github.com/kakra/gentoo-cgw

But it also occurs with the browsers closed. I've yet to find out which process creates those write bursts. It mostly occurs when a foreground process is active which only reads and which I cannot just switch to the background (read: a game). A parallel htop confirms only reads, then suddenly all HD activity freezes, blocking processes with IO at that moment; after a few seconds I see write activity bursts from multiple processes at once, then everything continues normally.

Somehow I think this is swapping out stuff: An allocation forces memory out to swap, the kernel blocks on IO from various processes (even htop freezes), when it's done, piled-up writeback starts for multiple processes.

Turning the knobs for dirty ratios doesn't really help.

I don't find it odd that forcing smaller writeback buffers on processes improves the situation: smaller buffers mean better latency (but at the cost of less throughput, due to less potential for optimizing write order and coalescing write requests). But yes, btrfs needs at least 2-3 GB of cache RAM, otherwise the system starts to stutter a lot, and below 1 GB very visible freezes occur. It would help more if the kernel allowed configuring a minimum cache size it tries to maintain unless it really cannot allocate any more memory. But currently it happily seems to prefer throwing away cache before swapping stuff out, no matter how I set swappiness. I often have a lot of idle processes sitting around--it should be possible to swap those out.

kakra avatar Jan 16 '19 16:01 kakra

I wonder if this results from a problem documented in the kernel as "allocstall"... I changed the kswapd watermarks and the situation seems to improve.

vm.watermark_scale_factor=200

Its result is that I'm now seeing around 500M-1G of free (as in "unused", not allocated to anything) memory in htop most of the time while previously it may have been below a few MB. Thus, it's sacrificing some memory for improved memory allocation latency. The system feels much smoother under IO load.

I used to have this parameter some months (or more) ago but I'm resetting the sysctl parameters from time to time after a few iterations of kernel upgrades to see if things improved without it.
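To make the setting above survive reboots, it can go into a sysctl.d drop-in (the filename is arbitrary):

```ini
# /etc/sysctl.d/99-watermarks.conf
# Applied at boot, or immediately with `sysctl --system`
vm.watermark_scale_factor = 200
```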

kakra avatar Jan 24 '19 21:01 kakra

Well, here is my implementation of beesd-oneshot:

#!/bin/bash
set -eEuo pipefail

function log() {
	# shellcheck disable=SC2059
	printf '[beesd-oneshot] '"$1"'\n' "${@:2}" >&2
}

# Stores the result in read_fd and write_fd
function create_pipe() {
	# shellcheck disable=SC2216
	sleep infinity | sleep infinity &
	local pid1 pid2 # read_fd write_fd
	pid2=$!
	pid1=$(jobs -p %+)
	exec {write_fd}>/proc/"$pid1"/fd/1 {read_fd}</proc/"$pid2"/fd/0
	disown "$pid2"
	kill "$pid1" "$pid2"
}

create_pipe ; sync_read_fd=$read_fd ; sync_write_fd=$write_fd
create_pipe ; json_status_read_fd=$read_fd ; json_status_write_fd=$write_fd

args=(
	bwrap
	--dev-bind / /
	--sync-fd "$sync_write_fd"
	--json-status-fd "$json_status_write_fd"
	--as-pid-1
	--unshare-pid
	/usr/bin/beesd "$@"
)
{
	exec {sync_read_fd}<&-
	exec {json_status_read_fd}<&-
	exec "${args[@]}"
} &
# bwrap_pid=$!
exec {sync_write_fd}<&-
exec {json_status_write_fd}<&-

function stop() {
	if [[ -v pid ]] ; then
		log 'Killing bees...'
		kill "$pid" || true
	fi
	log 'Waiting for bees to exit...'
	# pwait "$pid" || true
	cat <&"$sync_read_fd"
	log 'bees exited, exiting.'
}

trap stop EXIT ERR
idle=0
lastio=''
while true
do
	sleep 1

	while read -t 0 -u "$json_status_read_fd"
	do
		read -u "$json_status_read_fd" -r line
		log 'Got JSON status line: %s' "$line"
		if [[ "$line" == *'"child-pid"'* ]] ; then
			pid=$(jq -r '.["child-pid"]' <<<"$line")
			log 'Got PID: %d' "$pid"
		fi
		if [[ "$line" == *'"exit-code"'* ]] ; then
			exit_code=$(jq -r '.["exit-code"]' <<<"$line")
			log 'Child exited with status: %d' "$exit_code"
			break
		fi
	done

	if [[ ! -v pid ]]
	then
		log 'bees not yet started...'
	elif read -r _ _ state _ < /proc/"$pid"/stat && io=$(grep '^read_bytes:' /proc/"$pid"/io)
	then
		if [[ "$state" != "S" ]] ; then
			idle=0 ; log 'State is %q.' "$state"
		elif [[ "$io" != "$lastio" ]] ; then 
			idle=0 ; log 'I/O occurred (%s).' "${io//$'\n'/ | }"
		else
			idle=$((idle+1)) ; log 'Idle for %d seconds...' "$idle"
			if [[ "$idle" -ge 15 ]] ; then log 'bees seems idle for %d seconds, stopping.' "$idle" ; break ; fi
		fi
		lastio="$io"
	else
		log 'Process is gone, stopping.'
		break
	fi
done

It uses bwrap to encapsulate all child processes (so that it knows when bees fully exits) and for the mount namespace (so the mount that beesd creates is cleaned up automatically; normally systemd does this).

According to my observations, it seems to work.

CyberShadow avatar Apr 09 '23 20:04 CyberShadow

would it be safe to run a non-forking instance of bees wrapped in timeout(1)? seems simpler than alarm tricks, unless I'm missing something (timeout uses SIGTERM by default, but it's configurable).

(the reason I'm asking is bees interferes with media playback for me, even when treated with --threadcount 1 and all the scheduler tricks in systemd's toolbox, so I'd like to run it on a schedule at night instead of having it constantly run in background. I don't care too much about how much progress it makes at a time, as long as it makes some progress)
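For illustration, timeout(1) sends SIGTERM by default (configurable via --signal) and reaps the child itself, so there is no disappearing-pid race to handle. A small demonstration with sleep standing in for bees:

```shell
# On timeout, GNU timeout kills the child and exits with status 124.
status=0
timeout 1s sleep 10 || status=$?
echo "$status"   # 124 indicates the command was timed out

# A real invocation might look like (path is an example):
# timeout --signal=TERM --kill-after=5m 8h bees /your/fs
```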

cmm avatar Apr 16 '23 10:04 cmm

It is safe. In the worst case, some minutes of activity won't be stored in the state file, so bees would repeat some of its effort at the beginning of the next run.

kakra avatar Apr 16 '23 13:04 kakra