core: add quota support for State, Cache, and Log exec directories
Based on https://github.com/systemd/systemd/issues/7820, add support for quota enforcement to State, Cache, and Log exec directories.
- Add new directives, StateDirectoryQuota, CacheDirectoryQuota and LogDirectoryQuota, to define quotas as percentages (hard limits for blocks and inodes) or absolute values (hard limits for blocks only).
- Add new directives, StateDirectoryQuotaAccounting, CacheDirectoryQuotaAccounting and LogDirectoryQuotaAccounting to keep track of storage quotas but not enforce them (effectively just assigning a project ID to defined exec directories).
Example:
StateDirectory=quotadir
StateDirectoryQuota=1%
Jan 06 22:55:46 abeltran: Storage quotas set for /var/lib/private/quotadir. Block limit = 2639404, inode limit = 671088
root@abeltran:/var/lib/private# lsattr -pR
3153000189 --------------e----P-- ./quotadir
root@abeltran:/var/lib/private# repquota -P /datadrive
*** Report for project quotas on device /dev/sdc1
Block grace time: 7days; Inode grace time: 7days
Block limits File limits
Project used soft hard grace used soft hard grace
----------------------------------------------------------------------
#0 -- 213200 0 0 4086 0 0
#3153000189 -- 2639404 0 2639404 2 0 671088
I added the quota_fd() syscall wrapper now in https://github.com/systemd/systemd/pull/36012, because I need it for #36010 too.
(until that PR is merged, you could cherry-pick it into your PR too, to make use of it already)
fuzzers are unhappy, e.g.
==2015==ERROR: AddressSanitizer: ABRT on unknown address 0x0000000007df (pc 0x7f737ab0200b bp 0x7ffdf63e5210 sp 0x7ffdf63e4fc0 T0)
SCARINESS: 10 (signal)
#0 0x7f737ab0200b in raise (/lib/x86_64-linux-gnu/libc.so.6+0x4300b) (BuildId: 0702430aef5fa3dda43986563e9ffcc47efbd75e)
#1 0x7f737aae1858 in abort (/lib/x86_64-linux-gnu/libc.so.6+0x22858) (BuildId: 0702430aef5fa3dda43986563e9ffcc47efbd75e)
#2 0x7f737b51ae39 in log_assert_failed /work/build/../../src/systemd/src/basic/log.c:996:9
#3 0x7f737b54042c in safe_atollu_full /work/build/../../src/systemd/src/basic/parse-util.c:496:9
#4 0x7f737c15d860 in safe_atollu /work/build/../../src/systemd/src/basic/parse-util.h:81:16
#5 0x7f737c15d860 in safe_atou64 /work/build/../../src/systemd/src/basic/parse-util.h:86:16
#6 0x7f737c15d860 in exec_context_deserialize /work/build/../../src/systemd/src/core/execute-serialize.c:2928:29
#7 0x7f737c153805 in exec_deserialize_invocation /work/build/../../src/systemd/src/core/execute-serialize.c:3948:13
#8 0x557c6f32d37b in exec_fuzz_one /work/build/../../src/systemd/src/core/fuzz-execute-serialize.c:47:16
#9 0x557c6f32d37b in LLVMFuzzerTestOneInput /work/build/../../src/systemd/src/core/fuzz-execute-serialize.c:86:9
so there's one more complication: quota is not just about enforcing limits, but also about accounting current use. and that should be something we can enable separately from actual limits.
Or in other words, we need StateDirectoryAccounting=, CacheDirectoryAccounting=, LogDirectoryAccounting= as booleans. And that should assign a projid, but should no enforce any limit on their own.
(And there probably should be DefaultStateDirectoryAccounting= in system.conf and so on. but we can do this later)
or in other words, there this should copy the concepts we have for CPU accounting for example, where there's also CPUAccounting= and DefaultCPUAccounting=. and we should also have properties that can report current use, and systemctl status should show current disk use just like we do that for cpu accounting.
i think it makes sense to keep things simple for now, hence don't bother with DEfaultStateDirectoryAccounting= for now, unless you want to, and don't bother with new props that report current disk space use, or with the systemctl status. However, I do think StateDirectoryAccounting= itself should exist right-away, since it affects the logic, it's not just an add-on.
Oh, and please add some markdown doc to docs/ that mentions the proj id range we took possession of for this, and describe briefly how it is used. doesn't need to be long, but needs to be there. use UID-GIDS.md as inspiration.
one more thing: let's say one day we have DefaultStateDirectoryAccounting=1, which means that during boot we might within a small time window assign a large number of project ids to a large number of state directories. this creates a chance of races: i.e. we pick a proj id we determine to be free, then assign it to our top-level dir inode. at the very same time some other service that starts picks the same proj id, and figured it was free, but by the time they assign it we already assigned ours too, so suddenly there are two services using the same proj id and all the settings will be garbage.
we should cover for that, and it's relatively easy i think: when picking a proj id, let's first pick it, check via quota if currently unused, if so, assign our dir inode to the proj id, then immediately check again the current quota state of the proj id. it must now report exactly one inode with that proj id. if so, we are fine, but if we see a different number than one, then apparently someone else started using it right the moment we picked it. we should then pick a new proj id, and try again.
i think it makes sense to keep things simple for now, hence don't bother with DEfaultStateDirectoryAccounting= for now, unless you want to, and don't bother with new props that report current disk space use, or with the systemctl status. However, I do think StateDirectoryAccounting= itself should exist right-away, since it affects the logic, it's not just an add-on.
I have added the Accounting directives. I agree that the Default directives and props could be add-ons in subsequent PRs
As a general rule: for all new code alway operate on fds, avoid paths as much as possible. paths are fine as initial entrypoint for things, but once you opened the entrypoint please always continue with fds, i.e. openat(), fchmodat(), fchownat, and chase() and so on, never revert back to regular paths.
CI failures look related. Many networkd tests are failed, which uses a state directory.
@yuwata I fixed the network CI failures. I checked the logs for the new CI failures and don't look related. Could you check as well please?
so we are getting closer, but there's one thing i'd really like to see, just so that we know things work correctly: that "systemctl status" shows current quota use (as well as limit) in its output.
this would require adding a bunch of props that reflect the current quota value. should be fairly easy to implement i guess: when the props are requested go to disk, issue that quotactl syscall and return whatever it says.
in the longer run i think we should make "systemctl set-property" work on these props, because we do that on the cgroups resource mgmt settings (or in other words. support immediate changes without reload), but this can be added later.
Please update dbus document: ninja update-dbus-docs
CIs are unhappy:
../src/core/execute-serialize.c:1882:65: error: unused variable 'value_quota' [-Werror,-Wunused-variable]
_cleanup_free_ char *key_quota = NULL, *value_quota = NULL;
^
1 error generated.
@yuwata could I get another review for this PR?
btw, can you please do some minimal split-up in individual commits. i.e. maybe do the chattr.c changes first, then the quota-util.c changes, then the pid 1 changes, then systemctl changes or so, then another commit for the tests. i.e. 5 commits in total, or so?
btw, can you please do some minimal split-up in individual commits. i.e. maybe do the chattr.c changes first, then the quota-util.c changes, then the pid 1 changes, then systemctl changes or so, then another commit for the tests. i.e. 5 commits in total, or so?
@poettering I did 4, easier to merge pid1 + systemctl into 1
dbus docs need to be updated with ninja -C build update-dbus-docs
if you fix these four little things, should be ready to merge from my pov
Please rebase.
needs a rebase