pcp
pcp copied to clipboard
pmdabpf sets a too-low limit for rlimit
Making the same mistake as 640KB ought to be enough for anybody, I picked a 100MB rlimit setting during bpf.c setup. Now, when another (non-PCP) bpf application is loaded, we have a conflict.
Proposal:
- If the existing memlock rlimit value is greater than the 100MB target, do not try to change it.
- Alternatively, we could simply ignore rlimit and document the failure / push it to user responsibility (eg: systemd unit).
Reproduce:
- Start other bpf applications that use > 100MB of space
- Set ulimit -l unlimited for the root user
- Start dbpmda and open the pmda
-
- The PMDA sets rlimit 100MB
- dbpmda fails to start
With another program running, consuming ~350MB of maps and progs
(root) /home/user # ulimit -l unlimited
(root) /home/user # dbpmda
dbpmda> open dso /usr/lib/pcp/pmdas/bpf/pmda_bpf.so bpf_init 157
[Wed Mar 13 18:34:52] dbpmda(4849) Info: loading caches
[Wed Mar 13 18:34:52] dbpmda(4849) Info: loading configuration: /var/lib/pcp/pmdas/bpf/bpf.conf
[Wed Mar 13 18:34:52] dbpmda(4849) Info: loaded configuration: /var/lib/pcp/pmdas/bpf/bpf.conf
[Wed Mar 13 18:34:52] dbpmda(4849) Info: setrlimit RMLIMIT_MEMLOCK ok
[Wed Mar 13 18:34:52] dbpmda(4849) Info: loading modules
[Wed Mar 13 18:34:52] dbpmda(4849) Info: loading modules
[Wed Mar 13 18:34:52] dbpmda(4849) Info: loading runqlat.so
[Wed Mar 13 18:34:52] dbpmda(4849) Info: loading: runqlat.so from /var/lib/pcp/pmdas/bpf/modules/runqlat.so
[Wed Mar 13 18:34:52] dbpmda(4849) Info: initialising runqlat.so
[Wed Mar 13 18:34:52] dbpmda(4849) Info: booting: runqlat_bpf
[Wed Mar 13 18:34:52] dbpmda(4849) Warning: libbpf: Error in bpf_object__probe_loading():Operation not permitted(1). Couldn't load trivial BPF program. Make sure your kernel supports BPF (CONFIG_BPF_SYSCALL=y) and/or that RLIMIT_MEMLOCK is set to big enough value.
[Wed Mar 13 18:34:52] dbpmda(4849) Warning: libbpf: failed to load object 'runqlat_bpf'
[Wed Mar 13 18:34:52] dbpmda(4849) Warning: libbpf: failed to load BPF skeleton 'runqlat_bpf': -1
[Wed Mar 13 18:34:52] dbpmda(4849) Error: bpf load failed: -1, Operation not permitted
...
In this case, bpftrace also shows that bpf charge is failing:
# sudo bpftrace -e 'kretprobe:__bpf_prog_charge /comm == "dbpmda"/ { printf("%d\n", retval); }'
Attaching 1 probe...
-1
-1
-1
-1
because 85322 pages (350MB) are already counted against the user, and the current limit is 25600 pages (100MB).
# sudo bpftrace -e 'kprobe:__bpf_prog_charge /comm == "dbpmda"/ { printf("user_struct->locked_vm.counter=%d pagesu_normal_store
=%d curtask->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> 12=%d\n", ((struct user_struct *)arg0)->locked_vm.counter, arg1, (curtask->signal->rliu_panic
m[8].rlim_cur >> 12) ); }'
Attaching 1 probe...
user_struct->locked_vm.counter=85322 pages=1 curtask->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> 12=25600
user_struct->locked_vm.counter=85322 pages=1 curtask->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> 12=25600
user_struct->locked_vm.counter=85322 pages=1 curtask->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> 12=25600
user_struct->locked_vm.counter=85322 pages=1 curtask->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> 12=25600
For what it's worth, in 5.11+ kernels this is changed to cgroup accounting, so the effect is only apparent on older kernels which also have the pmdabpf built (ie: this might be something only we are seeing on legacy OS distributions).
@jasonk000 What was the original goal of setting the MEMLOCK limit in pmdabpf? (making sure we can lock down space we need but nothing extremely large, I guess?)
Another option might be to look at physical memory size (sysconf(3) _SC_PHYS_PAGES or _SC_AVPHYS_PAGES) and then calculate a value to ensure we set a lock limit that's a fraction of what is available as physical memory on the machine (e.g. no more than a 10th - some heuristic, anyway - perhaps clamped to a minimum value of 100MB?)
Another option could be to say we trust ourselves, and anyone configuring pmdabpf modules must already have root access, so remove all limits for the pmdabpf process (RLIM_INFINITY?)?
Background context -- In the original stages, before the memory using resources were accounted against the cgroup, they were tracked and limited by using memlock. This is relevant for maps, since you can specify arbitrary sized maps which can be very big (we have some maps that are defined as over 100MB). As a common approach, many libbpf users take a setrlimit approach because it was a more clear error scenario than to fail with EPERM (since bpf normally runs as root, a setrlimit to a high enough value is nearly always going to work). With newer kernels (5.11+) this is no longer necessary since the limiting is applied at cgroup rather than at rlimit. I don't think there's a lot of value picking some arbitrary value because it will always be either too-high, or it will be too-low, unless the operator knows the app they are using.
Personally I think of a couple of approaches, infinity works. What I am currently doing on private build is using systemd to set LimitMEMLOCK=infinity on pmcd.service and then commented out the rlimit code and relying on the operator and systemd to enforce controls.. I think this is the "right" answer for systemd use cases, but I'm not familiar with other service management platforms.
OK, should we just change bpf_setrlimit to unilaterally set RMLIMIT_MEMLOCK rlim_cur/max to infinity then? That's simpler than the current code in there and means no systemd changes are needed for everyone else...
$ git diff .
src/pmdas/bpf/bpf.c
--- /tmp/git-blob-eMifIb/bpf.c 2024-03-14 15:01:22.590938902 +1100
+++ src/pmdas/bpf/bpf.c 2024-03-14 15:01:14.394923471 +1100
@@ -127,33 +127,19 @@ int bpf_printfn(enum libbpf_print_level
return 0;
}
-#define WANT_MEM (100LL*1024*1024)
/**
- * setrlimit required for BPF loading ... try for WANT_MEM, but will
- * accept whatever we can get
+ * setrlimit required for BPF loading ... try for infinite memlocking
+ * as we have no idea how much we will really need (it depends on all
+ * loaded eBPF programs).
*/
void bpf_setrlimit()
{
- struct rlimit rnew;
- int ret;
- ret = getrlimit(RLIMIT_MEMLOCK, &rnew);
- if (ret < 0) {
- pmNotifyErr(LOG_ERR, "bpf_setrlimit: getrlimit RMLIMIT_MEMLOCK failed: %s", pmErrStr(-errno));
- return;
- }
- if (rnew.rlim_max == RLIM_INFINITY)
- rnew.rlim_cur = rnew.rlim_max = WANT_MEM;
- else if (rnew.rlim_max > WANT_MEM)
- rnew.rlim_cur = rnew.rlim_max = WANT_MEM;
- else {
- rnew.rlim_cur = rnew.rlim_max;
- pmNotifyErr(LOG_INFO, "bpf_setrlimit: setrlimit RMLIMIT_MEMLOCK %lld not %lld", (long long)rnew.rlim_cur, WANT_MEM);
- }
- ret = setrlimit(RLIMIT_MEMLOCK, &rnew);
- if (ret == 0)
+ struct rlimit rnew = { RLIM_INFINITY, RLIM_INFINITY };
+
+ if (setrlimit(RLIMIT_MEMLOCK, &rnew) == 0)
pmNotifyErr(LOG_INFO, "setrlimit RMLIMIT_MEMLOCK ok");
else
- pmNotifyErr(LOG_ERR, "setrlimit RMLIMIT_MEMLOCK (%lld,%lld) failed: %s", (long long)rnew.rlim_cur, (long long)rnew.rlim_max, pmErrStr(-errno));
+ pmNotifyErr(LOG_ERR, "setrlimit RMLIMIT_MEMLOCK (infinity) failed: %s", pmErrStr(-errno));
}
/**
I think that works too, so long as we do not bail on error from setrlimit(). I do wonder how many supported environments are building pmdabpf for <= linux 5.11. It might be better to make this dead code and remove it, (which is why I leaned to removing code + systemd unit, less to maintain..).
OK, removing it entirely works for me too - especially if we can make the systemd unit conditional on kernel version, or even put the responsibility of managing that onto the user entirely (maybe we could just document it in the man page?).
I have started to take a look at this issue. I have some questions:
Any suggestions on ready-to-run bpftrace script (or other bpf-based tooling) that would use more than the 100MB? In the proposed patch from March 14, shouldn't the RMLIMIT_MEMLOCK be RLIMIT_MEMLOCK?
I have started to take a look at this issue. I have some questions:
Awesome!
Any suggestions on ready-to-run bpftrace script (or other bpf-based tooling) that would use more than the 100MB?
I think you could use something like bpftool(8), specifically bpftool map create. I haven't tested this but I think it should work. Not sure if you would need to also insert entries. Let me know if you need a specific tool/script.
In the proposed patch from March 14, shouldn't the RMLIMIT_MEMLOCK be RLIMIT_MEMLOCK?
Agree!
Early comments mention "in 5.11+ kernels this is changed to cgroup accounting, so the effect is only apparent on older kernels". Current fedora kernels are much more recent and RHEL9 has linux 5.14 kernel so they shouldn't be affected. RHEL8 has a 4.18 kernel, but its libbpf-0.5.0 is too old to build pcp with the bpf pmda. Know of a "Goldilock" distro with a old enough kernel but new enough libbpf to allow testing?
@wcohen we use a vendored libbpf/bpftool in PCP, code below vendor/github.com/libbpf (@stanfordcox was most recent person to update this IIRC, he may have more details for you).
We've hit this issue with the Ubuntu LTS releases and their rolling kernel releases. We also backport libbpf at times so it's not fresh in my mind how reproducible this situation is on mainstream configurations.
I am a bit leary about carte blanc setting RLIMIT to RLIM_INFINITY as pcp git commit c35de633682e4edfa46898e71e2b7468232729fd by @kmcdonell below clamps the limit to 100MB or less to allow the tests to run on small memory QA machine:
Author: Ken McDonell [email protected] 2022-07-06 20:32:03 Committer: Ken McDonell [email protected] 2022-07-06 20:32:03 Parent: bdb3abdfdbe131f1e8b901c8484966b08bd38e17 (src/pmlogger/src/pass0.c: correct typo in pmLookupDescs() diagnostic message) Child: a81bf9f172076ebc323228f5bd9c5791320c374f (qa/admin/package-lists/*: add bcc-tools or bpfcc-tools) Branches: master, remotes/origin/master, remotes/origin/wcohen/eBPF_command_1900, remotes/origin/wcohen/qa1900_nodebuginfo, remotes/origin/wcohen/valgrind_suppress1900, remotes/origin/wcohen/valgrind_suppress1900a, wcohen/1915, wcohen/eBPF_command_1900, wcohen/qa1900_nodebuginfo, wcohen/rebase, wcohen/rebase2, wcohen/rebase3, wcohen/valgrind_suppress1900, wcohen/valgrind_suppress1900a Follows: 3.12.2 Precedes:
src/pmdas/bpf/bpf.c: rework setrlimit() logic
Set RLIMIT_MEMLOCK to min of 100Mb or the current rlim_max
(was failing on some small memory QA hosts when the requested
value was too big).
Also clean up the diagnostics a bit in this section of the code.
@wcohen I am not sure the circumstances that triggered the 100Mb clamping still apply ... lots of things have moved on in the 2 years since that commit.
I'd vote for letting the RLIM_INFINITY change go, and if we have fallout in QA or elsewhere, a better fix would be to support a new $PCP_BPFPMDA_MEMLOCK (or similar) environment variable that can be used to tune the maximum for cases (like QA) where RLIM_INFINITY may not be appropriate.
Else we can just add support for $PCP_BPFPMDA_MEMLOCK now as it is a trivial extension to Nathan's patch, and wait to use it if needed.