bpftime
[Query] About the performance of running malloc test cases using bpftime
Hello! While reading your excellent tutorial (eBPF Tutorial by Example 16: Monitoring Memory Leaks), I came across the bpftime project. I ran into exactly what the tutorial mentions: memleak has a significant impact on the performance of the monitored program, so I wanted to use bpftime to reduce that impact. However, when I ran the test after compiling, there seemed to be no obvious optimization effect. Is this normal, or could it be caused by the compatibility changes I made to the code during compilation?
Background:
- My target test program is example/malloc/victim from the project, but I modified it so that malloc runs at a high frequency:
```c
// victim.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    while (1) {
        void *p = malloc(1024);
        printf("continue malloc...\n");
        usleep(10); // here! make malloc run at a high frequency
        free(p);
    }
    return 0;
}
```
- I use the pidstat tool to monitor the CPU usage of the victim program.
- I use the malloc program from the example to do the monitoring.
- The commands I use are:

```sh
sudo ~/.bpftime/bpftime load example/malloc/malloc
sudo ~/.bpftime/bpftime start example/malloc/victim
```
The resulting performance is as follows:
- Only the victim program running, without malloc monitoring: (pidstat screenshot)
- malloc used directly to monitor the victim program: (pidstat screenshot)
- malloc run via bpftime to monitor the victim program: (pidstat screenshot)
You can see that the victim's total CPU usage is actually highest when running under bpftime. Did I make a mistake somewhere?
In addition, I also tried the memleak tool (the memleak program given in the tutorial). There was no difference in total CPU usage either. Is there something wrong with my testing method?
I look forward to your reply, thank you!
By the way, I also ran the uprobe benchmark and got the following results:
| project | kernel_uprobe (ns) | userspace_uprobe (ns) | Performance improvement factor |
|---|---|---|---|
| __bench_uprobe_uretprobe | 3676.24 | 336.05 | 11.0x |
| __bench_uretprobe | 3486.93 | 310.24 | 11.2x |
| __bench_uprobe | 2849.77 | 307.5 | 9.3x |
| __bench_read | 53492.53 | 3566.44 | 15.0x |
| __bench_write | 60064.95 | 3578.79 | 16.8x |
| __bench_hash_map_update | 81125.41 | 21200.8 | 3.8x |
| __bench_hash_map_lookup | 30162.8 | 18208.51 | 1.7x |
| __bench_hash_map_delete | 40944.89 | 9621.9 | 4.3x |
| __bench_array_map_update | 19973.82 | 7364.86 | 2.7x |
| __bench_array_map_lookup | 3897.34 | 4395.09 | 0.9x ⚠️ |
| __bench_array_map_delete | 8726.37 | 4482.74 | 1.9x |
| __bench_per_cpu_hash_map_update | 72953.73 | 96343.53 | 0.8x ⚠️ |
| __bench_per_cpu_hash_map_lookup | 37735.07 | 64479.96 | 0.6x ⚠️ |
| __bench_per_cpu_hash_map_delete | 40969.17 | 104821.19 | 0.4x ⚠️ |
| __bench_per_cpu_array_map_update | 19454.74 | 21909.75 | 0.9x ⚠️ |
| __bench_per_cpu_array_map_lookup | 8774.57 | 11214.84 | 0.8x ⚠️ |
| __bench_per_cpu_array_map_delete | 8733.31 | 9799.09 | 0.9x ⚠️ |
Hi @Yifei-Zhang-Deeproute, thanks for the detailed report and for trying both the tutorial memleak and the example/malloc pair.
Short answer: bpftime reduces per-probe overhead (you already saw 9–16× on the uprobe microbenchmarks), but your victim changes (printf + usleep(10)) introduce very heavy syscall/context-switch load that dwarfs the uprobe cost. With that pattern, total %CPU of the victim can look higher with bpftime even if the probe itself is cheaper.
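To put rough numbers on it using your own table: a probed call costs ~2.8–3.7 µs under kernel uprobes versus ~0.3 µs under bpftime, while every loop iteration also pays at least the requested 10 µs of sleep (usually much more in practice, since short sleeps overshoot), two context switches, and a write() for the printf. The few microseconds saved per probe are easily lost in that per-iteration noise.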
Why this happens:
- `usleep(10)` ⇒ ~100k sleeps/sec → ~100k syscalls plus context switches per second. That dominates the time accounting in `pidstat`/`top` regardless of the probe engine.
- `printf("continue malloc...\n")` flushes often and contends on stdio locks. That adds I/O and scheduler noise unrelated to uprobe cost.
- bpftime moves the probe handler to userspace, so its cycles “count” toward user time rather than kernel time. `%CPU` can rise even when latency per call falls.
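A quick way to confirm this (a sketch, assuming the binary is still named `victim` and sysstat's pidstat is available):

```sh
# CPU usage (-u) plus voluntary/involuntary context switches (-w) of the victim, 1-second interval;
# cswch/s should be roughly the loop's iteration rate, i.e. the usleep wakeups dominate.
pidstat -u -w -p $(pidof victim) 1
```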
You can see the expected effect in our docs/bench: user-space uprobes show ~9–16× lower overhead than kernel uprobes on pure attach/handler loops (your own table matches that).
How to make the comparison fair:
- Remove cross-traffic in the victim
```c
// victim_nocontention.c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

static inline uint64_t ns(void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return (uint64_t)t.tv_sec * 1000000000ull + t.tv_nsec;
}

int main(void) {
    const int iters = 50 * 1000 * 1000;   // tight loop
    uint64_t start = ns();
    for (volatile int i = 0; i < iters; ++i) {
        void *p = malloc(1024);
        free(p);
    }
    uint64_t elapsed = ns() - start;
    // no printf, no usleep in the hot loop; a single line of output at the end
    printf("%.1f ns per malloc/free pair\n", (double)elapsed / iters);
    return 0;
}
```
- No printf and no usleep inside the loop. This keeps the syscall rate and scheduler effects out of the measurement (a single printf after the loop to report the result is fine).
- Measure per-call latency and system effects rather than just %CPU; a sketch of how to run the comparison is below.
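For example (paths follow your commands above; the build flags and exact sequence are just a suggestion):

```sh
# build the quieter victim
gcc -O2 -o victim_nocontention victim_nocontention.c

# 1) baseline: no monitoring at all
./victim_nocontention

# 2) kernel uprobes: start the example malloc tool directly in another terminal, then rerun
sudo example/malloc/malloc
./victim_nocontention

# 3) bpftime userspace uprobes: load the tool in one terminal, start the victim in another
sudo ~/.bpftime/bpftime load example/malloc/malloc
sudo ~/.bpftime/bpftime start ./victim_nocontention
```

Compare the ns-per-malloc/free figure the victim prints across the three runs; with the sleep and printf out of the loop, the gap between kernel uprobes and bpftime should line up much better with the microbenchmark table.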