[Query] About the performance of running malloc test cases using bpftime

Yifei-Zhang-Deeproute opened this issue 2 months ago · 2 comments

Hello! While reading your excellent tutorial (eBPF Tutorial by Example 16: Monitoring Memory Leaks), I came across the bpftime project. As the tutorial mentions, memleak has a significant impact on the performance of the monitored target program, so I wanted to try bpftime to reduce that impact. However, when I ran the test after compiling, there seemed to be no obvious improvement. Is this normal? Or could it be caused by the compatibility changes I made to the code during compilation?

Background:

  1. My target test program is example/malloc/victim from the project, modified so that malloc runs at a high frequency:
// victim.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    while (1) {
        void *p = malloc(1024);
        printf("continue malloc...\n");
        usleep(10); // here! make malloc run at a high frequency
        free(p);
    }
    return 0;
}
  2. I use the pidstat tool to monitor the CPU usage of the victim program.
  3. I use the malloc program from the examples to do the monitoring.
  4. The commands I use are:
sudo ~/.bpftime/bpftime load example/malloc/malloc

sudo ~/.bpftime/bpftime start example/malloc/victim

The resulting performance is as follows:

  1. Only the victim program running, without malloc monitoring: (pidstat screenshot)

  2. The malloc tool used directly to monitor the victim program: (pidstat screenshot)

  3. The malloc tool run via bpftime to monitor the victim program: (pidstat screenshot)

You can see that the total CPU usage of the victim is highest when it runs under bpftime. Did I make a mistake somewhere?


In addition, I also tried the memleak tool; my memleak program is the one given in the tutorial.

There was no difference in total CPU usage. Is there something wrong with my testing method? (pidstat screenshot)

I look forward to your reply. Thank you for taking the time to answer my questions!

Yifei-Zhang-Deeproute avatar Sep 18 '25 11:09 Yifei-Zhang-Deeproute

By the way, I also ran the uprobe benchmarks and got the following results:

| project | kernel_uprobe (ns) | userspace_uprobe (ns) | Performance improvement factor |
|---|---|---|---|
| __bench_uprobe_uretprobe | 3676.24 | 336.05 | 11.0x |
| __bench_uretprobe | 3486.93 | 310.24 | 11.2x |
| __bench_uprobe | 2849.77 | 307.5 | 9.3x |
| __bench_read | 53492.53 | 3566.44 | 15.0x |
| __bench_write | 60064.95 | 3578.79 | 16.8x |
| __bench_hash_map_update | 81125.41 | 21200.8 | 3.8x |
| __bench_hash_map_lookup | 30162.8 | 18208.51 | 1.7x |
| __bench_hash_map_delete | 40944.89 | 9621.9 | 4.3x |
| __bench_array_map_update | 19973.82 | 7364.86 | 2.7x |
| __bench_array_map_lookup | 3897.34 | 4395.09 | 0.9x ⚠️ |
| __bench_array_map_delete | 8726.37 | 4482.74 | 1.9x |
| __bench_per_cpu_hash_map_update | 72953.73 | 96343.53 | 0.8x ⚠️ |
| __bench_per_cpu_hash_map_lookup | 37735.07 | 64479.96 | 0.6x ⚠️ |
| __bench_per_cpu_hash_map_delete | 40969.17 | 104821.19 | 0.4x ⚠️ |
| __bench_per_cpu_array_map_update | 19454.74 | 21909.75 | 0.9x ⚠️ |
| __bench_per_cpu_array_map_lookup | 8774.57 | 11214.84 | 0.8x ⚠️ |
| __bench_per_cpu_array_map_delete | 8733.31 | 9799.09 | 0.9x ⚠️ |

Yifei-Zhang-Deeproute avatar Sep 19 '25 06:09 Yifei-Zhang-Deeproute

Hi @Yifei-Zhang-Deeproute, thanks for the detailed report and for trying both the tutorial memleak and the example/malloc pair.

Short answer: bpftime reduces per-probe overhead (you already saw 9–16× on the uprobe microbenchmarks), but your victim changes (printf + usleep(10)) introduce very heavy syscall/context-switch load that dwarfs the uprobe cost. With that pattern, total %CPU of the victim can look higher with bpftime even if the probe itself is cheaper.

Why this happens:

  • usleep(10) ⇒ up to ~100k sleeps/sec → on the order of 100k syscalls + context switches per second. That dominates the time accounting in pidstat/top regardless of the probe engine (see the timing sketch after this list).
  • printf("continue malloc...\n") flushes often and contends on stdio locks. That adds I/O and scheduler noise unrelated to uprobe cost.
  • bpftime moves the probe handler to userspace, so its cycles “count” toward user time rather than kernel time. %CPU can rise even when latency per call falls.
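
As a quick sanity check on the first point, here is a minimal sketch that measures what usleep(10) actually costs on your system; it usually sleeps far longer than the requested 10 µs because of timer and scheduler granularity, which is exactly the load that swamps the per-probe cost:

// sleep_cost.c - measure the real wall-clock cost of usleep(10)
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

static uint64_t now_ns(void) {
  struct timespec t; clock_gettime(CLOCK_MONOTONIC, &t);
  return (uint64_t)t.tv_sec * 1000000000ull + t.tv_nsec;
}

int main(void) {
  const int iters = 10000;
  uint64_t start = now_ns();
  for (int i = 0; i < iters; ++i)
    usleep(10);                      // requested: 10 us per call
  uint64_t elapsed = now_ns() - start;
  // average wall-clock time actually spent per usleep(10) call
  printf("avg usleep(10): %.1f us\n", elapsed / 1000.0 / iters);
  return 0;
}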

You can see the expected effect in our docs/bench: user-space uprobes show ~9–16× lower overhead than kernel uprobes on pure attach/handler loops (your own table matches that).

How to make the comparison fair:

  1. Remove cross-traffic in the victim
// victim_nocontention.c
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

// monotonic clock helper, useful for the per-call timing in step 2
static inline uint64_t ns(void) {
  struct timespec t; clock_gettime(CLOCK_MONOTONIC, &t);
  return (uint64_t)t.tv_sec * 1000000000ull + t.tv_nsec;
}

int main(void) {
  for (volatile int i = 0; i < 50 * 1000 * 1000; ++i) { // tight loop
    void *p = malloc(1024);
    free(p);
  }
  // no printf, no usleep
  return 0;
}
  • No printf, no usleep. This keeps the syscall rate and scheduler effects out of the measurement.
  2. Measure per-call latency and system effects
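
For step 2, a minimal sketch along the same lines (the iteration count is an arbitrary choice): time a fixed number of malloc/free pairs and compare the average per-call latency with no probe, with the kernel-uprobe malloc tool, and with bpftime. System effects can be compared alongside it, e.g. with pidstat -w (voluntary/involuntary context switches) on the victim's PID.

// latency.c - report the average cost of one malloc(1024)/free pair
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

static inline uint64_t ns(void) {
  struct timespec t; clock_gettime(CLOCK_MONOTONIC, &t);
  return (uint64_t)t.tv_sec * 1000000000ull + t.tv_nsec;
}

int main(void) {
  const int iters = 1000000;           // arbitrary; large enough to average out noise
  uint64_t start = ns();
  for (volatile int i = 0; i < iters; ++i) {
    void *p = malloc(1024);
    free(p);
  }
  uint64_t elapsed = ns() - start;
  // compare this number across: no probe, kernel uprobe (malloc tool), bpftime
  printf("avg malloc+free: %.1f ns\n", (double)elapsed / iters);
  return 0;
}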

yunwei37 avatar Sep 23 '25 03:09 yunwei37