[Query] About the performance of running malloc test cases using bpftime

Yifei-Zhang-Deeproute opened this issue 2 months ago · 2 comments

Hello! While reading your excellent tutorial (eBPF Tutorial by Example 16: Monitoring Memory Leaks), I came across the bpftime project. As the tutorial mentions, memleak has a significant impact on the performance of the monitored target program, so I wanted to try bpftime to reduce that impact. However, when I ran the test after compiling, there seemed to be no obvious improvement. Is this normal? Or could it be caused by the compatibility changes I made to the code during compilation?

Background:

  1. My target test program is example/malloc/victim from the project, modified so that malloc runs at a high frequency:
// victim.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    while (1) {
        void *p = malloc(1024);
        printf("continue malloc...\n");
        usleep(10); // here! make malloc run at a high frequency
        free(p);
    }
    return 0;
}
  2. I use the pidstat tool to monitor the CPU usage of the victim program.
  3. I use the malloc program from the examples to do the monitoring.
  4. The commands I use are:
sudo ~/.bpftime/bpftime load example/malloc/malloc

sudo ~/.bpftime/bpftime start example/malloc/victim

The resulting performance is as follows:

  1. Only the victim program running, without malloc monitoring: (pidstat screenshot)

  2. The malloc tool used directly to monitor the victim program: (pidstat screenshot)

  3. The malloc tool run via bpftime to monitor the victim program: (pidstat screenshot)

You can see that the total CPU usage of the victim is highest when it runs under bpftime. Did I make a mistake somewhere?


In addition, I also tried the memleak tool; my memleak program is the one given in the tutorial.

There was no difference in total CPU usage. Is there something wrong with my testing method? (pidstat screenshot)

I look forward to your reply. Thank you for taking the time to answer my questions!

Yifei-Zhang-Deeproute avatar Sep 18 '25 11:09 Yifei-Zhang-Deeproute

By the way, I also ran the uprobe benchmarks and got the following results:

| project | kernel_uprobe (ns) | userspace_uprobe (ns) | Performance improvement factor |
|---|---|---|---|
| __bench_uprobe_uretprobe | 3676.24 | 336.05 | 11.0x |
| __bench_uretprobe | 3486.93 | 310.24 | 11.2x |
| __bench_uprobe | 2849.77 | 307.5 | 9.3x |
| __bench_read | 53492.53 | 3566.44 | 15.0x |
| __bench_write | 60064.95 | 3578.79 | 16.8x |
| __bench_hash_map_update | 81125.41 | 21200.8 | 3.8x |
| __bench_hash_map_lookup | 30162.8 | 18208.51 | 1.7x |
| __bench_hash_map_delete | 40944.89 | 9621.9 | 4.3x |
| __bench_array_map_update | 19973.82 | 7364.86 | 2.7x |
| __bench_array_map_lookup | 3897.34 | 4395.09 | 0.9x ⚠️ |
| __bench_array_map_delete | 8726.37 | 4482.74 | 1.9x |
| __bench_per_cpu_hash_map_update | 72953.73 | 96343.53 | 0.8x ⚠️ |
| __bench_per_cpu_hash_map_lookup | 37735.07 | 64479.96 | 0.6x ⚠️ |
| __bench_per_cpu_hash_map_delete | 40969.17 | 104821.19 | 0.4x ⚠️ |
| __bench_per_cpu_array_map_update | 19454.74 | 21909.75 | 0.9x ⚠️ |
| __bench_per_cpu_array_map_lookup | 8774.57 | 11214.84 | 0.8x ⚠️ |
| __bench_per_cpu_array_map_delete | 8733.31 | 9799.09 | 0.9x ⚠️ |

Yifei-Zhang-Deeproute avatar Sep 19 '25 06:09 Yifei-Zhang-Deeproute

Hi @Yifei-Zhang-Deeproute, thanks for the detailed report and for trying both the tutorial memleak and the example/malloc pair.

Short answer: bpftime reduces per-probe overhead (you already saw 9–16× on the uprobe microbenchmarks), but your victim changes (printf + usleep(10)) introduce very heavy syscall/context-switch load that dwarfs the uprobe cost. With that pattern, total %CPU of the victim can look higher with bpftime even if the probe itself is cheaper.

Why this happens:

  • usleep(10) ⇒ up to ~100k sleeps/sec → on the order of 100k syscalls + context switches per second. That dominates the time accounting in pidstat/top regardless of the probe engine (see the timing sketch after this list).
  • printf("continue malloc...\n") flushes often and contends on stdio locks. That adds I/O and scheduler noise unrelated to uprobe cost.
  • bpftime moves the probe handler to userspace, so its cycles “count” toward user time rather than kernel time. %CPU can rise even when latency per call falls.
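
As a quick sanity check on the first point, here is a minimal sketch that measures what usleep(10) actually costs on your system; it usually sleeps far longer than the requested 10 µs because of timer and scheduler granularity, which is exactly the load that swamps the per-probe cost:

// sleep_cost.c - measure the real wall-clock cost of usleep(10)
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

static uint64_t now_ns(void) {
  struct timespec t; clock_gettime(CLOCK_MONOTONIC, &t);
  return (uint64_t)t.tv_sec * 1000000000ull + t.tv_nsec;
}

int main(void) {
  const int iters = 10000;
  uint64_t start = now_ns();
  for (int i = 0; i < iters; ++i)
    usleep(10);                      // requested: 10 us per call
  uint64_t elapsed = now_ns() - start;
  // average wall-clock time actually spent per usleep(10) call
  printf("avg usleep(10): %.1f us\n", elapsed / 1000.0 / iters);
  return 0;
}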

You can see the expected effect in our docs/bench: user-space uprobes show ~9–16× lower overhead than kernel uprobes on pure attach/handler loops (your own table matches that).

How to make the comparison fair:

  1. Remove cross-traffic in the victim
// victim_nocontention.c
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

// monotonic clock helper, useful for the per-call timing in step 2
static inline uint64_t ns(void) {
  struct timespec t; clock_gettime(CLOCK_MONOTONIC, &t);
  return (uint64_t)t.tv_sec * 1000000000ull + t.tv_nsec;
}

int main(void) {
  for (volatile int i = 0; i < 50 * 1000 * 1000; ++i) { // tight loop
    void *p = malloc(1024);
    free(p);
  }
  // no printf, no usleep
  return 0;
}
  • No printf, no usleep. This keeps the syscall rate and scheduler effects out of the measurement.
  2. Measure per-call latency and system effects
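
For step 2, a minimal sketch along the same lines (the iteration count is an arbitrary choice): time a fixed number of malloc/free pairs and compare the average per-call latency with no probe, with the kernel-uprobe malloc tool, and with bpftime. System effects can be compared alongside it, e.g. with pidstat -w (voluntary/involuntary context switches) on the victim's PID.

// latency.c - report the average cost of one malloc(1024)/free pair
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

static inline uint64_t ns(void) {
  struct timespec t; clock_gettime(CLOCK_MONOTONIC, &t);
  return (uint64_t)t.tv_sec * 1000000000ull + t.tv_nsec;
}

int main(void) {
  const int iters = 1000000;           // arbitrary; large enough to average out noise
  uint64_t start = ns();
  for (volatile int i = 0; i < iters; ++i) {
    void *p = malloc(1024);
    free(p);
  }
  uint64_t elapsed = ns() - start;
  // compare this number across: no probe, kernel uprobe (malloc tool), bpftime
  printf("avg malloc+free: %.1f ns\n", (double)elapsed / iters);
  return 0;
}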

yunwei37 avatar Sep 23 '25 03:09 yunwei37