bpftrace
bpftrace copied to clipboard
features by example
I created this document a while ago when I was evaluating bpftrace, and realized that it may make a useful addition (after cleaning it up a bit more): eg, as docs/feature_examples.txt. Sound ok?
I wrote it to see what was done and not done, and was pleased to find that bpftrace is coming along very well (70% features done, excluding a couple that are minor). I just made a pass of this document and filed tickets for the missing things.
Features by example
Kernel-level features
Dynamic tracing, kernel-level (kprobes)
DONE
# bpftrace -e 'kprobe:sys_nanosleep { printf("sleep by %d\n", tid); }'
Attaching 1 probe...
sleep by 1396
sleep by 3669
sleep by 1396
sleep by 27662
sleep by 3669
^C
Dynamic tracing, kernel-level arguments (kprobes)
DONE
# ./bpftrace -e 'kprobe:do_sys_open { printf("opening: %s\n", str(arg1)); }'
Attaching 1 probe...
opening: /proc/cpuinfo
opening: /proc/stat
opening: /proc/diskstats
opening: /proc/stat
opening: /proc/vmstat
[...]
# ./bpftrace -e 'kprobe:do_sys_open { printf("open flags: %d\n", arg2); }'
Attaching 1 probe...
open flags: 557056
open flags: 32768
open flags: 32768
open flags: 32768
[...]
# ./bpftrace -e 'kretprobe:do_sys_open { printf("returned: %d\n", retval); }'
Attaching 1 probe...
returned: 8
returned: 21
returned: -2
returned: 21
[...]
Dynamic tracing, user-level (uprobes)
DONE
# ./bpftrace -e 'uretprobe:/bin/bash:readline { printf("read a line\n"); }'
Attaching 1 probe...
read a line
read a line
read a line
read a line
^C
Dynamic tracing, user-level arguments (uprobes)
DONE
# ./bpftrace -e 'uprobe:/bin/bash:readline { printf("arg0: %d\n", arg0); }'
Attaching 1 probe...
arg0: 19755784
arg0: 19755016
arg0: 19755784
^C
# ./bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc-2.23.so:fopen { printf("fopen: %s\n", str(arg0)); }'
Attaching 1 probe...
fopen: /proc/filesystems
fopen: /usr/share/locale/locale.alias
fopen: /proc/self/mountinfo
^C
# ./bpftrace -e 'uretprobe:/bin/bash:readline { printf("readline: \"%s\"\n", str(retval)); }'
Attaching 1 probe...
readline: "echo hi"
readline: "ls -l"
readline: "date"
readline: "uname -r"
^C
Static tracing, kernel-level (tracepoints)
DONE
# ./bpftrace -e 'tracepoint:block:block_rq_insert { printf("block I/O created by %d\n", tid); }'
Attaching 1 probe...
block I/O created by 28922
block I/O created by 3949
block I/O created by 883
block I/O created by 28941
block I/O created by 28941
block I/O created by 28941
[...]
Static tracing, kernel-level arguments (tracepoints)
TODO: see issue #26
Something like this should work (inspired by bcc's trace):
bpftrace -e 'tracepoint:block:block_rq_insert { printf("sectors: %d\n", args->nr_sector); }'
Timed sampling events (BPF with perf_event_open)
DONE
# ./bpftrace -e 'profile:hz:99 { @[tid] = count(); }'
Attaching 1 probe...
^C
@[32586]: 98
@[0]: 579
PMC events (BPF with perf_event_open)
TODO: see issue #27
This is performance monitoring counter events. bcc has an example tool: llcstat. An interface might look like this:
bpftrace -e 'perf:cpu-cycles:10000000 { @[tid] = count(); }'
That would fire once for every 10000000 cpu-cycles.
Filtering (via BPF programs)
DONE
# ./bpftrace -e 'kprobe:sys_read /arg2 < 16/ { printf("small read: %d byte buffer\n", arg2); }'
Attaching 1 probe...
small read: 8 byte buffer
small read: 8 byte buffer
small read: 8 byte buffer
small read: 8 byte buffer
small read: 8 byte buffer
small read: 12 byte buffer
^C
# ./bpftrace -e 'kprobe:sys_read /comm == "bash"/ { printf("read by %s\n", comm); }'
Attaching 1 probe...
read by bash
read by bash
read by bash
read by bash
^C
Per-event output
DONE
# ./bpftrace -e 'kprobe:sys_nanosleep { printf("sleep by %d\n", tid); }'
Attaching 1 probe...
sleep by 3669
sleep by 1396
sleep by 3669
sleep by 1396
[...]
Basic variables (global & per-thread variables via BPF maps, and scratch variables)
DONE
global (@start):
# ./bpftrace -e 'BEGIN { @start = nsecs; }
kprobe:sys_nanosleep /@start != 0/ { printf("at %d ms: sleep\n", (nsecs - @start) / 1000000); }'
Attaching 2 probes...
at 437 ms: sleep
at 647 ms: sleep
at 1098 ms: sleep
at 1438 ms: sleep
at 1648 ms: sleep
^C
@start: 4064438886907216
per-thread (@start[tid]):
# ./bpftrace -e 'kprobe:sys_nanosleep { @start[tid] = nsecs; }
kretprobe:sys_nanosleep /@start[tid] != 0/ { printf("slept for %d ms\n", (nsecs - @start[tid]) / 1000000); @start[tid] = delete(); }'
Attaching 2 probes...
slept for 1000 ms
slept for 1000 ms
slept for 1000 ms
slept for 1009 ms
slept for 2002 ms
[...]
scratch ($delta):
# ./bpftrace -e 'kprobe:sys_nanosleep { @start[tid] = nsecs; }
kretprobe:sys_nanosleep /@start[tid] != 0/ { $delta = nsecs - @start[tid]; printf("slept for %d ms\n", $delta / 1000000); @start[tid] = delete(); }'
Attaching 2 probes...
slept for 1000 ms
slept for 1000 ms
slept for 1000 ms
Associative arrays (via BPF maps)
DONE
The per-thread @start[tid] is an example of this:
# ./bpftrace -e 'kprobe:sys_nanosleep { @start[tid] = nsecs; }
kretprobe:sys_nanosleep /@start[tid] != 0/ { printf("slept for %d ms\n", (nsecs - @start[tid]) / 1000000); @start[tid] = delete(); }'
Attaching 2 probes...
slept for 1000 ms
slept for 1000 ms
slept for 1000 ms
[...]
Frequency counting (via BPF maps)
DONE
# ./bpftrace -e 'kprobe:do_sys_open { @who[comm] = count(); }'
Attaching 1 probe...
^C
@who[sleep]: 3
@who[snmp-pass]: 8
@who[ls]: 9
@who[snmpd]: 29
Histograms (power-of-2 and linear, via BPF maps)
DONE
power-of-2:
# ./bpftrace -e 'kretprobe:sys_read { @bytes = quantize(retval); }'
Attaching 1 probe...
^C
@bytes:
[0, 1] 7 |@@@@@@@@@@@@@ |
[2, 4) 3 |@@@@@ |
[4, 8) 8 |@@@@@@@@@@@@@@ |
[8, 16) 9 |@@@@@@@@@@@@@@@@ |
[16, 32) 0 | |
[32, 64) 1 |@ |
[64, 128) 1 |@ |
[128, 256) 0 | |
[256, 512) 3 |@@@@@ |
[512, 1k) 0 | |
[1k, 2k) 12 |@@@@@@@@@@@@@@@@@@@@@@ |
[2k, 4k) 28 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
power-of-2 by key:
# ./bpftrace -e 'kretprobe:do_sys_open { @bytes[comm] = quantize(retval); }'
Attaching 1 probe...
^C
@bytes[snmp-pass]:
[0, 1] 0 | |
[2, 4) 0 | |
[4, 8) 6 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
@bytes[ls]:
[0, 1] 0 | |
[2, 4) 9 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
@bytes[snmpd]:
[0, 1] 1 |@@@@ |
[2, 4) 0 | |
[4, 8) 0 | |
[8, 16) 4 |@@@@@@@@@@@@@@@@@@ |
[16, 32) 11 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
TODO: linear, see issue #28
Timestamps and time deltas (bpf_ktime_get_ns(), and BPF programs)
DONE
# ./bpftrace -e 'BEGIN { @start = nsecs; }
kprobe:sys_nanosleep /@start != 0/ { printf("at %d ms: sleep\n", (nsecs - @start) / 1000000); }'
Attaching 2 probes...
at 437 ms: sleep
at 647 ms: sleep
at 1098 ms: sleep
at 1438 ms: sleep
^C
Stack traces, kernel (BPF stackmap)
DONE
# ./bpftrace -e 'kprobe:ip_output { @[stack] = count(); }'
Attaching 1 probe...
[...]
@[
ip_output+1
tcp_transmit_skb+1308
tcp_write_xmit+482
tcp_release_cb+225
release_sock+64
tcp_sendmsg+49
sock_sendmsg+48
sock_write_iter+135
__vfs_write+247
vfs_write+179
sys_write+82
entry_SYSCALL_64_fastpath+30
]: 1708
@[
ip_output+1
tcp_transmit_skb+1308
tcp_write_xmit+482
__tcp_push_pending_frames+45
tcp_sendmsg_locked+2637
tcp_sendmsg+39
sock_sendmsg+48
sock_write_iter+135
__vfs_write+247
vfs_write+179
sys_write+82
entry_SYSCALL_64_fastpath+30
]: 9048
@[
ip_output+1
tcp_transmit_skb+1308
tcp_write_xmit+482
tcp_tasklet_func+348
tasklet_action+241
__do_softirq+239
irq_exit+174
do_IRQ+74
ret_from_intr+0
cpuidle_enter_state+159
do_idle+389
cpu_startup_entry+111
start_secondary+398
secondary_startup_64+165
]: 11430
Stack traces, user (BPF stackmap)
TODO: lacks symbol translation. See #23.
# ./bpftrace -e 'kprobe:do_sys_open { @[ustack] = count(); }'
Attaching 1 probe...
^C
@[
0x7f572d91b010
0x20555043206c6c61
]: 1
@[
0x7f572d91b010
0x70692f74656e2f73
]: 1
@[
0x7f572d91b010
0x7f572dfc1282
0x6c2f74656e2f7373
]: 1
@[
0x7fd85a3dac60
0x163de30
0x1612140
0x1805440300000001
]: 1
[...]
Overwrite ring buffers (perf_event_attr.write_backward)
TODO
Front-end features
Static tracing, user-level (USDT probes via uprobes)
TODO: see issue #29
Something like this should be possible:
bpftrace -e 'usdt:pthread:pthread_create { printf("created thread\n"); }'
Per-event output (printf())
DONE
# bpftrace -e 'kprobe:sys_nanosleep { printf("sleep by %d\n", tid); }'
Attaching 1 probe...
sleep by 1396
sleep by 3669
sleep by 1396
^C
Interval output
DONE
# ./bpftrace -e 'kprobe:do_sys_open { @opens = @opens + 1; } profile:s:1 /cpu == 0/ { printf("opens/sec: %d\n", @opens); @opens = 0; }'
Attaching 2 probes...
opens/sec: 16
opens/sec: 2
opens/sec: 3
opens/sec: 15
opens/sec: 8
opens/sec: 2
^C
@opens: 2
An possible future enhancement would be to add a separate provider for this, based on profile, rather then needing the /cpu == 0/ test.
Histogram printing (quantize())
DONE
# ./bpftrace -e 'kretprobe:sys_read { @bytes = quantize(retval); }'
Attaching 1 probe...
^C
@bytes:
[0, 1] 7 |@@@@@@@@@@@@@ |
[2, 4) 3 |@@@@@ |
[4, 8) 8 |@@@@@@@@@@@@@@ |
[8, 16) 9 |@@@@@@@@@@@@@@@@ |
[16, 32) 0 | |
[32, 64) 1 |@ |
[64, 128) 1 |@ |
[128, 256) 0 | |
[256, 512) 3 |@@@@@ |
[512, 1k) 0 | |
[1k, 2k) 12 |@@@@@@@@@@@@@@@@@@@@@@ |
[2k, 4k) 28 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
C struct navigation, kernel-level (rewriter maps to bpf_probe_read()s)
TODO: see issue #30
Something like this should work:
bpftrace -e 'kprobe:sys_nanosleep { printf("secs: %d\n", arg0->tv_nsec); }
or
bpftrace -e 'kprobe:sys_nanosleep { printf("secs: %d\n", ((struct timespec *)arg0)->tv_nsec); }'
Symbol resolution, kernel-level
TODO: see issue #32
Support exists for stack traces. This feature would be a way to resolve a single kernel address to a symbol string.
Symbol resolution, user-level
TODO: see issue #33
Yeah this looks good to me!