bpftrace icon indicating copy to clipboard operation
bpftrace copied to clipboard

features by example

Open brendangregg opened this issue 6 years ago • 1 comments

I created this document a while ago when I was evaluating bpftrace, and realized that it may make a useful addition (after cleaning it up a bit more): eg, as docs/feature_examples.txt. Sound ok?

I wrote it to see what was done and not done, and was pleased to find that bpftrace is coming along very well (70% features done, excluding a couple that are minor). I just made a pass of this document and filed tickets for the missing things.


Features by example

Kernel-level features

Dynamic tracing, kernel-level (kprobes)

DONE

# bpftrace -e 'kprobe:sys_nanosleep { printf("sleep by %d\n", tid); }'
Attaching 1 probe...
sleep by 1396
sleep by 3669
sleep by 1396
sleep by 27662
sleep by 3669
^C

Dynamic tracing, kernel-level arguments (kprobes)

DONE

# ./bpftrace -e 'kprobe:do_sys_open { printf("opening: %s\n", str(arg1)); }'
Attaching 1 probe...
opening: /proc/cpuinfo
opening: /proc/stat
opening: /proc/diskstats
opening: /proc/stat
opening: /proc/vmstat
[...]
# ./bpftrace -e 'kprobe:do_sys_open { printf("open flags: %d\n", arg2); }'
Attaching 1 probe...
open flags: 557056
open flags: 32768
open flags: 32768
open flags: 32768
[...]
# ./bpftrace -e 'kretprobe:do_sys_open { printf("returned: %d\n", retval); }'
Attaching 1 probe...
returned: 8
returned: 21
returned: -2
returned: 21
[...]

Dynamic tracing, user-level (uprobes)

DONE

# ./bpftrace -e 'uretprobe:/bin/bash:readline { printf("read a line\n"); }'
Attaching 1 probe...
read a line
read a line
read a line
read a line
^C

Dynamic tracing, user-level arguments (uprobes)

DONE

# ./bpftrace -e 'uprobe:/bin/bash:readline { printf("arg0: %d\n", arg0); }'
Attaching 1 probe...
arg0: 19755784
arg0: 19755016
arg0: 19755784
^C
# ./bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc-2.23.so:fopen { printf("fopen: %s\n", str(arg0)); }'
Attaching 1 probe...
fopen: /proc/filesystems
fopen: /usr/share/locale/locale.alias
fopen: /proc/self/mountinfo
^C
# ./bpftrace -e 'uretprobe:/bin/bash:readline { printf("readline: \"%s\"\n", str(retval)); }'
Attaching 1 probe...
readline: "echo hi"
readline: "ls -l"
readline: "date"
readline: "uname -r"
^C

Static tracing, kernel-level (tracepoints)

DONE

# ./bpftrace -e 'tracepoint:block:block_rq_insert { printf("block I/O created by %d\n", tid); }'
Attaching 1 probe...
block I/O created by 28922
block I/O created by 3949
block I/O created by 883
block I/O created by 28941
block I/O created by 28941
block I/O created by 28941
[...]

Static tracing, kernel-level arguments (tracepoints)

TODO: see issue #26

Something like this should work (inspired by bcc's trace):

bpftrace -e 'tracepoint:block:block_rq_insert { printf("sectors: %d\n", args->nr_sector); }'

Timed sampling events (BPF with perf_event_open)

DONE

# ./bpftrace -e 'profile:hz:99 { @[tid] = count(); }'
Attaching 1 probe...
^C

@[32586]: 98
@[0]: 579

PMC events (BPF with perf_event_open)

TODO: see issue #27

This is performance monitoring counter events. bcc has an example tool: llcstat. An interface might look like this:

bpftrace -e 'perf:cpu-cycles:10000000 { @[tid] = count(); }'

That would fire once for every 10000000 cpu-cycles.

Filtering (via BPF programs)

DONE

# ./bpftrace -e 'kprobe:sys_read /arg2 < 16/ { printf("small read: %d byte buffer\n", arg2); }'
Attaching 1 probe...
small read: 8 byte buffer
small read: 8 byte buffer
small read: 8 byte buffer
small read: 8 byte buffer
small read: 8 byte buffer
small read: 12 byte buffer
^C
# ./bpftrace -e 'kprobe:sys_read /comm == "bash"/ { printf("read by %s\n", comm); }'
Attaching 1 probe...
read by bash
read by bash
read by bash
read by bash
^C

Per-event output

DONE

# ./bpftrace -e 'kprobe:sys_nanosleep { printf("sleep by %d\n", tid); }'
Attaching 1 probe...
sleep by 3669
sleep by 1396
sleep by 3669
sleep by 1396
[...]

Basic variables (global & per-thread variables via BPF maps, and scratch variables)

DONE

global (@start):

# ./bpftrace -e 'BEGIN { @start = nsecs; }
    kprobe:sys_nanosleep /@start != 0/ { printf("at %d ms: sleep\n", (nsecs - @start) / 1000000); }'
Attaching 2 probes...
at 437 ms: sleep
at 647 ms: sleep
at 1098 ms: sleep
at 1438 ms: sleep
at 1648 ms: sleep
^C

@start: 4064438886907216

per-thread (@start[tid]):

# ./bpftrace -e 'kprobe:sys_nanosleep { @start[tid] = nsecs; }
    kretprobe:sys_nanosleep /@start[tid] != 0/ { printf("slept for %d ms\n", (nsecs - @start[tid]) / 1000000); @start[tid] = delete(); }'
Attaching 2 probes...
slept for 1000 ms
slept for 1000 ms
slept for 1000 ms
slept for 1009 ms
slept for 2002 ms
[...]

scratch ($delta):

# ./bpftrace -e 'kprobe:sys_nanosleep { @start[tid] = nsecs; }
    kretprobe:sys_nanosleep /@start[tid] != 0/ { $delta = nsecs - @start[tid]; printf("slept for %d ms\n", $delta / 1000000); @start[tid] = delete(); }'
Attaching 2 probes...
slept for 1000 ms
slept for 1000 ms
slept for 1000 ms

Associative arrays (via BPF maps)

DONE

The per-thread @start[tid] is an example of this:

# ./bpftrace -e 'kprobe:sys_nanosleep { @start[tid] = nsecs; }
    kretprobe:sys_nanosleep /@start[tid] != 0/ { printf("slept for %d ms\n", (nsecs - @start[tid]) / 1000000); @start[tid] = delete(); }'
Attaching 2 probes...
slept for 1000 ms
slept for 1000 ms
slept for 1000 ms
[...]

Frequency counting (via BPF maps)

DONE

# ./bpftrace -e 'kprobe:do_sys_open { @who[comm] = count(); }'
Attaching 1 probe...
^C

@who[sleep]: 3
@who[snmp-pass]: 8
@who[ls]: 9
@who[snmpd]: 29

Histograms (power-of-2 and linear, via BPF maps)

DONE

power-of-2:

# ./bpftrace -e 'kretprobe:sys_read { @bytes = quantize(retval); }'
Attaching 1 probe...
^C

@bytes:
[0, 1]                 7 |@@@@@@@@@@@@@                                       |
[2, 4)                 3 |@@@@@                                               |
[4, 8)                 8 |@@@@@@@@@@@@@@                                      |
[8, 16)                9 |@@@@@@@@@@@@@@@@                                    |
[16, 32)               0 |                                                    |
[32, 64)               1 |@                                                   |
[64, 128)              1 |@                                                   |
[128, 256)             0 |                                                    |
[256, 512)             3 |@@@@@                                               |
[512, 1k)              0 |                                                    |
[1k, 2k)              12 |@@@@@@@@@@@@@@@@@@@@@@                              |
[2k, 4k)              28 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

power-of-2 by key:

# ./bpftrace -e 'kretprobe:do_sys_open { @bytes[comm] = quantize(retval); }'
Attaching 1 probe...
^C

@bytes[snmp-pass]:
[0, 1]                 0 |                                                    |
[2, 4)                 0 |                                                    |
[4, 8)                 6 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@bytes[ls]:
[0, 1]                 0 |                                                    |
[2, 4)                 9 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@bytes[snmpd]:
[0, 1]                 1 |@@@@                                                |
[2, 4)                 0 |                                                    |
[4, 8)                 0 |                                                    |
[8, 16)                4 |@@@@@@@@@@@@@@@@@@                                  |
[16, 32)              11 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

TODO: linear, see issue #28

Timestamps and time deltas (bpf_ktime_get_ns(), and BPF programs)

DONE

# ./bpftrace -e 'BEGIN { @start = nsecs; }
    kprobe:sys_nanosleep /@start != 0/ { printf("at %d ms: sleep\n", (nsecs - @start) / 1000000); }'
Attaching 2 probes...
at 437 ms: sleep
at 647 ms: sleep
at 1098 ms: sleep
at 1438 ms: sleep
^C

Stack traces, kernel (BPF stackmap)

DONE

# ./bpftrace -e 'kprobe:ip_output { @[stack] = count(); }'
Attaching 1 probe...
[...]
@[
ip_output+1
tcp_transmit_skb+1308
tcp_write_xmit+482
tcp_release_cb+225
release_sock+64
tcp_sendmsg+49
sock_sendmsg+48
sock_write_iter+135
__vfs_write+247
vfs_write+179
sys_write+82
entry_SYSCALL_64_fastpath+30
]: 1708
@[
ip_output+1
tcp_transmit_skb+1308
tcp_write_xmit+482
__tcp_push_pending_frames+45
tcp_sendmsg_locked+2637
tcp_sendmsg+39
sock_sendmsg+48
sock_write_iter+135
__vfs_write+247
vfs_write+179
sys_write+82
entry_SYSCALL_64_fastpath+30
]: 9048
@[
ip_output+1
tcp_transmit_skb+1308
tcp_write_xmit+482
tcp_tasklet_func+348
tasklet_action+241
__do_softirq+239
irq_exit+174
do_IRQ+74
ret_from_intr+0
cpuidle_enter_state+159
do_idle+389
cpu_startup_entry+111
start_secondary+398
secondary_startup_64+165
]: 11430

Stack traces, user (BPF stackmap)

TODO: lacks symbol translation. See #23.

# ./bpftrace -e 'kprobe:do_sys_open { @[ustack] = count(); }'
Attaching 1 probe...
^C

@[
0x7f572d91b010
0x20555043206c6c61
]: 1
@[
0x7f572d91b010
0x70692f74656e2f73
]: 1
@[
0x7f572d91b010
0x7f572dfc1282
0x6c2f74656e2f7373
]: 1
@[
0x7fd85a3dac60
0x163de30
0x1612140
0x1805440300000001
]: 1
[...]

Overwrite ring buffers (perf_event_attr.write_backward)

TODO

Front-end features

Static tracing, user-level (USDT probes via uprobes)

TODO: see issue #29

Something like this should be possible:

bpftrace -e 'usdt:pthread:pthread_create { printf("created thread\n"); }'

Per-event output (printf())

DONE

# bpftrace -e 'kprobe:sys_nanosleep { printf("sleep by %d\n", tid); }'
Attaching 1 probe...
sleep by 1396
sleep by 3669
sleep by 1396
^C

Interval output

DONE

# ./bpftrace -e 'kprobe:do_sys_open { @opens = @opens + 1; } profile:s:1 /cpu == 0/ { printf("opens/sec: %d\n", @opens); @opens = 0; }'
Attaching 2 probes...
opens/sec: 16
opens/sec: 2
opens/sec: 3
opens/sec: 15
opens/sec: 8
opens/sec: 2
^C

@opens: 2

An possible future enhancement would be to add a separate provider for this, based on profile, rather then needing the /cpu == 0/ test.

Histogram printing (quantize())

DONE

# ./bpftrace -e 'kretprobe:sys_read { @bytes = quantize(retval); }'
Attaching 1 probe...
^C

@bytes:
[0, 1]                 7 |@@@@@@@@@@@@@                                       |
[2, 4)                 3 |@@@@@                                               |
[4, 8)                 8 |@@@@@@@@@@@@@@                                      |
[8, 16)                9 |@@@@@@@@@@@@@@@@                                    |
[16, 32)               0 |                                                    |
[32, 64)               1 |@                                                   |
[64, 128)              1 |@                                                   |
[128, 256)             0 |                                                    |
[256, 512)             3 |@@@@@                                               |
[512, 1k)              0 |                                                    |
[1k, 2k)              12 |@@@@@@@@@@@@@@@@@@@@@@                              |
[2k, 4k)              28 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

C struct navigation, kernel-level (rewriter maps to bpf_probe_read()s)

TODO: see issue #30

Something like this should work:

bpftrace -e 'kprobe:sys_nanosleep { printf("secs: %d\n", arg0->tv_nsec); }

or

bpftrace -e 'kprobe:sys_nanosleep { printf("secs: %d\n", ((struct timespec *)arg0)->tv_nsec); }'

Symbol resolution, kernel-level

TODO: see issue #32

Support exists for stack traces. This feature would be a way to resolve a single kernel address to a symbol string.

Symbol resolution, user-level

TODO: see issue #33

brendangregg avatar Mar 07 '18 01:03 brendangregg

Yeah this looks good to me!

ajor avatar Mar 07 '18 21:03 ajor