libspl/backtrace: dump registers in libunwind backtraces
Motivation and Context
In postmortem debugging, often all we have is the binary. Having the registers makes it a least possible to follow the disassembly and try to guess how we got here.
Description
libunwind knows how to fish all the registers out of the stack frame, not just IP. Loop over 'em and spit 'em out.
I've made no effort to do the same in the non-libunwind builds, because libc doesn't typically give you the tools to do it well/at all.
I'm not yet fully satisfied with this. I think I can do better by capturing the register state at the point the assert is tripped, but it will be a lot more invasive. I'll keep working on this, but I think this is still very useful right now.
How Has This Been Tested?
Induced a crash, enjoyed the output:
ASSERT at module/zfs/spa.c:5791:spa_open_common()
"oh no" == 0 (0x7fe60100db1b == 0)
PID: 870661 COMM: zdb
TID: 870661 NAME: zdb
Registers:
RAX: 0x0000000000000037 RDX: 0x0000000000000000 RCX: 0x0000000000000000
RBX: 0x0000000000000002 RSI: 0x00007ffe7eac2e60 RDI: 0x00007ffe7eac4770
RBP: 0x00007ffe7eac4f70 RSP: 0x00007ffe7eac46c0 R8: 0x0000000000000000
R9: 0x0000000000000073 R10: 0x00007ffe7eac4d30 R11: 0x0000000000000021
R12: 0x00007ffe7eac46f0 R13: 0x0000000000000000 R14: 0x00007ffe7eac4770
R15: 0x00007ffe7eac4b40 RIP: 0x00007fe600fd8d26
Call trace:
[0x00007fe600fd8c45] libspl_assertf+0x135 (in /home/robn/code/zfs/.libs/libzpool.so.6.0.0 +0x2ccc45)
[0x00007fe600e66bc7] spa_open_common+0x57 (in /home/robn/code/zfs/.libs/libzpool.so.6.0.0 +0x15abc7)
[0x00007fe600e2648f] dsl_pool_hold+0x1f (in /home/robn/code/zfs/.libs/libzpool.so.6.0.0 +0x11a48f)
[0x00007fe600dd93c7] dmu_objset_hold_flags+0x27 (in /home/robn/code/zfs/.libs/libzpool.so.6.0.0 +0xcd3c7)
[0x0000562e1ced46cc] open_objset+0x10c (in /home/robn/code/zfs/.libs/zdb +0x126cc)
[0x0000562e1ceccbfb] main+0x149b (in /home/robn/code/zfs/.libs/zdb +0xabfb)
[0x00007fe60068724a] __libc_start_call_main+0x7a (in /usr/lib/x86_64-linux-gnu/libc.so.6 +0x2724a)
[0x00007fe600687305] __libc_start_main_alias_2+0x85 (in /usr/lib/x86_64-linux-gnu/libc.so.6 +0x27305)
[0x0000562e1cecccf1] _start+0x21 (in /home/robn/code/zfs/.libs/zdb +0xacf1)
I haven't tested on non-amd64 architectures, but the libunwind documentation doesn't say anything about particular facilities not being available. At worst, I'd expect it to show nothing, or maybe only the IP.
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Performance enhancement (non-breaking change which improves efficiency)
- [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
- [ ] Documentation (a change to man pages or other documentation)
Checklist:
- [x] My code follows the OpenZFS code style requirements.
- [ ] I have updated the documentation accordingly.
- [x] I have read the contributing document.
- [ ] I have added tests to cover my changes.
- [ ] I have run the ZFS Test Suite with this change applied.
- [x] All commit messages are properly formatted and contain
Signed-off-by.
FYI, I managed to try this on a RPi4b this morning:
ASSERT at cmd/zdb/zdb.c:693:usage()
"ohno" == 0 (0x5555d71550 == 0)
PID: 62086 COMM: zdb
TID: 62086 NAME: zdb
Registers:
x0: 0x0000007fc44ce5c8 x1: 0x0000007f9c0bfb90 x2: 0x0000007fc44ce510
x3: 0x0000000000000000 x4: 0x0000007f9b5a89f0 x5: 0x0000000000000073
x6: 0x00000000fffffffd x7: 0x00000000ffffffff x8: 0x0000000000000040
x9: 0x0000007fc44cf150 x10: 0x0000000000000000 x11: 0x0000000000000000
x12: 0x000000000000000a x13: 0x00000000010ec6b3 x14: 0x0000000000001c30
x15: 0x00000000010ec6b3 ip0: 0x0000007f9bdff590 ip1: 0x0000007f9bf20040
x18: 0x0000000000001d90 x19: 0x000000000000f286 x20: 0x0000007f9c0753d0
x21: 0x0000000000000002 x22: 0x0000007f9b6d14e0 x23: 0x0000007fc44d16a1
x24: 0x00000000000002b5 x25: 0x0000005555d77490 x26: 0x0000007f9c075000
x27: 0xffffffffffffffff x28: 0x0000000000000000 fp: 0x0000007fc44ce410
lr: 0x0000007f9bdff610 sp: 0x0000007fc44ce410 pc: 0x0000007f9bdff610
???: 0x0000000000000000 ???: 0x0000000000000000 ???: 0x0000000000000000
???: 0x0000000000000000 ???: 0x0000000000000000 ???: 0x0000000000000000
???: 0x0000000000000000 ???: 0x0000000000000000 ???: 0x0000000000000000
???: 0x0000000000000000 ???: 0x0000000000000000 ???: 0x0000000000000000
???: 0x0000000000000000 ???: 0x0000000000000000 ???: 0x0000000000000000
???: 0x0000000000000000 ???: 0x0000000000000000 ???: 0x0000000000000000
???: 0x0000000000000000 ???: 0x0000000000000000 ???: 0x0000000000000000
???: 0x0000000000000000 ???: 0x0000000000000000 ???: 0x0000000000000000
???: 0x0000000000000000 ???: 0x0000000000000000 ???: 0x0000000000000000
???: 0x0000000000000000 ???: 0x0000000000000000 ???: 0x0000000000000000
???: 0x0000000000000000 v0: 0x0000000000000000 v1: 0x0000000000000000
v2: 0x0000000000000000 v3: 0x0000000000000000 v4: 0x0000000000000000
v5: 0x0000000000000000 v6: 0x0000000000000000 v7: 0x0000000000000000
v8: 0x0000000000000000 v9: 0x0000000000000000 v10: 0x0000000000000000
v11: 0x0000000000000000 v12: 0x0000000000000000 v13: 0x0000000000000000
v14: 0x0000000000000000 v15: 0x0000000000000000 v16: 0x0000000000000000
v17: 0x0000000000000000 v18: 0x0000000000000000 v19: 0x0000000000000000
v20: 0x0000000000000000 v21: 0x0000000000000000 v22: 0x0000000000000000
v23: 0x0000000000000000 v24: 0x0000000000000000 v25: 0x0000000000000000
v26: 0x0000000000000000 v27: 0x0000000000000000 v28: 0x0000000000000000
v29: 0x0000000000000000 v30: 0x0000000000000000 v31: 0x0000000000000000
fpsr: 0x0000000000000000 fpcr: 0x0000000000000000
Call trace:
[0x0000007f9bdff4bc] libspl_assertf+0x11c
[0x0000005555d60f4c] usage+0x4c
[0x0000005555d59cbc] main+0x5bc
[0x0000007f9b557740] __libc_init_first+0x80
[0x0000007f9b557818] __libc_start_main+0x98
[0x0000005555d5a9b0] _start+0x30
So I suspect I'm holding something slightly wrong, but also I may not have a chance to get back to this for a few days. It's not the worst thing, so probably still nbd to merge but if I get a chance before you do I'll look into it.
Thanks for checking this elsewhere. We'll just hold of merging this until you have a chance to sort that out.
Alright, sorted: was holding it wrong (libunwind is kinda weird, but also quite powerful).
Turns out register names are just a static lookup table by enum. Register existence is more about whether or not you can actually get the value for it - not every implementation of an architecture has every register. So now we're checking error codes and skipping registers if we couldn't get a value.
But, libunwind doesn't have a hardcoded name for every register it knows about, so for those, we have to show something. They're pretty niche though; the only one I have nearby is the ARM64 "PSTATE" register, which is some synthetic thing based on certain status flags (idk, I didn't look). So while we're not naming it perfectly, we can at least show it.
Put it together:
Registers:
x0: 0x0000007fc35348e8 x1: 0x0000000000025b4d x2: 0x0000000000000006
x3: 0x0000007f8a07f040 x4: 0x0000007f8a0c2b90 x5: 0x0000000000000001
x6: 0x0000000000000020 x7: 0x0000007f89520a78 x8: 0x0000000000000083
x9: 0x0000007f89504418 x10: 0x0000000000000000 x11: 0x0000000000000000
x12: 0x000000000000000a x13: 0x00000000010ec6b3 x14: 0x0000000000000008
x15: 0x00000000010ec6b3 ip0: 0x0000007f89dff590 ip1: 0x0000005555ae0af8
x18: 0x0000000000001d90 x19: 0x0000007fc3534830 x20: 0x0000000000000002
x21: 0x0000000000000006 x22: 0x0000007f896d14e0 x23: 0x0000007fc35396a1
x24: 0x00000000000002b5 x25: 0x0000005555ac7490 x26: 0x0000007f8a075000
x27: 0xffffffffffffffff x28: 0x0000000000000000 fp: 0x0000007fc3534720
lr: 0x0000007f89dff610 sp: 0x0000007fc3534720 pc: 0x0000007f89dff610
?21: 0x000000000000b302
Call trace:
[0x0000007f89dff610] libspl_backtrace+0x80
[0x0000005555ab1468] sig_handler+0x18
[0x0000007f8a0c07b0] ???
[0x0000007f895b09f0] pthread_key_delete+0x1a0
[0x0000007f8956a72c] gsignal+0x1c
[0x0000007f8955747c] abort+0xf0
[0x0000007f89dff4e8] libspl_assertf+0x148
[0x0000005555ab0f4c] usage+0x4c
[0x0000005555aa9cbc] main+0x5bc
[0x0000007f89557740] __libc_init_first+0x80
[0x0000007f89557818] __libc_start_main+0x98
[0x0000005555aaa9b0] _start+0x30
But there's more! The soup of write calls were getting out of hand, so I took the opportunity to make some macros to help with that. And then I figured screw it, I'll document the whole thing because its finicky and weird and rare that you should have to go in there, so more help rather than less is nice.
And that was worth it, because I found half a dozen small bugs and quirks. Nothing dangerous, but there because it was so hard to see!
I now have designs on a more comprehensive crash output, especially for assert/verify which then raises a signal, but I'll do that in a later PR, because it touches more stuff than just this file.
Very nice!