Can't use Vere 3.0 on Raspberry Pi

Open bazfum opened this issue 2 years ago • 8 comments

I have two piers on a Pi 4 4GB, a planet and a comet. The planet had issues with chop previously and would not upgrade until I moved it to my Mac. It ran on the Mac before I moved it back to the Pi. The comet did upgrade fine on the Pi. However, neither will run now on the Pi, with the same error:

urbit 3.0
boot: home is redacted
disk: loaded epoch 0i795477796
loom: mapped 8192MB
boot: protected loom
live: mapped: GB/1.378.074.624
live: loaded: KB/16.384
boot: installed 967 jets
loom: external fault: 0x10 (0x200000000 : 0x400000000)

Assertion '0' failed in pkg/noun/manage.c:1791

bail: oops
home: bailing out
Aborted (core dumped)

bazfum avatar Mar 12 '24 03:03 bazfum

@bazfum this looks like a null pointer dereference somewhere in startup (address 0x10 is a 16-byte offset from NULL). The best way to track this down would be to reproduce it in a debugger. Can you try to start this pier inside of gdb and capture a backtrace?

gdb --args ....normal command to restart urbit....
handle SIGSEGV nostop noprint
b manage.c:1791
continue
... wait for crash ...
bt
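
To make the reading of the fault line concrete, here is a small Python sketch that parses it and checks the numbers. It assumes the parenthesized pair is the lower and upper bound of the loom mapping; that is my reading of the log format rather than something stated in this thread, though the range size does match the "loom: mapped 8192MB" line. The function names are made up for illustration.

```python
import re

# Fault line copied from the boot log in this issue.
FAULT_LINE = "loom: external fault: 0x10 (0x200000000 : 0x400000000)"


def parse_fault(line):
    """Return (address, low, high) as integers from an 'external fault' line."""
    match = re.search(
        r"external fault: (0x[0-9a-fA-F]+) "
        r"\((0x[0-9a-fA-F]+) : (0x[0-9a-fA-F]+)\)",
        line,
    )
    if match is None:
        raise ValueError("not an external fault line: %r" % line)
    return tuple(int(group, 16) for group in match.groups())


def describe_fault(line):
    """Summarize the fault: where it is relative to the (assumed) loom range."""
    addr, low, high = parse_fault(line)
    return {
        "address": addr,                        # faulting address in bytes
        "inside_range": low <= addr < high,     # assumption: pair = loom bounds
        "range_mib": (high - low) // (1024 * 1024),
    }


info = describe_fault(FAULT_LINE)
print(info)
```

The address comes out as 16 bytes, far below the start of the range, and the range itself spans 8192 MiB. That is consistent with a small field offset applied to a NULL pointer rather than a stray access near the loom.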

joemfb avatar Mar 12 '24 13:03 joemfb

I wasn't able to replicate on a RPi 4 8GB. Chop and upgrade worked fine.

midden-fabler avatar Mar 12 '24 14:03 midden-fabler

This is on 64-bit Bullseye if that makes a difference.

When I ran it in GDB, I get this:

Program received signal SIGILL, Illegal instruction.
0x0000000000608a88 in _armv8_pmull_probe ()

The backtrace gives:

#0 0x0000000000608a88 in _armv8_pmull_probe ()
#1 0x00000000004039ac in OPENSSL_cpuid_setup ()
#2 0x0000000000749084 in __libc_start_init ()
#3 0x00000000007490ac in libc_start_main_stage2 ()

It then repeats that same line #3 until I kill it.

bazfum avatar Mar 12 '24 19:03 bazfum

Apparently SIGILL is normal during openssl setup on arm, see https://stackoverflow.com/questions/25708907/ssl-library-init-cause-sigill-when-running-under-gdb.

Can you try again, first setting handle SIGILL nostop noprint to let the library generate and catch that exception?

joemfb avatar Mar 12 '24 19:03 joemfb

loom: external fault: 0x10 (0x200000000 : 0x400000000)

Breakpoint 1, u3m_fault (ser_i=, adr_v=) at pkg/noun/manage.c:1791
1791 pkg/noun/manage.c: No such file or directory.

(gdb) bt
#0 u3m_fault (ser_i=, adr_v=) at pkg/noun/manage.c:1791
#1 u3m_fault (adr_v=, ser_i=) at pkg/noun/manage.c:1776
#2 0x0000000000748530 in sigsegv_handler ()
#3
#4 0x000000000074ae70 in get_meta ()
#5 0x000000000074b27c in __libc_free ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

bazfum avatar Mar 12 '24 19:03 bazfum

@bazfum sorry for the delay. I'm not sure what to make of this trace. It looks like it might be a double free, or trying to free a pointer into the stack. But it also might just be arbitrary heap corruption -- all bets are off. We have not been able to reproduce this crash.

I think this will require interactive debugging. I'd be happy to join a video call next week and try to find the root cause. We can coordinate on urbit (I'm ~master-morzod) or over email (joe at urbit.org). Alternatively, I could send you a binary with lots of extra printfs during initialization, which might help narrow it down.

joemfb avatar Mar 15 '24 19:03 joemfb

I replied via email. Let me know if you didn't get it.

bazfum avatar Mar 16 '24 04:03 bazfum

FWIW I ended up grabbing a used mini PC and moving my pier over, and everything is happy on the new system. I'd been debating moving off the Pi before all this, so no worries if it's not worth anyone's time to troubleshoot on the Pi.

bazfum avatar Mar 19 '24 05:03 bazfum