
`ames: bogus lane` + crashing on assert in `_ce_image_blit`

nimpun-doller opened this issue (open, 2 comments)

Observed on Linux AArch64 (Arch Linux ARM, kernel 5.19.8-1-aarch64-ARCH), vere release 1.20 (also occurs on 1.18).

My ship is bailing shortly after boot, typically 1-3 minutes after starting. Over multiple runs its output has printed `ames: bogus lane` right before aborting every time, so I cranked up ames logging a bit with %odd %for %rot %rcv %msg.

It doesn't appear to crash immediately upon any specific ames activity, but always a few seconds after the last debug print (usually a `hear`, sometimes a `trying route`, with similar output before that). I've snipped most of it from the log below for brevity:

$ ./urbit -t -p 54321 -b 127.0.0.1 --http-port 8080 --loom 32 nimpun-doller
~
urbit 1.20-936f24b394
boot: home is /home/nimpun-doller/nimpun-doller
loom: mapped 2048MB
lite: arvo formula 2a2274c9
lite: core 4bb376f0
lite: final state 4bb376f0
loom: mapped 4096MB
boot: protected loom
live: loaded: GB/2.157.395.968
boot: installed 652 jets
vere: checking version compatibility
%unlagging
ames: live on 54321
conn: listening on /home/nimpun-doller/nimpun-doller/.urb/conn.sock
http: web interface live on http://localhost:8080
http: loopback live on http://localhost:12322
pier (81851936): live
ames: ~wanzod: send-blob: to ~wanzod
ames: ~wanzod: trying route: ~wanzod
[...]
ames: ~disden-talhes: received packet
ames: ~disden-talhes: hear last in-progress [~disden-talhes seq=1 bone=5 fragment-num=0 num-fragments=1 la=0 lh=1]
[approx. 3 seconds later]
ames: bogus lane
lan: 0
home: bailing out
Aborted (core dumped)

GDB backtrace from a debug build (`bazel build --config=dbg :urbit` at tag vere-v1.20, 1.20-936f24b394):

Core was generated by `./urbit serf /home/nimpun-doller/nimpun-doller 0:0:0:0 128 50000 31 0'.
Program terminated with signal SIGABRT, Aborted.
#0  0x000000000070ef98 in __restore_sigs ()
(gdb) bt
#0  0x000000000070ef98 in __restore_sigs ()
#1  0x000000000070eff0 in raise ()
#2  0x0000000000706358 in abort ()
#3  0x00000000007063f0 in __assert_fail ()
#4  0x0000000000440ae4 in _ce_image_blit (img_u=0xaf2e60 <u3e_Pool+32784>, ptr_w=0x280000000, stp_ws=4096) at pkg/noun/events.c:864
#5  0x000000000044153c in u3e_live (nuu_o=1 '\001', dir_c=0xffffddcafe6e "/home/nimpun-doller/nimpun-doller") at pkg/noun/events.c:1174
#6  0x000000000044bd44 in u3m_boot (dir_c=0xffffddcafe6e "/home/nimpun-doller/nimpun-doller", len_i=2147483648) at pkg/noun/manage.c:1857
#7  0x0000000000403a00 in _cw_serf_commence (argc=8, argv=0xffffddcaed58) at pkg/vere/main.c:1044
#8  0x00000000004067f8 in _cw_utils (argc=8, argv=0xffffddcaed58) at pkg/vere/main.c:2049
#9  0x0000000000406874 in main (argc=8, argv=0xffffddcaed58) at pkg/vere/main.c:2070
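
For context on the assert itself, here is a minimal standalone sketch (not vere source; the /dev/zero file, the two-page mapping, and the 16-byte reads are invented for illustration). The "blit read: Bad address" message later in this report suggests the image read failed with EFAULT because its destination lay outside the mapped loom, and the sketch reproduces that failure mode in isolation:

/* Minimal standalone sketch (not vere code): shows how a pread() whose
 * destination buffer lies past the end of the mapped region fails with
 * EFAULT ("Bad address"). This is the kind of read failure that the
 * assert in _ce_image_blit (pkg/noun/events.c:864) turns into an abort. */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
  const size_t page = 4096;

  /* Map two pages, then unmap the second, so the address just past the
   * first page is guaranteed to be unmapped. */
  char *map = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  assert(MAP_FAILED != map);
  assert(0 == munmap(map + page, page));

  int fd = open("/dev/zero", O_RDONLY);   /* stand-in for the snapshot file */
  assert(fd >= 0);

  /* Reading into the mapped page succeeds. */
  ssize_t ret = pread(fd, map, 16, 0);
  printf("in-bounds read:     ret=%zd\n", ret);

  /* Reading into an unmapped destination fails with EFAULT, i.e. the
   * "Bad address" seen when the loom mapping is smaller than the image. */
  ret = pread(fd, map + page, 16, 0);
  printf("out-of-bounds read: ret=%zd (%s)\n", ret, strerror(errno));

  close(fd);
  munmap(map, page);
  return 0;
}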

+vat %base for completeness:

%base
  /sys/kelvin:      [%zuse 415]
  base hash:        0v1j.vgj7i.m56f2.gmo7j.rimse.dtn6n.q63dm.ra989.mnplb.olc74.tin8p
  %cz hash:         0v1j.vgj7i.m56f2.gmo7j.rimse.dtn6n.q63dm.ra989.mnplb.olc74.tin8p
  app status:       running
  force on:         ~
  force off:        ~
  publishing ship:  ~
  updates:          remote
  source ship:      ~wanzod
  source desk:      %kids
  source aeon:      132
  kids desk:        %kids
  pending updates:  ~

nimpun-doller · Feb 15 '23 06:02

Update: after a meld, things seem to be working normally. I had last melded when upgrading to 1.18, so seeing it free this much memory is a bit surprising to me. I'm also not sure whether the behavior below, when meld is run without --loom, is expected:

$ ./urbit meld nimpun-doller # note: no --loom
loom: mapped 2048MB
loom: image (north) blit read: Bad address
Assertion '0' failed in pkg/noun/events.c:864
Assertion failed: 0 (pkg/noun/events.c: _ce_image_blit: 864)
Aborted (core dumped)

$ ./urbit meld --loom 32 nimpun-doller
loom: mapped 4096MB
boot: protected loom
live: loaded: GB/2.157.395.968
boot: installed 652 jets
hash-cons arena:
  root: B/160
  atoms (2309247):
    refs: MB/70.491.560
    data: MB/315.390.063
    dict: MB/174.264.720
  total: MB/560.146.343
  cells (32856917):
    refs: MB/781.763.380
    dict: GB/3.127.053.520
  total: GB/3.908.816.900
total: GB/4.468.963.403

urbit: meld: gained: GB/1.002.031.680
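
The contrast between the two meld runs looks consistent with --loom being a base-2 exponent of the loom size in bytes (an assumption on my part, but it matches the logs: --loom 32 maps 4096MB, the default maps 2048MB, and the backtrace shows len_i=2147483648, i.e. 2^31). The loaded snapshot above is GB/2.157.395.968, just over the 2048MB default, so it would not fit back into the smaller default mapping. A small arithmetic sketch of that reasoning:

/* Arithmetic sketch (assumptions: --loom is the base-2 exponent of the
 * loom size in bytes; the default exponent is 31, i.e. 2048MB, matching
 * "loom: mapped 2048MB" and len_i=2147483648 in the backtrace above). */
#include <stdio.h>

int main(void)
{
  unsigned long long snapshot = 2157395968ULL;  /* "live: loaded" above */

  for (unsigned bits = 31; bits <= 32; bits++) {
    unsigned long long loom = 1ULL << bits;
    printf("--loom %u -> %lluMB: snapshot %s\n",
           bits, loom >> 20,
           snapshot <= loom ? "fits" : "does not fit");
  }
  return 0;
}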

$ ./urbit info nimpun-doller
loom: mapped 4096MB
boot: protected loom
live: loaded: GB/1.155.366.912
boot: installed 652 jets

urbit: nimpun-doller at event 81853190
  disk: live=&, event=81853754

lmdb info:
  map size: 536870912000
  page size: 4096
  max pages: 131072000
  number of pages used: 4200301
  last transaction ID: 81139792
  max readers: 127
  number of readers used: 0
  file size (page): 17204432896
  file size (stat): 17204432896
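
As a quick consistency check on the lmdb numbers (plain arithmetic on the values printed above, not an LMDB API call): the reported file size equals pages used times page size, and the map size equals max pages times page size.

/* Sanity check: the lmdb figures above are internally consistent. */
#include <stdio.h>

int main(void)
{
  unsigned long long page_size  = 4096ULL;
  unsigned long long max_pages  = 131072000ULL;
  unsigned long long used_pages = 4200301ULL;

  printf("map size:  %llu\n", max_pages  * page_size);  /* 536870912000 */
  printf("file size: %llu\n", used_pages * page_size);  /* 17204432896  */
  return 0;
}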

I'm no longer crashing after boot and things feel stable, so my issue seems to be resolved. I'll keep the core dump around in case there's anything in it that'd be useful for tracking this down; otherwise feel free to close this.

nimpun-doller · Feb 15 '23 21:02

This is interesting, thanks for reporting. @belisarius222 @barter-simsum, what could cause so much meldable memory usage in so little time? What could we do to figure out what's causing that? We'd probably need |mass output.

jalehman · Feb 16 '23 12:02