vere MDB_NOTFOUND: No matching key/data pair found after full disk

Just found out my VPS that hosts 3 urbit ships had run out of disk space, so those ships had crashed. I ran chop on 2 of them and they started working fine. My main (~datryn-ribdun) was giving me a "loom corrupt" error that mentioned "north" (the exact error was lost after some VPS reboots), but I remembered that deleting <pier>/.urb/chk triggers a replay of all events, which resolves snapshot corruption issues. Now when I run datryn-ribdun/.run --loom 32 (my usual command) I get the typical start lines:

urbit 2.12
boot: home is /home/urbit/urbit-ships/datryn-ribdun
loom: mapped 2048MB
lite: arvo formula 2a2274c9
lite: core 4bb376f0
lite: final state 4bb376f0
loom: mapped 4096MB
boot: protected loom
live: logical boot
boot: installed 661 jets
------------------playback starting ----------------------
pier: replaying events 1-2907645618
lmdb: read: initial cursor_get failed at 1: MDB_NOTFOUND: No matching key/data pair found
pier: disk read bail

I also tried <pier>/.run vile and got "vile: unable to extract key file". I was pretty confident that deleting <pier>/.urb/chk was safe, but now I'm worried I somehow deleted some key file. <pier>/.urb/log is still 37G, so I believe I still have my event history. Any ideas how to proceed? I'd hate to have to breach my main.

Vere 2.12

Jan 21 '24 15:01 datryn-ribdun

Issue is also present for roll on the develop branch. AFAICT it happens any time the checkpoint is deleted on a pier that has been rolled or (apparently) chopped. ~Nothing to do with the full disk.~

Jan 21 '24 16:01 mrdomino

May not be that cut-and-dry. I should say instead: I have experienced the MDB_NOTFOUND error as well on piers that have been rolled on 3.0 prerelease.

My testing so far (IIRC - this was yesterday or so) has revealed, all on 3.0 prerelease:

Delete chk, no roll: pier replays events successfully
Delete chk, roll: MDB_NOTFOUND
Don't delete chk, roll: no errors

Jan 21 '24 16:01 mrdomino

^^ Seems like a related issue, but I never ran roll and never even had a successful chop because it was complaining about loom being corrupted.

Jan 21 '24 16:01 datryn-ribdun

The roll issue seems easily resolvable; just a matter of the correct checkpoint not being copied in. Manually copying in the north.bin and south.bin from the checkpoint fixes it.

Are there any contents under .urb/chk on your pier? In the error state, I had a north.bin and south.bin that were both size 0.
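For anyone wanting to check for the same error state, here is a minimal sketch that reports missing or zero-byte snapshot files. It assumes the layout mentioned in this thread (<pier>/.urb/chk/north.bin and south.bin); check_chk is a hypothetical helper name, not a vere command.

```shell
# Sketch: report whether a pier's checkpoint directory holds non-empty
# snapshot files. Assumes <pier>/.urb/chk/{north,south}.bin as described
# in this thread. check_chk is a hypothetical helper, not part of vere.
check_chk() {
  dir="$1/.urb/chk"
  for f in north.bin south.bin; do
    if [ ! -e "$dir/$f" ]; then
      echo "missing: $dir/$f"
      return 1
    fi
    # -s is true only when the file exists and is larger than 0 bytes
    if [ ! -s "$dir/$f" ]; then
      echo "empty: $dir/$f"
      return 1
    fi
  done
  echo "ok: $dir"
}
```

Usage would be something like check_chk /path/to/pier. A non-empty file is not proof that the snapshot is valid; this only catches the 0-byte case described above.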

Jan 22 '24 01:01 mrdomino

Yup, I see:

-rw-rw-r-- 1 urbit urbit    0 Jan 15 22:38 north.bin
-rw-rw-r-- 1 urbit urbit    0 Jan 15 22:38 south.bin

urbit is my user on this vm.

Jan 27 '24 03:01 datryn-ribdun

I just tried rm -r .urb/chk followed by ./.run play and got the following:

loom: mapped 2048MB
boot: protected loom
live: logical boot
boot: installed 661 jets
lmdb: read: initial cursor_get failed at 1: MDB_NOTFOUND: No matching key/data pair found
boot: read failed
mars: boot fail

Jan 27 '24 03:01 datryn-ribdun

You don't have any other good checkpoints, e.g. under bhk?

Jan 27 '24 14:01 mrdomino

I had no idea bhk was a backup that could be swapped in for chk. Tried a cp bhk/* chk/ and started the ship. It's been replaying for a few hours, so hopefully this will work.
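A slightly more cautious version of that copy, as a sketch only. It assumes the ship is not running and that .urb/bhk holds non-empty north.bin and south.bin files; restore_chk_from_bhk and the chk.bad.* directory name are made up for illustration. Instead of overwriting chk, it moves the current contents aside first.

```shell
# Sketch: swap the backup checkpoint (.urb/bhk) in for .urb/chk.
# Assumes the ship is stopped. restore_chk_from_bhk and the chk.bad.<pid>
# name are hypothetical, not anything vere provides.
restore_chk_from_bhk() {
  urb="$1/.urb"
  for f in north.bin south.bin; do
    # Refuse to continue if the backup itself is missing or 0 bytes
    if [ ! -s "$urb/bhk/$f" ]; then
      echo "no usable backup: $urb/bhk/$f" >&2
      return 1
    fi
  done
  # Keep the current chk contents around rather than overwriting them
  if [ -d "$urb/chk" ]; then
    mv "$urb/chk" "$urb/chk.bad.$$"
  fi
  mkdir -p "$urb/chk"
  cp "$urb/bhk/"* "$urb/chk/"
}
```

Run as restore_chk_from_bhk /path/to/pier, then start the ship as usual. Whether the backup checkpoint is good enough for replay to finish depends on the pier.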

Assuming this fixes things, there's probably 2 things that could be improved with vere:

If there is no .urb/chk/ directory, why does vere make one, then create 0-byte north.bin and south.bin files, then complain that "No matching key/data pair found"? Seems like before this point there should be a failure along the lines of "No bin files found, did you delete chk/? Try moving the .bin files from .urb/bhk into .urb/chk." IDK on wording, but some way to not scare the user into thinking their ship is perma-broken.
Not filling the disk to 0 bytes remaining. Once the disk is full it's a pain to have to find something to delete, then chop, then boot the ship to make sure things work, then delete the chop backup. I might be overfitting and thinking this is a more general problem than it actually is, but anyone who runs on a cheap VPS probably has <100GB of disk, and a well-used ship can easily pass that if you're not regularly chopping.
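Until something like that exists in vere, a user-side guard is possible. The sketch below is an assumption-laden workaround, not vere behavior: free_kb and enough_space are hypothetical helpers, and the threshold is arbitrary. It reads free space from POSIX df output (it assumes the filesystem name contains no spaces) and refuses to proceed below a minimum.

```shell
# Sketch: check free disk space before starting a ship.
# free_kb and enough_space are hypothetical helpers, not vere features.

# Print the available space, in KB, on the filesystem holding path $1.
# With -P, df prints one header line and one data line; field 4 is "Available".
free_kb() {
  df -Pk "$1" | awk 'NR==2 { print $4 }'
}

# Succeed only if at least $2 KB are free on the filesystem holding $1.
enough_space() {
  avail=$(free_kb "$1")
  if [ "$avail" -lt "$2" ]; then
    echo "only ${avail}KB free on $1, want at least $2KB" >&2
    return 1
  fi
}
```

For example, enough_space /path/to/pier 5242880 && /path/to/pier/.run --loom 32 would only start the ship when roughly 5GB are free. This checks once at startup and does nothing about the disk filling up while the ship runs.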

Jan 28 '24 04:01 datryn-ribdun

After many hours of

pier: ($event_number): play: done
pier: ($event_number+1): play: done

my terminal was spammed with

recover: top: meme

recover: top: meme

recover: top: meme
.....
....
loom: external fault: 0x50

Jan 28 '24 16:01 datryn-ribdun

Trying again with ./.run play --loom 32 and killing all other RAM heavy processes running on this VPS.

Jan 28 '24 21:01 datryn-ribdun

Tried with the above command, and even ./.run --loom 33 thinking that maybe adding some loom headroom would help, but every time I hit the same issue:

recover: top: meme
loom: external fault: 0x50 (0x20000000 : 0x280000000)

Assertion '0' failed in pkg/noun/manage.c:1791
home:bailing out
Aborted
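On the loom sizes being tried here: the startup log earlier in the thread pairs --loom 32 with "loom: mapped 4096MB", which is consistent with the flag being the base-2 exponent of the loom size in bytes. A small sketch of that arithmetic, where loom_mb is a hypothetical helper name:

```shell
# Sketch: convert a --loom value (assumed to be log2 of the loom size in
# bytes) to megabytes. loom_mb is a hypothetical helper, not part of vere.
loom_mb() {
  # 2^N bytes = 2^(N-20) MB
  echo $(( 1 << ($1 - 20) ))
}
```

Under that reading, loom_mb 32 gives 4096 and loom_mb 33 gives 8192, so each step up doubles the mapped size, which may be more than a small VPS has in physical RAM.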

Jan 29 '24 04:01 datryn-ribdun

Seems related: https://github.com/urbit/urbit/issues/6989

May 13 '24 21:05 Tenari