MDB_NOTFOUND: No matching key/data pair found after full disk
Just found out that the VPS hosting 3 of my urbit ships had run out of disk space, so those ships had crashed. I ran chop on 2 of them and they started working fine. My main (~datryn-ribdun) was giving me a "loom corrupt" error that mentioned "north" (the exact error is lost after some VPS reboots), but I remembered that deleting <pier>/.urb/chk was supposed to be safe, so I did that. Now when I run datryn-ribdun/.run --loom 32 (my usual command) I get the typical start lines:
urbit 2.12
boot: home is /home/urbit/urbit-ships/datryn-ribdun
loom: mapped 2048MB
lite: arvo formula 2a2274c9
lite: core 4bb376f0
lite: final state 4bb376f0
loom: mapped 4096MB
boot: protected loom
live: logical boot
boot: installed 661 jets
------------------playback starting ----------------------
pier: replaying events 1-2907645618
lmdb: read: initial cursor_get failed at 1: MDB_NOTFOUND: No matching key/data pair found
pier: disk read bail
I also tried <pier>/.run vile and got vile: unable to extract key file
I was pretty confident that deleting <pier>/.urb/chk was safe, but now I'm worried I somehow deleted some key file. <pier>/.urb/log is still 37G, so I believe I still have my event history. Any ideas how to proceed? I'd hate to have to breach my main.
Vere 2.12
Issue is also present for roll on the develop branch. AFAICT it happens any time the checkpoint is deleted on a pier that has been rolled or (apparently) chopped. ~Nothing to do with the full disk.~
May not be that cut-and-dry. I should say instead: I have experienced the MDB_NOTFOUND error as well on piers that have been rolled on 3.0 prerelease.
My testing so far (IIRC - this was yesterday or so) has revealed, all on 3.0 prerelease:
- Delete chk, no roll: pier replays events successfully
- Delete chk, roll: MDB_NOTFOUND
- Don't delete chk, roll: no errors
^^ Seems like a related issue, but I never ran roll and never even had a successful chop because it was complaining about loom being corrupted.
The roll issue seems easily resolvable; just a matter of the correct checkpoint not being copied in. Manually copying in the north.bin and south.bin from the checkpoint fixes it.
Are there any contents under .urb/chk on your pier? In the error state, I had a north.bin and south.bin that were both size 0.
Yup I see,
-rw-rw-r-- 1 urbit urbit 0 Jan 15 22:38 north.bin
-rw-rw-r-- 1 urbit urbit 0 Jan 15 22:38 south.bin
urbit is my user on this vm.
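The zero-size check described above can be scripted. This is only a rough sketch: the function name chk_is_empty is made up for illustration, and it assumes the checkpoint images are named north.bin and south.bin under <pier>/.urb/chk as in this thread, which may differ on other vere versions.

```shell
# Hypothetical helper (not part of vere): exits 0 if either checkpoint
# image under <pier>/.urb/chk is missing or zero bytes, 1 otherwise.
# Assumes the north.bin / south.bin file names seen in this thread.
chk_is_empty() {
  pier="$1"
  for f in north.bin south.bin; do
    # [ -s FILE ] is true only when FILE exists and has size > 0
    if [ ! -s "$pier/.urb/chk/$f" ]; then
      return 0
    fi
  done
  return 1
}
```

Something like chk_is_empty /home/urbit/urbit-ships/datryn-ribdun && echo "chk looks empty" would flag the error state before attempting a boot.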
I just tried rm -r .urb/chk followed by ./.run play and get the following
loom: mapped 2048MB
boot: protected loom
live: logical boot
boot: installed 661 jets
lmdb: read: initial cursor_get failed at 1: MDB_NOTFOUND: No matching key/data pair found
boot: read failed
mars: boot fail
You don't have any other good checkpoints, e.g. under bhk?
I had no idea bhk was a backup that could be swapped in for chk.
Tried a cp bhk/* chk/ and started the ship. It's been replaying for a few hours, so hopefully this will work.
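For anyone repeating the bhk-to-chk copy, a slightly more cautious version might look like the sketch below. The function name restore_chk_from_bhk is hypothetical, and it assumes the backup holds non-empty north.bin and south.bin files as described in this thread. It moves the existing chk aside rather than overwriting it, so the step can be undone.

```shell
# Hypothetical helper (not part of vere): copy the backup checkpoint from
# <pier>/.urb/bhk into <pier>/.urb/chk, keeping the old chk aside.
# Assumes the north.bin / south.bin file names seen in this thread.
restore_chk_from_bhk() {
  urb="$1/.urb"
  # Refuse to proceed unless both backup images exist and are non-empty.
  for f in north.bin south.bin; do
    if [ ! -s "$urb/bhk/$f" ]; then
      echo "missing or empty: $urb/bhk/$f" >&2
      return 1
    fi
  done
  # Move any existing chk out of the way instead of deleting it.
  if [ -d "$urb/chk" ]; then
    mv "$urb/chk" "$urb/chk.old.$$" || return 1
  fi
  # Same effect as "cp bhk/* chk/", but into a fresh directory.
  mkdir -p "$urb/chk" && cp -p "$urb/bhk/"* "$urb/chk/"
}
```

Run it with the ship stopped, e.g. restore_chk_from_bhk /home/urbit/urbit-ships/datryn-ribdun, then boot as usual.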
Assuming this fixes things, there's probably 2 things that could be improved with vere:
- If there is no .urb/chk/ directory, why does vere make one and then create a 0-byte north.bin and south.bin, then complain that "No matching key/data pair found"? Seems like before this point there should be a failure like "No bin files found, did you delete chk/? Try moving the .bin files from .urb/bhk into .urb/chk". IDK on wording, but some way to not scare the user into thinking their ship is perma-broken.
- Not filling the disk to 0 bytes remaining. Once the disk is full it's a pain to have to find something to delete, then chop, then boot the ship to make sure things work, then delete the backup chop. I might be overfitting and thinking this is a more general problem than it actually is, but anyone who runs on a cheap VPS probably runs on <100GB of disk, and a well-used ship can easily pass that if you're not regularly chopping.
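Until something like the second point exists in vere itself, a wrapper script can refuse to start a ship when the disk is nearly full. The sketch below is a user-side workaround, not anything vere provides; free_kb and enough_space are made-up names, and any threshold passed in is the operator's own choice.

```shell
# Hypothetical helper: print free space in KiB on the filesystem holding
# the given path.
free_kb() {
  # -P forces the portable one-line-per-filesystem format;
  # column 4 of that format is the available space.
  df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Hypothetical helper: succeeds only if at least MIN_KB KiB are free.
# Usage: enough_space PATH MIN_KB
enough_space() {
  avail=$(free_kb "$1") || return 1
  [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}
```

For example, enough_space "$PIER" 5242880 && "$PIER/.run" --loom 32 would only boot with at least 5 GiB free, where 5 GiB is an arbitrary figure and PIER is assumed to hold the pier path.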
After many hours of
pier: ($event_number): play: done
pier: ($event_number+1): play: done
my terminal was spammed with
recover: top: meme
recover: top: meme
recover: top: meme
.....
....
loom: external fault: 0x50
Trying again with ./.run play --loom 32 and killing all other RAM heavy processes running on this VPS.
Tried with above command and even ./.run --loom 33 thinking that maybe adding some loom headroom would help, but every time I hit the same issue of
recover: top: meme
loom: external fault: 0x50 (0x20000000 : 0x280000000)
Assertion '0' failed in pkg/noun/manage.c:1791
home:bailing out
Aborted
seems related https://github.com/urbit/urbit/issues/6989