linux-nova
Regarding recovery of NOVA after crash
This issue was raised to check whether NOVA recovers correctly when mounted after a crash. The workload used to check recovery is the following:
- mount nova (with -o init) on pmem0
- emulate a crash
- check whether NOVA recovers correctly
The detailed sequence of steps is as follows:
1. Create an empty NOVA file system on pmem0 (mount -t NOVA -o init /dev/pmem0 /mnt/pmem0)
2. Take a snapshot of pmem0 (which should include the mkfs and mount data)
3. Crash
4. Restore the snapshot on the pmem0 device
5. Mount NOVA (not init, just mount) (mount -t NOVA /dev/pmem0 /mnt/pmem0)
Here, NOVA fails to mount at step 5. This should ideally work, because the snapshot taken at step 2 contains all the data regarding the initialization of NOVA. So in step 5, after the initialization data has been copied back to pmem0, NOVA should see its initialization data and mount the file system.
The error in dmesg is:
[ 1208.605348] nova: nova_get_nvmm_info: dev pmem1, phys_addr 0x48000000, virt_addr ffffc90008000000, size 134217728
[ 1208.615478] nova: measure timing 0, metadata checksum 0, inplace update 0, wprotect 0, data checksum 0, data parity 0, DRAM checksum 0
[ 1208.630455] nova: Start NOVA snapshot cleaner thread.
[ 1208.635824] nova: Running snapshot cleaner thread
[ 1208.643671] nova: NOVA: Failure recovery
[ 1208.649243] nova: Recovered 0 snapshots, latest epoch ID 0
[ 1208.660627] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 1208.666171] IP: nova_traverse_inode_log.isra.10+0x29/0x100
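The "check whether NOVA recovers correctly" step could be automated by scanning a saved kernel log for the oops signature quoted above. This is only a minimal sketch: the function name log_has_null_deref is hypothetical, and it assumes the log was first saved to a file (for example with dmesg > nova.log).

```shell
#!/bin/sh
# Hypothetical helper: exit status 0 if the saved kernel log contains the
# NULL-pointer oops signature quoted in this report, non-zero otherwise.
log_has_null_deref() {
    # $1: path to a file holding dmesg output
    grep -q 'BUG: unable to handle kernel NULL pointer dereference' "$1"
}
```

It only matches this one message, so a clean result does not prove that recovery succeeded, only that this particular oops did not occur.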
NOVA is on kernel version 4.13 and on a pmem device of size 128MB.
Thanks for reporting the issue. I tried to reproduce it but failed.
First, we have updated the master branch to 4.18, so please try the latest master branch.
Second, the steps are not very clear. I am not sure how you emulate a crash. Is there a tool to do that? In NOVA development, I make nova_try_normal_recovery() return false to emulate the crash.
Here are my reproduction steps based on the description:
mount -t NOVA -o init /dev/pmem0 /mnt/ramdisk/
touch /mnt/ramdisk/test1
echo 1 > /proc/fs/NOVA/pmem0/create_snapshot
umount /mnt/ramdisk
mount -t NOVA -o snapshot=0 /dev/pmem0 /mnt/ramdisk/
umount /mnt/ramdisk
mount -t NOVA /dev/pmem0 /mnt/ramdisk/
I don't see a crash. Please specify the exact commands you use to reproduce the issue.
Info in dmesg:
[ 330.976369] nova: nova_get_nvmm_info: dev pmem0, phys_addr 0x100000000, virt_addr 0xffff9b6d40000000, size 3221225472
[ 330.976371] nova: measure timing 0, metadata checksum 0, wprotect 0, data checksum 0, data parity 0, DRAM checksum 0
[ 330.976528] nova: Start NOVA snapshot cleaner thread.
[ 330.976547] nova: NOVA: Failure recovery
[ 330.976552] nova: Running snapshot cleaner thread
[ 330.976688] nova: Restore snapshot epoch ID 0
[ 330.976697] nova: Recovered 1 snapshots, latest epoch ID 0
[ 330.989939] nova: Failure recovery total recovered 2
[ 330.990410] nova: Current epoch id: 0
I also tried to compile CrashMonkey, but it seems it does not work with 4.18 yet. Is there a simple way to emulate the crash and reproduce the bug?
Hi Andiry,
It seems there is a misunderstanding. I'll try to clarify, but my students will be able to provide more detail.
We are not saying a sequence of commands causes a kernel crash. We are saying once NOVA has been mounted, if there is a power loss, it does not seem to recover correctly.
Our sequence of steps to reproduce this (roughly):
- Mount nova (similar to what you are doing)
- Use dd to copy the state of the entire pmem device
- unmount nova
- Copy over the saved pmem-device state onto the pmem device
- Now try to mount nova over the pmem device
What this emulates is that there is a power-loss crash after NOVA was mounted, and hence it didn't have a chance to cleanly unmount. From this state, it seems like NOVA isn't able to recover correctly.
To reproduce the reported bug, you don't need CrashMonkey at all.
Hope this helps!
Thanks, Vijay, for the clarification. I have never tried using dd to emulate power loss before; I will try to reproduce with your steps. I think I got confused because when Rohan mentioned "taking a snapshot" he meant using dd, but I was thinking of the snapshot support in NOVA.
Yes, I realized it was ambiguous when I saw your response! Apologies for the delay -- everyone is traveling for winter break, or someone in my group would have responded sooner.
Let us know if you run into any problems reproducing it! I see the last commit to master is in Oct; we used the master branch, so our experiments should be reproducible on master.
Hi, Vijay’s description should help you reproduce the issue. To add to it, we had some issues running kernel 4.18 (the master branch): the pmem devices were not recognized on reboot, so we switched back to the earlier 4.13 kernel. Is there anything else we need to enable in menuconfig during compilation, in addition to CONFIG_X86_PMEM_LEGACY, CONFIG_FS_NOVA and all subitems under Device Drivers > NVDIMM?
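One way to rule out a configuration difference is to check a kernel .config for the options named above. The sketch below is an assumption-laden helper, not part of NOVA: the function name missing_config_options is hypothetical, and the option names are simply the ones mentioned in this thread, passed in as arguments.

```shell
#!/bin/sh
# Hypothetical helper: print every option from the argument list that is not
# set to "y" or "m" in the given kernel .config; print nothing if all are set.
missing_config_options() {
    config_file=$1
    shift
    for opt in "$@"; do
        # An enabled option appears as a line of the form NAME=y or NAME=m.
        grep -Eq "^${opt}=(y|m)\$" "$config_file" || printf '%s\n' "$opt"
    done
}
```

For example, missing_config_options .config CONFIG_X86_PMEM_LEGACY CONFIG_FS_NOVA lists whichever of the two options is disabled or absent.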
I tried on 4.18 but failed to reproduce:
mount -t NOVA -o init /dev/pmem0 /mnt
dd if=/dev/pmem0 of=pmem0ss bs=1M
umount /mnt
dd if=pmem0ss of=/dev/pmem0 bs=1M
mount -t NOVA /dev/pmem0 /mnt
My colleague Juno tried on 4.13 and failed to reproduce as well. Is it 100% reproducible? Do I need to perform some file operations before running dd?
@jayashreemohan29 I have attached my 4.18 config. Please remove the .txt suffix.
The pmem problem has plagued me on and off. I just keep rebooting until they show up.
-steve
Hi Andiry,
The steps that we followed which led us to the recovery problem were:
- mount -t NOVA -o init /dev/pmem0 /mnt
- dd if=/dev/pmem0 of=pmem0ss bs=1M count=128 (Our pmem0 partition size is 128MB)
- umount /mnt
- dd if=/dev/zero of=/dev/pmem0 bs=1M count=128 (The pmem0 device file is completely cleared)
- dd if=pmem0ss of=/dev/pmem0 bs=1M count=128
- mount -t NOVA /dev/pmem0 /mnt
I think you missed step 4. For us, it is 100% reproducible with these steps, on the 4.13 kernel.
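The dd portion of these steps (save, wipe, restore) can be wrapped in small helpers, as in the sketch below. The function names are hypothetical, the device path and size are parameters rather than fixed values, and mounting and unmounting are deliberately left to the caller. The -n option to cmp and bs=1M assume GNU tools.

```shell
#!/bin/sh
# Sketch of the dd-based "power loss" emulation from the steps above.
# Function names are hypothetical; mount/umount are left to the caller.

# Copy the first $3 MiB of device $1 into image file $2.
save_image() {
    dd if="$1" of="$2" bs=1M count="$3" 2>/dev/null
}

# Overwrite the first $2 MiB of device $1 with zeros.
wipe_device() {
    dd if=/dev/zero of="$1" bs=1M count="$2" conv=notrunc 2>/dev/null
}

# Write image file $2 back onto device $1 ($3 MiB), then confirm that the
# device contents match the image byte for byte.
restore_image() {
    dd if="$2" of="$1" bs=1M count="$3" conv=notrunc 2>/dev/null &&
        cmp -s -n "$(( $3 * 1024 * 1024 ))" "$1" "$2"
}
```

On a real setup the sequence would be: mount with -o init, save_image /dev/pmem0 pmem0ss 128, umount, wipe_device /dev/pmem0 128, restore_image /dev/pmem0 pmem0ss 128, then mount again without -o init. The cmp check guards against a partial restore being mistaken for a NOVA recovery failure.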
Could you specify the commit you tested and share your .config file?
The steps still don't reproduce the bug even after adding step 4.
Thanks Rohan. I tried your steps, including clearing the pmem0 device, on both 4.18 and 4.13, but failed to reproduce.
Are you testing on a VM or bare-metal machine? We found some weird issues when running NOVA on VM.
Anyway, can you apply the patch attached, reproduce and post the dmesg? Thanks. test.patch.txt
Andiry, how big of a pmem partition are you using? Perhaps the bug is only exposed with small partitions? I think we are using the same kernel version, and same NOVA version. So I'm trying to narrow down what else could be different.
I think the bug is reproducible for us on both bare-metal and virtual machines, but I'll let @rohankadekodi confirm.
Typically I am using 4GB, but I will try the small partitions.
Hi Andiry,
I just tried the same steps on bare metal and found that the bug is not reproducible there. So this is a problem with NOVA running in a virtual machine. Could you try running the 6 steps mentioned here in a virtual machine with a 128MB pmem0?
I will apply the patch and post the dmesg of NOVA when running in a virtual machine.
Thanks, Rohan
Thanks for confirming Rohan. I will try on VM.
I tried Ubuntu 18.04.1 on VM and still did not reproduce. I tried 4.13 and 4.18.
Hi Andiry, we just figured out that we are using the lightweight Arch Linux distribution on our VM, and so far the issue only shows up there. We will get back to you if we are able to reproduce it on Ubuntu. Thank you for investigating this issue with us; we really appreciate it. And sorry for not having figured out the distribution earlier.
Thanks, Jayashree Mohan
That's OK and thank you for the help. We always welcome people to try NOVA and report issues.
Hi all. I was able to reproduce this in a very straightforward manner from the latest Ubuntu 18.04 install.
- Install Ubuntu Server 18.04 into a VirtualBox VM (QEMU reproduces this problem too, but I wanted to try a different hypervisor)
- Compile linux-nova (current master branch) with this config: config.txt
- Transfer the bzImage into the VM and boot from it with the kernel command line
memmap=128M!1G
- Run the steps from the comment above
- NOVA fails to remount, dmesg log: out.txt
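For context, memmap=128M!1G asks the kernel to treat 128 MiB of RAM starting at the 1 GiB physical address as persistent memory, which gives the 128MB pmem device used in this report. The sketch below only does that arithmetic; the helper names are hypothetical and it handles just the K, M and G suffixes.

```shell
#!/bin/sh
# Convert a size such as 128M or 1G to bytes (K, M, G suffixes only).
to_bytes() {
    num=${1%[KMG]}
    case $1 in
        *K) echo $(( num * 1024 )) ;;
        *M) echo $(( num * 1024 * 1024 )) ;;
        *G) echo $(( num * 1024 * 1024 * 1024 )) ;;
        *)  echo "$num" ;;
    esac
}

# Print "<start> <size>" in bytes for a memmap value of the form SIZE!START.
memmap_region() {
    size=${1%%!*}
    start=${1#*!}
    echo "$(to_bytes "$start") $(to_bytes "$size")"
}
```

Calling memmap_region '128M!1G' prints 1073741824 134217728; the second number is the same size that nova_get_nvmm_info reports in the failing dmesg output earlier in the thread.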