tuxonice-kernel-old
submit_bio+0xf/0x10e: ext4 error at the end of hibernating
This happens with tuxonice at the end of the hibernate operation, but not when using vanilla hibernation.
Thanks Harald.
Would you attach your .config too? I'm not seeing the same thing here so perhaps there's some config option that's making the difference. Does it happen on the first try, and every time?
Sure, here it is:
Note that this also happened with 4.12; the errors were the same. It also doesn't happen on every cycle; most times it works fine.
I still get the identical error message with 4.13.4. Any idea what causes this?
I think this may happen after having mounted (and unmounted) a USB device, but I still have to test this.
But it could also be the squashfs that I have mounted via a loop device in /etc/fstab:
# SquashFS storages
/usr/local/share/icons.squashfs /usr/local/share/icons squashfs auto,ro,loop 0 0
In fact, the loop0 process (PID 1174) showing up in the trace is started by this very mount point. Maybe it takes even more to reproduce this: that loop0 mount point, followed by mounting and unmounting an additional device via loopback.
This is most likely the 3rd of the issues I mentioned in #34, but you were the first one to provide a trace. I only got a hard lockup at the end of suspending to disk. I stumbled upon this when I had a truecrypt volume (file) mounted, which in fact makes use of loop, too. My simple scenario to reproduce:
- on one other partition a testfile, formatted e.g. as ext4 (but FS doesn't seem to matter)
- mount it via loop to e.g. /mnt/test
- hibernate with TOI; result: hang at the end of image writing
Unfortunately, this problem has existed for a long time and I can no longer recall when it first came up.
EDIT: I just retested this with an even simpler scenario: the loop-mounted file doesn't need to reside on a different partition than root. So it's almost the same way of reproducing it as @hjudt's.
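For anyone who wants to script the reproduction steps above, here is a minimal sketch that only builds the command lines and does not run anything. The image path, mount point, and size are hypothetical placeholders, and the hibernate step itself is left out because the way TOI is invoked differs between setups.

```python
import shlex


def repro_commands(image="/tmp/toi-test.img", mountpoint="/mnt/test", size_mb=64):
    """Return the commands for the loop-mount scenario, without running them.

    All three arguments are illustrative defaults, not values from this thread.
    """
    return [
        # Create a plain file that will hold the test filesystem.
        ["dd", "if=/dev/zero", f"of={image}", "bs=1M", f"count={size_mb}"],
        # Format it as ext4; -F lets mkfs work on a regular file.
        ["mkfs.ext4", "-F", image],
        ["mkdir", "-p", mountpoint],
        # Mount it through a loop device.
        ["mount", "-o", "loop", image, mountpoint],
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed and run by hand as root.
    for argv in repro_commands():
        print(shlex.join(argv))
```

After running the printed commands as root, hibernating with TOI should show whether the hang occurs on a given machine.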
Now I see what you mean: I also got a hang without a trace at the end of image writing. Nigel, any idea what could be wrong? Could you fix this please?
Ok. So any loopback mount should make it reproducible? I'll give that a try.
@NigelCunningham: On my system this scenario reproduces the failure 100% of the time. I also hope that you find a fix for it. The written image (although much is written before the hang) isn't recognized as a recoverable TOI image at resume. So I assume there's a little "miracle" missing there at the end.
Unmounting the loopback-ed device before hibernating makes TOI work well on my system.
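As a rough illustration of that workaround, a hibernate script could first list which filesystems are mounted from a loop device. The sketch below parses text in the `/proc/mounts` format; the function name `loop_mounts` and the sample data are made up for illustration.

```python
def loop_mounts(mounts_text):
    """Return (device, mountpoint) pairs for filesystems mounted from /dev/loop*.

    mounts_text is expected to be in /proc/mounts format, one mount per line.
    Mount points containing spaces appear escaped (e.g. \\040) and are
    returned as-is.
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith("/dev/loop"):
            result.append((fields[0], fields[1]))
    return result


if __name__ == "__main__":
    # Sample input modelled on the squashfs mount mentioned earlier in the thread.
    sample = (
        "/dev/sda2 / ext4 rw,relatime 0 0\n"
        "/dev/loop0 /usr/local/share/icons squashfs ro,relatime 0 0\n"
    )
    print(loop_mounts(sample))
```

On a live system the input would be the contents of `/proc/mounts`, and each mount point returned could be passed to `umount` before hibernating. A volume that sits on a loop device only indirectly (for example through device-mapper) may not show up with this simple prefix check.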
Today I got a slightly different backtrace on the laptop (thinkpad t440s):
I also got a similar trace on another desktop machine (which is not able to resume with either vanilla or tuxonice hibernation).
While removing loop and the loop-mounted filesystem helped with the image finally getting written on both machines, resuming on the laptop broke after stage 1 with a blank screen. So I am unfortunately back to using vanilla hibernation, which is not exactly brilliant but works on the laptop and my primary desktop machine.
Ok, I had to replace a hard disk in my primary desktop machine, and after migrating the data to the new one onto encrypted partitions (previously unencrypted), I cannot use tuxonice on this machine anymore. On hibernating I see messages spilling over the splash screen for a fraction of a second (before they are hidden again behind the splash image); then, after resuming, I find this in dmesg:
Initiating a hibernation cycle.
Console is 75x240.
Using configuration file /etc/splash/tuxonice/1920x1200.cfg.
Framebuffer support initialised successfully.
Starting other threads.
Freezing user space processes ... (elapsed 0.001 seconds) done.
OOM killer disabled.
Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
hpet1: lost 1 rtc interrupts
hpet1: lost 2 rtc interrupts
Restarting kernel threads ... done.
Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
hpet1: lost 2 rtc interrupts
...20%...40%...60%
------------[ cut here ]------------
kernel BUG at block/blk-core.c:2242!
invalid opcode: 0000 [#1] PREEMPT SMP
Modules linked in: cfg80211 ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport ipv6 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables binfmt_misc nct6775 hwmon_vid snd_seq_dummy snd_seq_oss snd_emu10k1_synth snd_emux_synth snd_seq_midi_emul snd_seq_midi snd_seq_virmidi snd_seq_midi_event snd_seq loop snd_hda_codec_realtek snd_hda_codec_generic usb_storage snd_hda_codec_hdmi joydev snd_emu10k1 snd_hda_intel snd_hwdep snd_util_mem snd_ac97_codec snd_hda_codec tg3 x86_pkg_temp_thermal emu10k1_gp 8250 ac97_bus gameport snd_hda_core coretemp snd_rawmidi 8250_base snd_seq_device i2c_i801 sr_mod serial_core snd_pcm snd_timer snd soundcore
CPU: 0 PID: 1386 Comm: ext4lazyinit Not tainted 4.13.9+ #68
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Extreme4, BIOS P2.90 07/11/2013
task: ffff880409810000 task.stack: ffffc90003d94000
RIP: 0010:submit_bio+0xf/0x10e
RSP: 0018:ffffc90003d97d90 EFLAGS: 00010246
RAX: ffffc90003d97da8 RBX: ffff8803b3eece40 RCX: 0140004003d97de8
RDX: ffffffff81dfb3d8 RSI: ffffffff81aae8f1 RDI: ffff8803b3eece40
RBP: ffffc90003d97dd0 R08: ffff8804075a2ca0 R09: 0000000000001000
R10: 0000000000001000 R11: ffff8803b3eece40 R12: ffffc90003d97d98
R13: ffff880408292040 R14: ffff880408d190c0 R15: 000000000000000b
FS: 0000000000000000(0000) GS:ffff88041ec00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000174dc78 CR3: 0000000001c09000 CR4: 00000000000406f0
Call Trace:
 ? submit_bio_wait+0x4c/0x60
 blkdev_issue_zeroout+0x6b/0x96
 ext4_init_inode_table+0x1c2/0x2c6
 ? ext4_init_inode_table+0x1c2/0x2c6
 ext4_lazyinit_thread+0x123/0x2cd
 ? ext4_unregister_li_request.isra.8+0x57/0x57
 kthread+0x115/0x11d
 ? kthread_create_on_node+0x3a/0x3a
 ret_from_fork+0x22/0x30
Code: 70 06 00 00 00 00 00 00 48 83 c4 20 44 89 e8 5b 41 5a 41 5c 41 5d 5d 49 8d 62 f8 c3 83 3d b8 91 b0 00 00 74 08 f6 47 19 04 75 02 <0f> 0b 55 48 89 e5 41 54 53 48 89 fb 48 83 ec 20 48 85 ff 0f 84
RIP: submit_bio+0xf/0x10e RSP: ffffc90003d97d90
---[ end trace 34445b2d81695622 ]---
...80%
Waited for i/o due to synchronous I/O 4 times.
Waited for i/o due to throughput_throttle 5816 times.
Suspending console(s) (use no_console_suspend to debug)
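Since several reports in this thread hit the same BUG from different tasks (loop0 in one case, ext4lazyinit here), it can help to pull the key fields out of each dmesg capture and compare them. A minimal sketch follows; the helper name `summarize_oops` is hypothetical, and the patterns only cover the message layout seen in the log above.

```python
import re


def summarize_oops(log_text):
    """Extract the BUG location, task name, and faulting function from an oops.

    Returns None for any field that is not found in log_text.
    """
    bug = re.search(r"kernel BUG at (\S+):(\d+)!", log_text)
    comm = re.search(r"Comm: (\S+)", log_text)
    # The RIP line may or may not carry a segment prefix such as "0010:".
    rip = re.search(r"RIP: (?:\w+:)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+", log_text)
    return {
        "file": bug.group(1) if bug else None,
        "line": int(bug.group(2)) if bug else None,
        "comm": comm.group(1) if comm else None,
        "rip": rip.group(1) if rip else None,
    }


if __name__ == "__main__":
    # Three lines copied from the log above.
    log = (
        "kernel BUG at block/blk-core.c:2242!\n"
        "CPU: 0 PID: 1386 Comm: ext4lazyinit Not tainted 4.13.9+ #68\n"
        "RIP: 0010:submit_bio+0xf/0x10e\n"
    )
    print(summarize_oops(log))
```

Running this over each reporter's log would show at a glance whether the BUG location is the same while the submitting task differs.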
Again, vanilla hibernation works so far. Maybe this is some freezing/thawing issue or some threading issue?
IIRC, some encryption related things were taken out in the 4.13 based TOI, when compared with 4.12. I don't know the reasons for that step, maybe taking them back in could help you. Only @NigelCunningham can clarify that for sure.
This crash also occurs for me sometimes, and it also occurred with kernel 4.10. I do not have any encrypted mounts.
The strange thing is that it happens in process loop0 in my case too. So in my hibernate scripts I unmounted my only loopback device and also logged the running processes afterwards, and there was no loop0. However, it then re-appears for some reason during the actual hibernation...
Maybe something gets remounted? I still get the lockup with just simple loopback mounts of files, up to 4.14.6. But not with the systemd-mounted things.
I have changed the BUG_ON in blk-core.c to WARN_ON, and since then the system seems to be pretty stable. I can even leave the loopback filesystems mounted during hibernation.
So, I think, at least in the case of a loopback filesystem this call to submit_bio is harmless. I am wondering if this check could be eliminated, or perhaps made more selective.