[FLAKY TEST] openhcl linux direct reboot - `unable to handle page fault` on second boot
Example run https://github.com/microsoft/openvmm/actions/runs/11337175381/job/31528850443?pr=81
2024-10-15T01:11:59.758365Z INFO linux_log: [ 32.716659] usbcore: registered new interface driver usblp
2024-10-15T01:11:59.882830Z INFO linux_log: [ 32.776050] scsi 0:0:0:0: Direct-Access Msft Virtual Disk 1.0 PQ: 0 ANSI: 5
2024-10-15T01:11:59.952884Z INFO linux_log: [ 32.838490] usbcore: registered new interface driver usb-storage
2024-10-15T01:12:00.003502Z INFO linux_log: [ 33.034226] i8042: PNP: No PS/2 controller found.
2024-10-15T01:12:00.064171Z INFO linux_log: [ 33.083584] hv_vmbus: registering driver hyperv_keyboard
2024-10-15T01:12:00.180730Z INFO linux_log: [ 33.084020] BUG: unable to handle page fault for address: 000000000001e540
2024-10-15T01:12:00.748100Z INFO linux_log: [ 33.144323] #PF: supervisor read access in kernel mode
2024-10-15T01:12:00.806495Z INFO linux_log: [ 33.144323] #PF: error_code(0x0000) - not-present page
2024-10-15T01:12:00.845373Z INFO linux_log: [ 33.144323] PGD 0
2024-10-15T01:12:00.897926Z INFO linux_log: [ 33.818752] rtc_cmos 00:02: registered as rtc0
2024-10-15T01:12:00.929174Z INFO linux_log: [ 33.144323] P4D 0
2024-10-15T01:12:00.986809Z INFO linux_log: [ 33.144323] Oops: 0000 [#1] PREEMPT SMP NOPTI
2024-10-15T01:12:01.090009Z INFO linux_log: [ 33.144323] CPU: 0 PID: 27 Comm: kworker/u4:1 Not tainted 6.1.74 #1
2024-10-15T01:12:01.173688Z INFO linux_log: [ 33.144323] Workqueue: events_unbound async_run_entry_fn
2024-10-15T01:12:01.276349Z INFO linux_log: [ 33.890512] rtc_cmos 00:02: alarms up to one day, 114 bytes nvram
2024-10-15T01:12:01.300277Z INFO linux_log: [ 33.144323]
2024-10-15T01:12:01.376829Z INFO linux_log: [ 33.144323] RIP: 0010:__kmem_cache_alloc_node+0x95/0x180
2024-10-15T01:12:01.432217Z INFO linux_log: [ 33.978229] fail to initialize ptp_kvm
2024-10-15T01:12:01.737329Z INFO linux_log: [ 33.144323] Code: 0f 84 c6 00 00 00 41 83 fd ff 74 10 48 8b 00 48 c1 e8 36 41 39 c5 0f 85 b0 00 00 00 8b 45 28 48 8b 7d 00 48 8d 8a 00 20 00 00 <49> 8b 1c 00 4c 89 c0 65 48 0f c7 0f 0f 94 c0 84 c0 74 a0 8b 45 28
2024-10-15T01:12:01.832158Z INFO linux_log: [ 33.144323] RSP: 0000:ffffc900000efa00 EFLAGS: 00010246
2024-10-15T01:12:01.953260Z INFO linux_log: [ 33.144323] RAX: 0000000000000040 RBX: ffff888100041700 RCX: 00000000002d6000
2024-10-15T01:12:02.061728Z INFO linux_log: [ 33.144323] RDX: 00000000002d4000 RSI: 0000000000000a20 RDI: 000000000002b050
2024-10-15T01:12:02.185163Z INFO linux_log: [ 33.144323] RBP: ffff888100041700 R08: 000000000001e500 R09: 0000000000000000
2024-10-15T01:12:02.304608Z INFO linux_log: [ 33.144323] R10: ffff8881008ec270 R11: 0000000000000000 R12: 0000000000000a20
2024-10-15T01:12:02.427232Z INFO linux_log: [ 33.144323] R13: 00000000ffffffff R14: ffffffff815c1632 R15: 0000000000000080
2024-10-15T01:12:02.563104Z INFO linux_log: [ 33.144323] FS: 0000000000000000(0000) GS:ffff888104a00000(0000) knlGS:0000000000000000
2024-10-15T01:12:02.661410Z INFO linux_log: [ 33.144323] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-10-15T01:12:02.780322Z INFO linux_log: [ 33.144323] CR2: 000000000001e540 CR3: 000000000220c001 CR4: 0000000000370ef0
2024-10-15T01:12:02.900652Z INFO linux_log: [ 33.144323] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2024-10-15T01:12:03.017726Z INFO linux_log: [ 33.144323] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2024-10-15T01:12:03.059540Z INFO linux_log: [ 33.144323] Call Trace:
2024-10-15T01:12:03.095769Z INFO linux_log: [ 33.144323] <TASK>
2024-10-15T01:12:03.149511Z INFO linux_log: [ 33.144323] ? __die+0x54/0x92
2024-10-15T01:12:03.218622Z INFO linux_log: [ 33.144323] ? page_fault_oops+0xa9/0x1e0
2024-10-15T01:12:03.280800Z INFO linux_log: [ 33.144323] ? vsnprintf+0x367/0x560
2024-10-15T01:12:03.350558Z INFO linux_log: [ 33.144323] ? exc_page_fault+0x3ba/0x6c0
2024-10-15T01:12:03.420949Z INFO linux_log: [ 33.144323] ? add_uevent_var+0x82/0x110
2024-10-15T01:12:03.498604Z INFO linux_log: [ 33.144323] ? asm_exc_page_fault+0x22/0x30
2024-10-15T01:12:03.560791Z INFO linux_log: [ 33.144323] ? sdev_prefix_printk+0x52/0x110
2024-10-15T01:12:03.630968Z INFO linux_log: [ 33.144323] ? __kmem_cache_alloc_node+0x95/0x180
2024-10-15T01:12:03.693060Z INFO linux_log: [ 33.144323] kmalloc_trace+0x20/0x90
2024-10-15T01:12:03.766044Z INFO linux_log: [ 33.144323] sdev_prefix_printk+0x52/0x110
2024-10-15T01:12:03.836426Z INFO linux_log: [ 33.144323] ? kernfs_create_link+0x58/0xa0
2024-10-15T01:12:03.900852Z INFO linux_log: [ 33.144323] sg_add_device+0x216/0x2f0
2024-10-15T01:12:03.959548Z INFO linux_log: [ 33.144323] device_add+0x499/0x810
2024-10-15T01:12:04.049613Z INFO linux_log: [ 33.144323] ? _raw_spin_unlock_irqrestore+0x5/0x30
2024-10-15T01:12:04.123032Z INFO linux_log: [ 33.144323] ? __pm_runtime_resume+0x53/0x80
2024-10-15T01:12:04.202200Z INFO linux_log: [ 33.144323] scsi_sysfs_add_sdev+0x165/0x200
2024-10-15T01:12:04.284820Z INFO linux_log: [ 33.144323] scsi_probe_and_add_lun+0x90c/0xb90
2024-10-15T01:12:04.356258Z INFO linux_log: [ 33.144323] __scsi_scan_target+0xeb/0x4d0
2024-10-15T01:12:04.444605Z INFO linux_log: [ 33.144323] ? _raw_spin_unlock_irqrestore+0x5/0x30
2024-10-15T01:12:04.518832Z INFO linux_log: [ 33.144323] ? __pm_runtime_idle+0x43/0x100
2024-10-15T01:12:04.588511Z INFO linux_log: [ 33.144323] scsi_scan_channel+0x53/0x90
2024-10-15T01:12:04.669202Z INFO linux_log: [ 33.144323] scsi_scan_host_selected+0xcb/0x100
2024-10-15T01:12:04.738235Z INFO linux_log: [ 33.144323] scsi_scan_host+0x181/0x1a0
2024-10-15T01:12:04.799598Z INFO linux_log: [ 33.144323] storvsc_probe+0x292/0x440
2024-10-15T01:12:04.869317Z INFO linux_log: [ 33.144323] ? kernfs_create_link+0x58/0xa0
2024-10-15T01:12:04.924077Z INFO linux_log: [ 33.144323] vmbus_probe+0x36/0x50
2024-10-15T01:12:04.986316Z INFO linux_log: [ 33.144323] really_probe+0xc4/0x280
2024-10-15T01:12:05.058945Z INFO linux_log: [ 33.144323] ? pm_runtime_barrier+0x4b/0x80
2024-10-15T01:12:05.130450Z INFO linux_log: [ 33.144323] __driver_probe_device+0x6e/0xf0
2024-10-15T01:12:05.199927Z INFO linux_log: [ 33.144323] driver_probe_device+0x19/0xe0
2024-10-15T01:12:05.274774Z INFO linux_log: [ 33.144323] __driver_attach_async_helper+0x4a/0xa0
2024-10-15T01:12:05.342364Z INFO linux_log: [ 33.144323] async_run_entry_fn+0x1b/0xa0
2024-10-15T01:12:05.412797Z INFO linux_log: [ 33.144323] process_one_work+0x1d7/0x3a0
2024-10-15T01:12:05.476728Z INFO linux_log: [ 33.144323] worker_thread+0x48/0x3c0
2024-10-15T01:12:05.540354Z INFO linux_log: [ 33.144323] ? process_one_work+0x3a0/0x3a0
2024-10-15T01:12:05.589087Z INFO linux_log: [ 33.144323] kthread+0xd4/0x100
2024-10-15T01:12:05.652928Z INFO linux_log: [ 33.144323] ? kthread_exit+0x30/0x30
2024-10-15T01:12:05.711098Z INFO linux_log: [ 33.144323] ret_from_fork+0x22/0x30
2024-10-15T01:12:05.746151Z INFO linux_log: [ 33.144323] </TASK>
2024-10-15T01:12:05.801918Z INFO linux_log: [ 33.144323] CR2: 000000000001e540
2024-10-15T01:12:05.881787Z INFO linux_log: [ 33.144323] ---[ end trace 0000000000000000 ]---
hmm... if we tried an experiment where we zero memory for the whole guest, does it work? Or just zero out the whole range we load the linux kernel in? i wonder if we're missing some case in our loader where we need to zero out memory?
This is still being seen, although very rarely.
Still an issue
Last time seen with
[ linux] [ 1.468466] <TASK>
[ linux] [ 1.472563] ? __die+0x54/0x92
[ linux] [ 1.478016] ? page_fault_oops+0xa9/0x1e0
[ linux] [ 1.489578] ? match_pci_dev_by_id+0x56/0x80
[ linux] [ 1.502133] ? fixup_exception+0x1d/0x310
[ linux] [ 1.508344] ? exc_page_fault+0x34a/0x6c0
[ linux] [ 1.513106] ? security_kernfs_init_security+0x25/0x40
[ linux] [ 1.518099] ? asm_exc_page_fault+0x22/0x30
[ linux] [ 1.521943] ? pci_dev_put+0x20/0x20
[ linux] [ 1.525325] ? pci_dev_put+0x20/0x20
[ linux] [ 1.528529] ? match_pci_dev_by_id+0x56/0x80
[ linux] [ 1.532489] ? pci_dev_put+0x20/0x20
[ linux] [ 1.535880] bus_find_device+0x77/0xc0
[ linux] [ 1.539629] pci_get_device+0x69/0xb0
[ linux] [ 1.543009] pci_sysfs_init+0x1d/0x6d
[ linux] [ 1.547120] ? pci_driver_init+0x21/0x21
[ linux] [ 1.551510] do_one_initcall+0x7c/0x16a
[ linux] [ 1.556044] kernel_init_freeable+0x1f3/0x25a
[ linux] [ 1.562620] ? rest_init+0xd0/0xd0
[ linux] [ 1.566087] kernel_init+0x11/0x120
This has been hitting more and more often.
Slightly different symptom this time https://openvmm.dev/test-results/test.html?run=17181332154&job=x64-windows-amd-vmm-tests-logs&test=multiarch__openvmm_openhcl_linux_x64_reboot_no_agent#
Unexpected reloc type in static binary. segfault at 10 ip 000000000040146d sp 00007fff6ffc7ca0 error 4 in sh[401000+268000] likely on CPU 1 (core 1, socket 0)