openvmm icon indicating copy to clipboard operation
openvmm copied to clipboard

[FLAKY TEST] openhcl linux direct reboot - `unable to handle page fault` on second boot

Open smalis-msft opened this issue 1 year ago • 4 comments

Example run https://github.com/microsoft/openvmm/actions/runs/11337175381/job/31528850443?pr=81

2024-10-15T01:11:59.758365Z  INFO linux_log: [   32.716659] usbcore: registered new interface driver usblp
2024-10-15T01:11:59.882830Z  INFO linux_log: [   32.776050] scsi 0:0:0:0: Direct-Access     Msft     Virtual Disk     1.0  PQ: 0 ANSI: 5
2024-10-15T01:11:59.952884Z  INFO linux_log: [   32.838490] usbcore: registered new interface driver usb-storage
2024-10-15T01:12:00.003502Z  INFO linux_log: [   33.034226] i8042: PNP: No PS/2 controller found.
2024-10-15T01:12:00.064171Z  INFO linux_log: [   33.083584] hv_vmbus: registering driver hyperv_keyboard
2024-10-15T01:12:00.180730Z  INFO linux_log: [   33.084020] BUG: unable to handle page fault for address: 000000000001e540
2024-10-15T01:12:00.748100Z  INFO linux_log: [   33.144323] #PF: supervisor read access in kernel mode
2024-10-15T01:12:00.806495Z  INFO linux_log: [   33.144323] #PF: error_code(0x0000) - not-present page
2024-10-15T01:12:00.845373Z  INFO linux_log: [   33.144323] PGD 0
2024-10-15T01:12:00.897926Z  INFO linux_log: [   33.818752] rtc_cmos 00:02: registered as rtc0
2024-10-15T01:12:00.929174Z  INFO linux_log: [   33.144323] P4D 0
2024-10-15T01:12:00.986809Z  INFO linux_log: [   33.144323] Oops: 0000 [#1] PREEMPT SMP NOPTI
2024-10-15T01:12:01.090009Z  INFO linux_log: [   33.144323] CPU: 0 PID: 27 Comm: kworker/u4:1 Not tainted 6.1.74 #1
2024-10-15T01:12:01.173688Z  INFO linux_log: [   33.144323] Workqueue: events_unbound async_run_entry_fn
2024-10-15T01:12:01.276349Z  INFO linux_log: [   33.890512] rtc_cmos 00:02: alarms up to one day, 114 bytes nvram
2024-10-15T01:12:01.300277Z  INFO linux_log: [   33.144323]
2024-10-15T01:12:01.376829Z  INFO linux_log: [   33.144323] RIP: 0010:__kmem_cache_alloc_node+0x95/0x180
2024-10-15T01:12:01.432217Z  INFO linux_log: [   33.978229] fail to initialize ptp_kvm
2024-10-15T01:12:01.737329Z  INFO linux_log: [   33.144323] Code: 0f 84 c6 00 00 00 41 83 fd ff 74 10 48 8b 00 48 c1 e8 36 41 39 c5 0f 85 b0 00 00 00 8b 45 28 48 8b 7d 00 48 8d 8a 00 20 00 00 <49> 8b 1c 00 4c 89 c0 65 48 0f c7 0f 0f 94 c0 84 c0 74 a0 8b 45 28
2024-10-15T01:12:01.832158Z  INFO linux_log: [   33.144323] RSP: 0000:ffffc900000efa00 EFLAGS: 00010246
2024-10-15T01:12:01.953260Z  INFO linux_log: [   33.144323] RAX: 0000000000000040 RBX: ffff888100041700 RCX: 00000000002d6000
2024-10-15T01:12:02.061728Z  INFO linux_log: [   33.144323] RDX: 00000000002d4000 RSI: 0000000000000a20 RDI: 000000000002b050
2024-10-15T01:12:02.185163Z  INFO linux_log: [   33.144323] RBP: ffff888100041700 R08: 000000000001e500 R09: 0000000000000000
2024-10-15T01:12:02.304608Z  INFO linux_log: [   33.144323] R10: ffff8881008ec270 R11: 0000000000000000 R12: 0000000000000a20
2024-10-15T01:12:02.427232Z  INFO linux_log: [   33.144323] R13: 00000000ffffffff R14: ffffffff815c1632 R15: 0000000000000080
2024-10-15T01:12:02.563104Z  INFO linux_log: [   33.144323] FS:  0000000000000000(0000) GS:ffff888104a00000(0000) knlGS:0000000000000000
2024-10-15T01:12:02.661410Z  INFO linux_log: [   33.144323] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-10-15T01:12:02.780322Z  INFO linux_log: [   33.144323] CR2: 000000000001e540 CR3: 000000000220c001 CR4: 0000000000370ef0
2024-10-15T01:12:02.900652Z  INFO linux_log: [   33.144323] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2024-10-15T01:12:03.017726Z  INFO linux_log: [   33.144323] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2024-10-15T01:12:03.059540Z  INFO linux_log: [   33.144323] Call Trace:
2024-10-15T01:12:03.095769Z  INFO linux_log: [   33.144323]  <TASK>
2024-10-15T01:12:03.149511Z  INFO linux_log: [   33.144323]  ? __die+0x54/0x92
2024-10-15T01:12:03.218622Z  INFO linux_log: [   33.144323]  ? page_fault_oops+0xa9/0x1e0
2024-10-15T01:12:03.280800Z  INFO linux_log: [   33.144323]  ? vsnprintf+0x367/0x560
2024-10-15T01:12:03.350558Z  INFO linux_log: [   33.144323]  ? exc_page_fault+0x3ba/0x6c0
2024-10-15T01:12:03.420949Z  INFO linux_log: [   33.144323]  ? add_uevent_var+0x82/0x110
2024-10-15T01:12:03.498604Z  INFO linux_log: [   33.144323]  ? asm_exc_page_fault+0x22/0x30
2024-10-15T01:12:03.560791Z  INFO linux_log: [   33.144323]  ? sdev_prefix_printk+0x52/0x110
2024-10-15T01:12:03.630968Z  INFO linux_log: [   33.144323]  ? __kmem_cache_alloc_node+0x95/0x180
2024-10-15T01:12:03.693060Z  INFO linux_log: [   33.144323]  kmalloc_trace+0x20/0x90
2024-10-15T01:12:03.766044Z  INFO linux_log: [   33.144323]  sdev_prefix_printk+0x52/0x110
2024-10-15T01:12:03.836426Z  INFO linux_log: [   33.144323]  ? kernfs_create_link+0x58/0xa0
2024-10-15T01:12:03.900852Z  INFO linux_log: [   33.144323]  sg_add_device+0x216/0x2f0
2024-10-15T01:12:03.959548Z  INFO linux_log: [   33.144323]  device_add+0x499/0x810
2024-10-15T01:12:04.049613Z  INFO linux_log: [   33.144323]  ? _raw_spin_unlock_irqrestore+0x5/0x30
2024-10-15T01:12:04.123032Z  INFO linux_log: [   33.144323]  ? __pm_runtime_resume+0x53/0x80
2024-10-15T01:12:04.202200Z  INFO linux_log: [   33.144323]  scsi_sysfs_add_sdev+0x165/0x200
2024-10-15T01:12:04.284820Z  INFO linux_log: [   33.144323]  scsi_probe_and_add_lun+0x90c/0xb90
2024-10-15T01:12:04.356258Z  INFO linux_log: [   33.144323]  __scsi_scan_target+0xeb/0x4d0
2024-10-15T01:12:04.444605Z  INFO linux_log: [   33.144323]  ? _raw_spin_unlock_irqrestore+0x5/0x30
2024-10-15T01:12:04.518832Z  INFO linux_log: [   33.144323]  ? __pm_runtime_idle+0x43/0x100
2024-10-15T01:12:04.588511Z  INFO linux_log: [   33.144323]  scsi_scan_channel+0x53/0x90
2024-10-15T01:12:04.669202Z  INFO linux_log: [   33.144323]  scsi_scan_host_selected+0xcb/0x100
2024-10-15T01:12:04.738235Z  INFO linux_log: [   33.144323]  scsi_scan_host+0x181/0x1a0
2024-10-15T01:12:04.799598Z  INFO linux_log: [   33.144323]  storvsc_probe+0x292/0x440
2024-10-15T01:12:04.869317Z  INFO linux_log: [   33.144323]  ? kernfs_create_link+0x58/0xa0
2024-10-15T01:12:04.924077Z  INFO linux_log: [   33.144323]  vmbus_probe+0x36/0x50
2024-10-15T01:12:04.986316Z  INFO linux_log: [   33.144323]  really_probe+0xc4/0x280
2024-10-15T01:12:05.058945Z  INFO linux_log: [   33.144323]  ? pm_runtime_barrier+0x4b/0x80
2024-10-15T01:12:05.130450Z  INFO linux_log: [   33.144323]  __driver_probe_device+0x6e/0xf0
2024-10-15T01:12:05.199927Z  INFO linux_log: [   33.144323]  driver_probe_device+0x19/0xe0
2024-10-15T01:12:05.274774Z  INFO linux_log: [   33.144323]  __driver_attach_async_helper+0x4a/0xa0
2024-10-15T01:12:05.342364Z  INFO linux_log: [   33.144323]  async_run_entry_fn+0x1b/0xa0
2024-10-15T01:12:05.412797Z  INFO linux_log: [   33.144323]  process_one_work+0x1d7/0x3a0
2024-10-15T01:12:05.476728Z  INFO linux_log: [   33.144323]  worker_thread+0x48/0x3c0
2024-10-15T01:12:05.540354Z  INFO linux_log: [   33.144323]  ? process_one_work+0x3a0/0x3a0
2024-10-15T01:12:05.589087Z  INFO linux_log: [   33.144323]  kthread+0xd4/0x100
2024-10-15T01:12:05.652928Z  INFO linux_log: [   33.144323]  ? kthread_exit+0x30/0x30
2024-10-15T01:12:05.711098Z  INFO linux_log: [   33.144323]  ret_from_fork+0x22/0x30
2024-10-15T01:12:05.746151Z  INFO linux_log: [   33.144323]  </TASK>
2024-10-15T01:12:05.801918Z  INFO linux_log: [   33.144323] CR2: 000000000001e540
2024-10-15T01:12:05.881787Z  INFO linux_log: [   33.144323] ---[ end trace 0000000000000000 ]---

smalis-msft avatar Oct 15 '24 14:10 smalis-msft

hmm... if we tried an experiment where we zero memory for the whole guest, does it work? Or just zero out the whole range we load the linux kernel in? i wonder if we're missing some case in our loader where we need to zero out memory?

chris-oo avatar Nov 01 '24 13:11 chris-oo

This is still being seen, although very rarely.

smalis-msft avatar Mar 14 '25 17:03 smalis-msft

Still an issue

smalis-msft avatar Apr 16 '25 14:04 smalis-msft

Last time seen with

[     linux] [    1.468466]  <TASK>
[     linux] [    1.472563]  ? __die+0x54/0x92
[     linux] [    1.478016]  ? page_fault_oops+0xa9/0x1e0
[     linux] [    1.489578]  ? match_pci_dev_by_id+0x56/0x80
[     linux] [    1.502133]  ? fixup_exception+0x1d/0x310
[     linux] [    1.508344]  ? exc_page_fault+0x34a/0x6c0
[     linux] [    1.513106]  ? security_kernfs_init_security+0x25/0x40
[     linux] [    1.518099]  ? asm_exc_page_fault+0x22/0x30
[     linux] [    1.521943]  ? pci_dev_put+0x20/0x20
[     linux] [    1.525325]  ? pci_dev_put+0x20/0x20
[     linux] [    1.528529]  ? match_pci_dev_by_id+0x56/0x80
[     linux] [    1.532489]  ? pci_dev_put+0x20/0x20
[     linux] [    1.535880]  bus_find_device+0x77/0xc0
[     linux] [    1.539629]  pci_get_device+0x69/0xb0
[     linux] [    1.543009]  pci_sysfs_init+0x1d/0x6d
[     linux] [    1.547120]  ? pci_driver_init+0x21/0x21
[     linux] [    1.551510]  do_one_initcall+0x7c/0x16a
[     linux] [    1.556044]  kernel_init_freeable+0x1f3/0x25a
[     linux] [    1.562620]  ? rest_init+0xd0/0xd0
[     linux] [    1.566087]  kernel_init+0x11/0x120

romank-msft avatar May 09 '25 16:05 romank-msft

This has been hitting more and more often.

smalis-msft avatar Jun 23 '25 19:06 smalis-msft

Slightly different symptom this time https://openvmm.dev/test-results/test.html?run=17181332154&job=x64-windows-amd-vmm-tests-logs&test=multiarch__openvmm_openhcl_linux_x64_reboot_no_agent#

Unexpected reloc type in static binary. segfault at 10 ip 000000000040146d sp 00007fff6ffc7ca0 error 4 in sh[401000+268000] likely on CPU 1 (core 1, socket 0)

smalis-msft avatar Aug 25 '25 15:08 smalis-msft