
kernel BUG on TGL while running suspend-resume-with-audio

Open lyakh opened this issue 1 year ago • 4 comments

A kernel BUG was hit in cnl_ipc4_send_msg() (full log attached: tgl.dmesg.txt):

[  657.980098] BUG: unable to handle page fault for address: ffffc900025e8000
[  657.980102] #PF: supervisor read access in kernel mode
[  657.980104] #PF: error_code(0x0000) - not-present page
[  657.980106] PGD 103400067 P4D 103400067 PUD 103647067 PMD 14103e067 PTE 0
[  657.980112] Oops: 0000 [#1] PREEMPT SMP NOPTI
[  657.980115] CPU: 0 PID: 1319 Comm: irq/135-AudioDS Not tainted 6.7.0-rc3-gf39a3c2a6f73 #dev
[  657.980119] Hardware name: AAEON UPX-TGL01/UPX-TGL01, BIOS UXTGBM14 11/05/2021
[  657.980120] RIP: 0010:memcpy_toio+0x27/0x50
[  657.980127] Code: 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 85 d2 74 28 40 f6 c7 01 75 2f 48 83 fa 01 76 06 40 f6 c7 02 75 1b 48 89 d1 48 c1 e9 02 <f3> a5 f6 c2 02 74 02 66 a5 f6 c2 01 74 01 a4 c3 cc cc cc cc 66 a5
[  657.980129] RSP: 0018:ffffc90001b33de8 EFLAGS: 00010a06
[  657.980132] RAX: ffff88814e29b828 RBX: ffffc900025e7d20 RCX: 3fffe220504b1f5c
[  657.980134] RDX: ffff8881412c8000 RSI: ffffc900025e8000 RDI: ffffc900033a0290
[  657.980135] RBP: ffff88814e29b828 R08: ffffc900025e7d70 R09: 0000000000000000
[  657.980136] R10: 0000000000000001 R11: 00000000000027c3 R12: ffff8881073760e0
[  657.980138] R13: ffff888143f47028 R14: ffffffffaf91f830 R15: ffffffffaf91fb69
[  657.980139] FS:  0000000000000000(0000) GS:ffff888283800000(0000) knlGS:0000000000000000
[  657.980141] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  657.980143] CR2: ffffc900025e8000 CR3: 000000015ba08001 CR4: 0000000000f70ef0
[  657.980145] PKRU: 55555554
[  657.980146] Call Trace:
[  657.980149]  <TASK>
[  657.980152]  ? __die+0x24/0x70
[  657.980157]  ? page_fault_oops+0x15b/0x440
[  657.980162]  ? fixup_exception+0x26/0x350
[  657.980166]  ? exc_page_fault+0xea/0x1a0
[  657.980173]  ? asm_exc_page_fault+0x26/0x30
[  657.980177]  ? irq_thread+0xb9/0x1d0
[  657.980183]  ? __pfx_irq_thread_fn+0x10/0x10
[  657.980188]  ? memcpy_toio+0x27/0x50
[  657.980192]  cnl_ipc4_send_msg+0xe6/0x110 [snd_sof_pci_intel_cnl]
[  657.980202]  cnl_ipc4_irq_thread+0x16e/0x380 [snd_sof_pci_intel_cnl]
[  657.980210]  hda_dsp_interrupt_thread+0xbe/0x1a0 [snd_sof_intel_hda_generic]
[  657.980218]  irq_thread_fn+0x21/0x60
[  657.980223]  irq_thread+0xff/0x1d0
[  657.980227]  ? __pfx_irq_thread+0x10/0x10
[  657.980231]  ? __pfx_irq_thread_dtor+0x10/0x10
[  657.980236]  ? __pfx_irq_thread+0x10/0x10
[  657.980240]  kthread+0xe8/0x120
[  657.980245]  ? __pfx_kthread+0x10/0x10
[  657.980249]  ret_from_fork+0x31/0x50
[  657.980252]  ? __pfx_kthread+0x10/0x10
[  657.980255]  ret_from_fork_asm+0x1b/0x30
[  657.980262]  </TASK>
[  657.980263] Modules linked in: snd_sof_ipc_msg_injector snd_sof_probes snd_sof_nocodec snd_sof_pci_intel_tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic snd_sof_intel_hda_common snd_sof_intel_hda_mlink snd_sof_pci snd_sof snd_sof_utils snd_sof_xtensa_dsp snd_hda_ext_core snd_soc_core snd_compress snd_soc_acpi_intel_match snd_soc_acpi snd_intel_dspcfg snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd soundcore intel_rapl_common wmi_bmof x86_pkg_temp_thermal intel_powerclamp i915 i2c_algo_bit mei_me drm_buddy ttm mei drm_display_helper drm_kms_helper video wmi intel_pmc_core squashfs fuse drm efivarfs e1000e intel_lpss_pci xhci_pci intel_lpss xhci_hcd idma64 mfd_core
[  657.980309] CR2: ffffc900025e8000
[  657.980311] ---[ end trace 0000000000000000 ]---
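The register dump can be decoded a little further. The following is a minimal sketch, assuming the x86-64 calling convention (RDI = destination, RSI = source, RDX = length for memcpy_toio) and assuming the faulting instruction is the `rep movsl` marked `<f3> a5` in the Code: line, preceded by `mov %rdx,%rcx; shr $2,%rcx`. The function name `decode_rep_movsl` is made up for illustration. It is a reading of the dump, not a confirmed diagnosis.

```python
# Register values copied from the oops in this issue.
RDX = 0xFFFF8881412C8000  # assumed: byte count passed to memcpy_toio
RCX = 0x3FFFE220504B1F5C  # dwords still left to copy when the fault hit
RSI = 0xFFFFC900025E8000  # source pointer at the fault (same value as CR2)
RDI = 0xFFFFC900033A0290  # destination pointer at the fault
R08 = 0xFFFFC900025E7D70  # printed in the dump, compared against below


def decode_rep_movsl(rdx, rcx, rsi, rdi):
    """Reconstruct where a 'rep movsl' copy started from its register state.

    Assumes rcx was loaded with rdx >> 2 right before the copy and that
    rsi/rdi advance by 4 bytes per dword copied.
    """
    total_dwords = rdx >> 2
    copied_bytes = (total_dwords - rcx) * 4
    return {
        "copied_bytes": copied_bytes,
        "src_start": rsi - copied_bytes,
        "dst_start": rdi - copied_bytes,
    }


state = decode_rep_movsl(RDX, RCX, RSI, RDI)
print("bytes copied before fault:", state["copied_bytes"])
print("source start:             ", hex(state["src_start"]))
print("destination start:        ", hex(state["dst_start"]))
print("source start equals R08:  ", state["src_start"] == R08)
```

Under these assumptions the copy had moved 656 bytes when it faulted, the destination started at a page-aligned address, and the source started at the value held in R08. The fault is a read (`#PF: supervisor read access`) at the address in RSI, and the length in RDX looks more like a kernel pointer than a message size. That would suggest the source buffer or length was bad, but it needs checking against the code.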

lyakh avatar Jan 17 '24 13:01 lyakh

[  657.562952] [1610] sof-audio-pci-intel-tgl 0000:00:1f.3: IMR restore supported, booting from IMR directly
...
[  657.864038] [1610] sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x80000]=0x5000001 timedout
...
[  657.864529] sof-audio-pci-intel-tgl 0000:00:1f.3: IMR restore failed, trying to cold boot
...
[  657.925553] [1610] sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x80000]=0x5000001 successful
...
[  657.948089] [1610] sof-audio-pci-intel-tgl 0000:00:1f.3: Firmware download successful, booting...

Then comes the crash, which I think happens when FW_READY is received and we try to send the first IPC and copy it to the mailbox. The mailbox is just not there and the kernel panics? I would guess the message must have been a set_large for mtrace.

Somehow the IMR lost its content, and that is why the IMR boot fails:

[  657.562952] [1610] sof-audio-pci-intel-tgl 0000:00:1f.3: IMR restore supported, booting from IMR directly
[  657.864038] [1610] sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x80000]=0x5000001 timedout

This is really strange
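To put rough numbers on the sequence, here is a small sketch that computes the gaps between the log lines quoted in this issue. The helper names `parse` and `deltas_ms` are made up for illustration, and the input is limited to the excerpted lines, so anything elided with "..." is not accounted for.

```python
import re
from decimal import Decimal

# Lines copied verbatim from the log excerpts in this issue.
LOG = """\
[  657.562952] [1610] sof-audio-pci-intel-tgl 0000:00:1f.3: IMR restore supported, booting from IMR directly
[  657.864038] [1610] sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x80000]=0x5000001 timedout
[  657.864529] sof-audio-pci-intel-tgl 0000:00:1f.3: IMR restore failed, trying to cold boot
[  657.925553] [1610] sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x80000]=0x5000001 successful
[  657.948089] [1610] sof-audio-pci-intel-tgl 0000:00:1f.3: Firmware download successful, booting...
[  657.980098] BUG: unable to handle page fault for address: ffffc900025e8000
"""

# Leading "[ seconds.microseconds]" timestamp, then the rest of the line.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def parse(log):
    """Return (timestamp, message) pairs for every timestamped line."""
    events = []
    for line in log.splitlines():
        match = STAMP.match(line)
        if match:
            events.append((Decimal(match.group(1)), match.group(2)))
    return events


def deltas_ms(events):
    """Gap in milliseconds between each event and the one before it."""
    return [(b[0] - a[0]) * 1000 for a, b in zip(events, events[1:])]


events = parse(LOG)
for gap, (_, text) in zip(deltas_ms(events), events[1:]):
    print(f"+{gap:9.3f} ms  {text[:70]}")
```

On these lines the IMR boot poll times out about 301 ms after the IMR boot attempt starts, and the page fault is logged about 32 ms after "Firmware download successful".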

ujfalusi avatar Jan 17 '24 13:01 ujfalusi

@lyakh where was this problem detected, CI or your local setup?

plbossart avatar Jan 19 '24 16:01 plbossart

> @lyakh where was this problem detected, CI or your local setup?

@plbossart it was on my own setup.

lyakh avatar Jan 22 '24 08:01 lyakh

@lyakh, is there anything we can do about this? If the firmware messes up the SRAM window and it is not accessible from the host, there is not much the kernel can do. We don't know that until we try to access it, which is already too late.

ujfalusi avatar Feb 06 '24 07:02 ujfalusi