linux icon indicating copy to clipboard operation
linux copied to clipboard

[BUG] kernel call trace error when testing kmod reload after playback on GLK chromebook

Open keqiaozhang opened this issue 4 years ago • 16 comments

Describe the bug This is the first time we have observed this issue. no not sure about the preproduction rate. no etrace available. only kernel log is saved. CI daily report ID 1547.

To Reproduce sof-test/test-case/check-kmod-load-unload-after-playback.sh -l 15

Reproduction Rate Only reproduce one in CI daily test. will trigger the test manually next to figure it out.

Platform GLK Chromebook

Kernel log

kernel: [10102.726814] Hardware name: Google Bobba/Bobba, BIOS Google_Bobba.11825.0.2019_03_06_2015 03/06/2019
kernel: [10102.726817] RIP: 0010:dapm_find_widget+0x71/0xd0 [snd_soc_core]
kernel: [10102.726848] Code: 00 48 89 e7 e8 40 73 83 f6 49 89 e4 49 8b 6d 18 45 31 f6 48 8b 85 00 04 00 00 48 81 c5 00 04 00 00 48 39 c5 48 8d 58 e8 74 26 <48> 8b 7b 08 4c 89 e6 e8 d3 21 83 f6 85 c0 75 09 4c 39 6b 28 74 1c
kernel: [10102.726852] RSP: 0018:ffffb09300467d80 EFLAGS: 00010202
kernel: [10102.726856] RAX: dead000000000100 RBX: dead0000000000e8 RCX: 0000000000000000
kernel: [10102.726859] RDX: 0000000000000001 RSI: ffffffffc05f3a5f RDI: ffffa04ad0a14141
kernel: [10102.726862] RBP: ffffffffc0cb64e0 R08: 0000000000000001 R09: 0000000000000001
kernel: [10102.726864] R10: ffffb09300467e38 R11: 0000000000000000 R12: ffffffffc05f3a5e
kernel: [10102.726867] R13: ffffa04ab4d9ad68 R14: 0000000000000000 R15: 0000000000000001
kernel: [10102.726870] FS:  0000000000000000(0000) GS:ffffa04afbc00000(0000) knlGS:0000000000000000
kernel: [10102.726873] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [10102.726876] CR2: 00005587fa513e70 CR3: 000000011b6fe000 CR4: 0000000000350ee0
kernel: [10102.726878] Call Trace:
kernel: [10102.726885]  ? __mutex_unlock_slowpath+0x3e/0x2a0
kernel: [10102.726894]  snd_soc_dapm_set_pin+0x18/0xa0 [snd_soc_core]
kernel: [10102.726916]  snd_soc_dapm_disable_pin+0x2a/0x50 [snd_soc_core]
kernel: [10102.726938]  da7219_aad_irq_thread+0x22f/0x270 [snd_soc_da7219]
kernel: [10102.726946]  ? irq_forced_thread_fn+0x70/0x70
kernel: [10102.726951]  irq_thread_fn+0x1c/0x60
kernel: [10102.726955]  ? irq_thread+0x9c/0x1a0
kernel: [10102.726958]  irq_thread+0x102/0x1a0
kernel: [10102.726961]  ? wake_threads_waitq+0x30/0x30
kernel: [10102.726965]  ? irq_thread_dtor+0x90/0x90
kernel: [10102.726968]  kthread+0x134/0x150
kernel: [10102.726973]  ? kthread_park+0x80/0x80
kernel: [10102.726976]  ret_from_fork+0x1f/0x30
kernel: [10102.726983] Modules linked in: snd_soc_sst_bxt_da7219_max98357a snd_soc_hdac_hdmi snd_hda_codec_hdmi snd_soc_dmic snd_sof_pci(-) snd_sof_intel_hda_common snd_sof_intel_hda snd_soc_hdac_hda snd_hda_ext_core snd_hda_codec snd_hwdep snd_hda_core snd_sof_acpi snd_sof_intel_byt snd_sof_intel_bdw snd_sof_intel_ipc snd_sof snd_sof_xtensa_dsp snd_soc_acpi_intel_match snd_soc_acpi snd_intel_dspcfg soundwire_intel soundwire_generic_allocation soundwire_cadence snd_soc_rt1011 snd_soc_rt715 snd_soc_rt1308_sdw snd_soc_rt1308 snd_soc_rt711 snd_soc_rt700 snd_soc_max98373_sdw snd_soc_max98373_i2c snd_soc_max98373 snd_soc_max98090 snd_soc_max98357a snd_soc_wm8804_i2c snd_soc_wm8804 snd_soc_pcm512x_i2c snd_soc_pcm512x snd_soc_rt5682_sdw regmap_sdw soundwire_bus snd_soc_rt5682_i2c snd_soc_rt5682 snd_soc_rt5677 snd_soc_rt5677_spi snd_soc_rt5670 snd_soc_rt5660 snd_soc_rt5651 snd_soc_rt5645 snd_soc_rt5640 snd_soc_rl6231 snd_soc_rt298 snd_soc_rt286 snd_soc_rl6347a snd_soc_da7219 snd_soc_da7213 snd_soc_core
kernel: [10102.727054]  snd_compress snd_pcm asix usbnet x86_pkg_temp_thermal intel_powerclamp ledtrig_audio snd_soc_sst_ipc snd_soc_sst_dsp i915 i2c_algo_bit drm_kms_helper syscopyarea processor_thermal_device sysfillrect sysimgblt processor_thermal_rfim fb_sys_fops processor_thermal_mbox intel_soc_dts_iosf regmap_i2c snd_usbmidi_lib elan_i2c int3403_thermal int340x_thermal_zone int3400_thermal acpi_thermal_rel snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd soundcore drm efivarfs mmc_block spi_pxa2xx_platform xhci_pci sdhci_pci cqhci intel_lpss_pci sdhci xhci_hcd intel_lpss mfd_core i2c_hid [last unloaded: snd_usb_audio]
kernel: [10102.727140] ---[ end trace 13419a076e17e424 ]---
kernel: [10102.727143] RIP: 0010:dapm_find_widget+0x71/0xd0 [snd_soc_core]
kernel: [10102.727165] Code: 00 48 89 e7 e8 40 73 83 f6 49 89 e4 49 8b 6d 18 45 31 f6 48 8b 85 00 04 00 00 48 81 c5 00 04 00 00 48 39 c5 48 8d 58 e8 74 26 <48> 8b 7b 08 4c 89 e6 e8 d3 21 83 f6 85 c0 75 09 4c 39 6b 28 74 1c
kernel: [10102.727169] RSP: 0018:ffffb09300467d80 EFLAGS: 00010202
kernel: [10102.727172] RAX: dead000000000100 RBX: dead0000000000e8 RCX: 0000000000000000
kernel: [10102.727174] RDX: 0000000000000001 RSI: ffffffffc05f3a5f RDI: ffffa04ad0a14141
kernel: [10102.727177] RBP: ffffffffc0cb64e0 R08: 0000000000000001 R09: 0000000000000001
kernel: [10102.727180] R10: ffffb09300467e38 R11: 0000000000000000 R12: ffffffffc05f3a5e
kernel: [10102.727182] R13: ffffa04ab4d9ad68 R14: 0000000000000000 R15: 0000000000000001
kernel: [10102.727185] FS:  0000000000000000(0000) GS:ffffa04afbc00000(0000) knlGS:0000000000000000
kernel: [10102.727188] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [10102.727190] CR2: 00005587fa513e70 CR3: 000000011b6fe000 CR4: 0000000000350ee0
kernel: [10102.727205] BUG: kernel NULL pointer dereference, address: 0000000000000000
kernel: [10102.727208] #PF: supervisor instruction fetch in kernel mode
kernel: [10102.727211] #PF: error_code(0x0010) - not-present page
kernel: [10102.727214] PGD 0 P4D 0 
kernel: [10102.727218] Oops: 0010 [#2] SMP PTI
kernel: [10102.727221] CPU: 1 PID: 156379 Comm: irq/96-da7219-a Tainted: G      D           5.11.0-rc2-daily-default-20210111 #3
kernel: [10102.727225] Hardware name: Google Bobba/Bobba, BIOS Google_Bobba.11825.0.2019_03_06_2015 03/06/2019
kernel: [10102.727227] RIP: 0010:0x0
kernel: [10102.727232] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
kernel: [10102.727234] RSP: 0018:ffffb09300467ea8 EFLAGS: 00010286
kernel: [10102.727237] RAX: 0000000000000000 RBX: ffffb09300467ec8 RCX: 0000000000000000
kernel: [10102.727240] RDX: ffffffffb6a8dcd7 RSI: ffffffffb6a8dcdf RDI: ffffb09300467ec8
kernel: [10102.727243] RBP: ffffa04aa5a95040 R08: 0000000000000000 R09: 0000000000000001
kernel: [10102.727245] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa04aa5a95040
kernel: [10102.727247] R13: ffffffffb7d24238 R14: ffffa04aa5a95890 R15: ffffffffb8a05bb0
kernel: [10102.727250] FS:  0000000000000000(0000) GS:ffffa04afbc00000(0000) knlGS:0000000000000000
kernel: [10102.727253] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [10102.727255] CR2: ffffffffffffffd6 CR3: 000000011b6fe000 CR4: 0000000000350ee0
kernel: [10102.727258] Call Trace:
kernel: [10102.727260]  task_work_run+0x6e/0xc0
kernel: [10102.727265]  do_exit+0x3f3/0xc80
kernel: [10102.727271]  ? irq_thread_dtor+0x90/0x90
kernel: [10102.727274]  ? kthread+0x134/0x150
kernel: [10102.727278]  rewind_stack_do_exit+0x17/0x20
kernel: [10102.727281] RIP: 0000:0x0
kernel: [10102.727284] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
kernel: [10102.727287] RSP: 0000:0000000000000000 EFLAGS: 00000000 ORIG_RAX: 0000000000000000
kernel: [10102.727290] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
kernel: [10102.727292] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
kernel: [10102.727295] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
kernel: [10102.727297] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
kernel: [10102.727299] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
kernel: [10102.727303] Modules linked in: snd_soc_sst_bxt_da7219_max98357a snd_soc_hdac_hdmi snd_hda_codec_hdmi snd_soc_dmic snd_sof_pci(-) snd_sof_intel_hda_common snd_sof_intel_hda snd_soc_hdac_hda snd_hda_ext_core snd_hda_codec snd_hwdep snd_hda_core snd_sof_acpi snd_sof_intel_byt snd_sof_intel_bdw snd_sof_intel_ipc snd_sof snd_sof_xtensa_dsp snd_soc_acpi_intel_match snd_soc_acpi snd_intel_dspcfg soundwire_intel soundwire_generic_allocation soundwire_cadence snd_soc_rt1011 snd_soc_rt715 snd_soc_rt1308_sdw snd_soc_rt1308 snd_soc_rt711 snd_soc_rt700 snd_soc_max98373_sdw snd_soc_max98373_i2c snd_soc_max98373 snd_soc_max98090 snd_soc_max98357a snd_soc_wm8804_i2c snd_soc_wm8804 snd_soc_pcm512x_i2c snd_soc_pcm512x snd_soc_rt5682_sdw regmap_sdw soundwire_bus snd_soc_rt5682_i2c snd_soc_rt5682 snd_soc_rt5677 snd_soc_rt5677_spi snd_soc_rt5670 snd_soc_rt5660 snd_soc_rt5651 snd_soc_rt5645 snd_soc_rt5640 snd_soc_rl6231 snd_soc_rt298 snd_soc_rt286 snd_soc_rl6347a snd_soc_da7219 snd_soc_da7213 snd_soc_core
kernel: [10102.727357]  snd_compress snd_pcm asix usbnet x86_pkg_temp_thermal intel_powerclamp ledtrig_audio snd_soc_sst_ipc snd_soc_sst_dsp i915 i2c_algo_bit drm_kms_helper syscopyarea processor_thermal_device sysfillrect sysimgblt processor_thermal_rfim fb_sys_fops processor_thermal_mbox intel_soc_dts_iosf regmap_i2c snd_usbmidi_lib elan_i2c int3403_thermal int340x_thermal_zone int3400_thermal acpi_thermal_rel snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd soundcore drm efivarfs mmc_block spi_pxa2xx_platform xhci_pci sdhci_pci cqhci intel_lpss_pci sdhci xhci_hcd intel_lpss mfd_core i2c_hid [last unloaded: snd_usb_audio]
kernel: [10102.727407] CR2: 0000000000000000
kernel: [10102.727409] ---[ end trace 13419a076e17e425 ]---
kernel: [10102.727411] RIP: 0010:dapm_find_widget+0x71/0xd0 [snd_soc_core]
kernel: [10102.727432] Code: 00 48 89 e7 e8 40 73 83 f6 49 89 e4 49 8b 6d 18 45 31 f6 48 8b 85 00 04 00 00 48 81 c5 00 04 00 00 48 39 c5 48 8d 58 e8 74 26 <48> 8b 7b 08 4c 89 e6 e8 d3 21 83 f6 85 c0 75 09 4c 39 6b 28 74 1c
kernel: [10102.727436] RSP: 0018:ffffb09300467d80 EFLAGS: 00010202
kernel: [10102.727439] RAX: dead000000000100 RBX: dead0000000000e8 RCX: 0000000000000000
kernel: [10102.727442] RDX: 0000000000000001 RSI: ffffffffc05f3a5f RDI: ffffa04ad0a14141
kernel: [10102.727444] RBP: ffffffffc0cb64e0 R08: 0000000000000001 R09: 0000000000000001
kernel: [10102.727446] R10: ffffb09300467e38 R11: 0000000000000000 R12: ffffffffc05f3a5e
kernel: [10102.727449] R13: ffffa04ab4d9ad68 R14: 0000000000000000 R15: 0000000000000001
kernel: [10102.727451] FS:  0000000000000000(0000) GS:ffffa04afbc00000(0000) knlGS:0000000000000000
kernel: [10102.727454] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [10102.727456] CR2: ffffffffffffffd6 CR3: 000000011b6fe000 CR4: 0000000000350ee0
kernel: [10102.727459] Fixing recursive fault but reboot is needed!

keqiaozhang avatar Jan 11 '21 06:01 keqiaozhang

Set P3 and only reproduced once.

mengdonglin avatar Feb 01 '21 02:02 mengdonglin

Caught in CI today: https://sof-ci.01.org/linuxpr/PR2747/build5278/devicetest/?model=GLK_BOB_DA7219&testcase=check-kmod-load-unload-after-playback Bumping to P2.

kv2019i avatar Feb 09 '21 10:02 kv2019i

This looks like a codec issue @kv2019i by looking at the trace?

plbossart avatar Feb 09 '21 16:02 plbossart

adding dmesg log here: dmesg.txt

plbossart avatar Feb 09 '21 16:02 plbossart

possible fix at https://github.com/thesofproject/linux/pull/2749

plbossart avatar Feb 09 '21 17:02 plbossart

Observed again today: https://sof-ci.01.org/softestpr/PR628/build606/devicetest/?model=GLK_BOB_DA7219&testcase=check-kmod-load-unload-after-playback

marc-hb avatar Mar 11 '21 06:03 marc-hb

From test database, it is happening occasionally. I'm trying with my GLK Bobba, but interestingly I couldn't reproduce this issue.

  1. check-kmod-load-unload-with-playback.sh -l 50
  2. check-kmod-load-unload.sh -l 50
  3. check-kmod-load-unload-with-playback.sh -l 100

I'm using today's latest tip.

"kernelBranch" : "topic/sof-dev",
"kernelCommit" : "82c1e1f8",
"sofBranch" : "master",
"sofCommit" : "fa35ee00f29c"

fredoh9 avatar Mar 12 '21 00:03 fredoh9

It's indeed very very hard to reproduce, I ran 1000+ iterations of that test on a Bobba device and nothing bad happened.

plbossart avatar Mar 12 '21 14:03 plbossart

looks like DA codec snd_soc_da7219 issue? mark it as codec and change it to P3.

keyonjie avatar Mar 22 '21 02:03 keyonjie

Internal daily test 3669 found this NULL pointer dereference again. Kernel Commit: da1f1790

fredoh9 avatar May 03 '21 17:05 fredoh9

@fredoh9 is this test 3669 with KASAN enabled? I see an address "dead000000000100", that's weird otherwise...

plbossart avatar May 03 '21 19:05 plbossart

@plbossart No this is default build. I'm running with KASAN enabled kernel build also, but so far working fine.

fredoh9 avatar May 03 '21 20:05 fredoh9

Error seen again in https://sof-ci.01.org/linuxpr/PR2978/build5863/devicetest/?model=GLK_BOB_DA7219&testcase=check-kmod-load-unload-after-playback

This means PR #2749 does not fully solve the problem. Gah, it seems we see this error once per month.

plbossart avatar Jun 10 '21 21:06 plbossart

Same error is found in weekly stress test. Test result 5068. 5068?model=GLK_BOB_DA7219&testcase=check-kmod-load-unload-after-playback-25

  • Kernel Branch: topic/sof-dev
  • Kernel Commit: 6bd3089a
  • SOF Branch: main
  • SOF Commit: a439ea93f64f

fredoh9 avatar Jul 07 '21 16:07 fredoh9

Error seen again:https://sof-ci.sh.intel.com/#/result/planresultdetail/10920?model=GLK_BOB_DA7219&testcase=check-kmod-load-unload-25

keqiaozhang avatar Mar 11 '22 01:03 keqiaozhang

This issue still occurs sporadically in CI, so I suggest to keep this issue open but with low priority.

keqiaozhang avatar Jun 30 '22 01:06 keqiaozhang

The last time I saw this issue in CI was in December of 2022. Closing this bug.

keqiaozhang avatar May 30 '23 07:05 keqiaozhang