linux icon indicating copy to clipboard operation
linux copied to clipboard

Kernel oops on Helios Chromebook

Open plbossart opened this issue 4 years ago • 23 comments

Kernel oops detected by Intel test 4819 on CML_HEL_RT5682

Test Case URL: https://github.com/thesofproject/sof-test/tree/master/test-case/check-kmod-load-unload.sh cmd: TPLG=sof-cml-rt1011-rt5682.tplg ~/sof-test/test-case/check-kmod-load-unload.sh -l 25

dmesg.log

[ 5993.111452] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x40010000: GLB_PM_MSG: CTX_SAVE
[ 5993.111689] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x80]=0x20140000 successful
[ 5993.189809] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x4]=0x1010f0f successful
[ 5993.190413] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x4]=0xf0f successful
[ 5993.190421] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: DSP core(s) enabled? 0 : core_mask 1
[ 5994.688732] kernel: BUG: unable to handle page fault for address: ffffffffc0978200
[ 5994.688742] kernel: #PF: supervisor write access in kernel mode
[ 5994.688747] kernel: #PF: error_code(0x0002) - not-present page
[ 5994.688751] kernel: PGD 1efe27067 P4D 1efe27067 PUD 1efe29067 PMD 10ce6f067 PTE 0
[ 5994.688760] kernel: Oops: 0002 [#1] SMP NOPTI
[ 5994.688766] kernel: CPU: 2 PID: 80996 Comm: rmmod Not tainted 5.13.0-rc5-pr2989-5919-default #3c0c1a01
[ 5994.688772] kernel: Hardware name: Google Helios/Helios, BIOS  01/21/2020
[ 5994.688775] kernel: RIP: 0010:clk_unregister+0x189/0x240
[ 5994.688785] kernel: Code: 00 48 85 c0 75 de 49 8b 41 08 49 83 c1 08 48 85 c0 75 c9 48 8b 45 00 48 8b 90 c0 00 00 00 48 85 d2 74 2d 48 8b 88 b8 00 00 00 <48> 89 0a 48 85 c9 74 04 48 89 51 08 48 c7 80 b8 00 00 00 00 00 00
[ 5994.688790] kernel: RSP: 0018:ffffa30680777e08 EFLAGS: 00010286
[ 5994.688795] kernel: RAX: ffffa07637729f00 RBX: ffffa07637729f00 RCX: ffffffff9ef85280
[ 5994.688799] kernel: RDX: ffffffffc0978200 RSI: ffffa07637729f00 RDI: ffffa076024e7400
[ 5994.688802] kernel: RBP: ffffa07615958ba0 R08: ffffa076024e7400 R09: ffffffff9e5fda50
[ 5994.688805] kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffffa30680777e20
[ 5994.688808] kernel: R13: 000000000000000f R14: ffffa07600ae9fc0 R15: 0000000000000000
[ 5994.688811] kernel: FS:  00007f1516db6540(0000) GS:ffffa07756200000(0000) knlGS:0000000000000000
[ 5994.688816] kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5994.688820] kernel: CR2: ffffffffc0978200 CR3: 000000013a8d6002 CR4: 00000000003706e0
[ 5994.688824] kernel: Call Trace:
[ 5994.688829] kernel:  release_nodes+0x1a9/0x1f0
[ 5994.688840] kernel:  i2c_device_remove+0x37/0xb0
[ 5994.688848] kernel:  device_release_driver_internal+0xee/0x1d0
[ 5994.688853] kernel:  driver_detach+0x42/0x90
[ 5994.688858] kernel:  bus_remove_driver+0x56/0xd0
[ 5994.688866] kernel:  i2c_del_driver+0x19/0x40
[ 5994.688871] kernel:  __x64_sys_delete_module+0x18c/0x250
[ 5994.688881] kernel:  do_syscall_64+0x3d/0x80
[ 5994.688889] kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 5994.688897] kernel: RIP: 0033:0x7f1516f02bcb
[ 5994.688901] kernel: Code: 73 01 c3 48 8b 0d c5 82 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 95 82 0c 00 f7 d8 64 89 01 48
[ 5994.688906] kernel: RSP: 002b:00007ffe077397c8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 5994.688911] kernel: RAX: ffffffffffffffda RBX: 0000555ac7f2c790 RCX: 00007f1516f02bcb
[ 5994.688915] kernel: RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000555ac7f2c7f8
[ 5994.688918] kernel: RBP: 00007ffe07739828 R08: 0000000000000000 R09: 0000000000000000
[ 5994.688921] kernel: R10: 00007f1516f7eac0 R11: 0000000000000206 R12: 00007ffe07739a00
[ 5994.688924] kernel: R13: 00007ffe0773adf8 R14: 0000555ac7f2c2a0 R15: 0000555ac7f2c790
[ 5994.688929] kernel: Modules linked in: snd_soc_hdac_hdmi snd_hda_codec_hdmi snd_soc_dmic soundwire_intel soundwire_generic_allocation soundwire_cadence snd_soc_hdac_hda snd_hda_ext_core snd_hda_codec snd_hwdep snd_hda_core snd_soc_acpi snd_intel_dspcfg snd_intel_sdw_acpi snd_soc_max98373_sdw snd_soc_max98373_i2c snd_soc_max98373 snd_soc_max98090 snd_soc_max98357a snd_soc_wm8804_i2c snd_soc_wm8804 regmap_sdw soundwire_bus snd_soc_rt5682_i2c(-) snd_soc_rt5682 snd_soc_rl6231 snd_soc_rl6347a snd_soc_core snd_compress snd_pcm fuse ax88179_178a usbnet snd_usbmidi_lib intel_pmc_core_pltdrv intel_pmc_core x86_pkg_temp_thermal ledtrig_audio i915 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops regmap_i2c intel_pch_thermal snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd soundcore elan_i2c int3403_thermal int340x_thermal_zone int3400_thermal acpi_thermal_rel drm efivarfs spi_pxa2xx_platform intel_lpss_pci xhci_pci intel_lpss idma64 xhci_hcd
[ 5994.689015] kernel:  [last unloaded: snd_soc_rt5682_sdw]
[ 5994.689022] kernel: CR2: ffffffffc0978200
[ 5994.689025] kernel: ---[ end trace 76680917a2530f8b ]---
[ 5994.689029] kernel: RIP: 0010:clk_unregister+0x189/0x240
[ 5994.689035] kernel: Code: 00 48 85 c0 75 de 49 8b 41 08 49 83 c1 08 48 85 c0 75 c9 48 8b 45 00 48 8b 90 c0 00 00 00 48 85 d2 74 2d 48 8b 88 b8 00 00 00 <48> 89 0a 48 85 c9 74 04 48 89 51 08 48 c7 80 b8 00 00 00 00 00 00
[ 5994.689039] kernel: RSP: 0018:ffffa30680777e08 EFLAGS: 00010286
[ 5994.689043] kernel: RAX: ffffa07637729f00 RBX: ffffa07637729f00 RCX: ffffffff9ef85280
[ 5994.689046] kernel: RDX: ffffffffc0978200 RSI: ffffa07637729f00 RDI: ffffa076024e7400
[ 5994.689049] kernel: RBP: ffffa07615958ba0 R08: ffffa076024e7400 R09: ffffffff9e5fda50
[ 5994.689052] kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffffa30680777e20
[ 5994.689055] kernel: R13: 000000000000000f R14: ffffa07600ae9fc0 R15: 0000000000000000
[ 5994.689058] kernel: FS:  00007f1516db6540(0000) GS:ffffa07756200000(0000) knlGS:0000000000000000
[ 5994.689062] kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5994.689065] kernel: CR2: ffffffffc0978200 CR3: 000000013a8d6002 CR4: 00000000003706e0
[ 5994.689070] kernel: BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:49
[ 5994.689073] kernel: in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 80996, name: rmmod
[ 5994.689078] kernel: INFO: lockdep is turned off.
[ 5994.689080] kernel: CPU: 2 PID: 80996 Comm: rmmod Tainted: G      D           5.13.0-rc5-pr2989-5919-default #3c0c1a01
[ 5994.689085] kernel: Hardware name: Google Helios/Helios, BIOS  01/21/2020
[ 5994.689088] kernel: Call Trace:
[ 5994.689091] kernel:  dump_stack+0x64/0x7c
[ 5994.689098] kernel:  ___might_sleep.cold+0x95/0xa2
[ 5994.689106] kernel:  exit_signals+0x2b/0x220
[ 5994.689115] kernel:  do_exit+0xc2/0xb50
[ 5994.689123] kernel:  rewind_stack_do_exit+0x17/0x20
[ 5994.689129] kernel: RIP: 0033:0x7f1516f02bcb
[ 5994.689133] kernel: Code: 73 01 c3 48 8b 0d c5 82 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 95 82 0c 00 f7 d8 64 89 01 48
[ 5994.689137] kernel: RSP: 002b:00007ffe077397c8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 5994.689142] kernel: RAX: ffffffffffffffda RBX: 0000555ac7f2c790 RCX: 00007f1516f02bcb
[ 5994.689145] kernel: RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000555ac7f2c7f8
[ 5994.689148] kernel: RBP: 00007ffe07739828 R08: 0000000000000000 R09: 0000000000000000
[ 5994.689151] kernel: R10: 00007f1516f7eac0 R11: 0000000000000206 R12: 00007ffe07739a00
[ 5994.689153] kernel: R13: 00007ffe0773adf8 R14: 0000555ac7f2c2a0 R15: 0000555ac7f2c790

@sathya-nujella @naveen-manohar @cujomalainey @bzhg @bardliao @oder-chiou @shumingfan FYI probably unrelated to SOF, probably codec or machine driver issue.

plbossart avatar Jun 23 '21 15:06 plbossart

This error happens while unloading snd_soc_rt5682_i2c

RMMOD	snd_sof_pci_intel_tng
RMMOD	snd_sof_pci_intel_apl
RMMOD	snd_sof_pci_intel_cnl
RMMOD	snd_sof_pci_intel_icl
RMMOD	snd_sof_pci_intel_tgl
RMMOD	snd_sof_acpi_intel_byt
RMMOD	snd_sof_acpi_intel_bdw
RMMOD	snd_sof_intel_hda_common
RMMOD	snd_sof_acpi
RMMOD	snd_sof_pci
RMMOD	snd_sof_intel_atom
SKIP	snd_soc_catpt  	not loaded
SKIP	snd_intel_sst_acpi  	not loaded
SKIP	snd_intel_sst_core  	not loaded
SKIP	snd_soc_sst_atom_hifi2_platform  	not loaded
SKIP	snd_soc_skl  	not loaded
SKIP	snd_sof_intel_byt  	not loaded
SKIP	snd_sof_intel_bdw  	not loaded
SKIP	snd_sof_intel_hda_common  	not loaded
RMMOD	snd_sof_intel_hda
RMMOD	snd_sof_intel_ipc
RMMOD	snd_sof_xtensa_dsp
RMMOD	snd_soc_acpi_intel_match
SKIP	snd_soc_sof_rt5682  	not loaded
SKIP	snd_soc_sof_da7219_max98373  	not loaded
SKIP	snd_soc_sst_bdw_rt5677_mach  	not loaded
SKIP	snd_soc_sst_broadwell  	not loaded
SKIP	snd_soc_sst_bxt_da7219_max98357a  	not loaded
SKIP	snd_soc_sst_sof_pcm512x  	not loaded
SKIP	snd_soc_sst_bxt_rt298  	not loaded
SKIP	snd_soc_sst_sof_wm8804  	not loaded
SKIP	snd_soc_sst_byt_cht_da7213  	not loaded
SKIP	snd_soc_sst_byt_cht_es8316  	not loaded
SKIP	snd_soc_sst_bytcr_rt5640  	not loaded
SKIP	snd_soc_sst_bytcr_rt5651  	not loaded
SKIP	snd_soc_sst_cht_bsw_max98090_ti  	not loaded
SKIP	snd_soc_sst_cht_bsw_nau8824  	not loaded
SKIP	snd_soc_sst_cht_bsw_rt5645  	not loaded
SKIP	snd_soc_sst_cht_bsw_rt5672  	not loaded
SKIP	snd_soc_sst_glk_rt5682_max98357a  	not loaded
RMMOD	snd_soc_cml_rt1011_rt5682
SKIP	snd_soc_skl_hda_dsp  	not loaded
SKIP	snd_soc_sdw_rt700  	not loaded
SKIP	snd_soc_sdw_rt711_rt1308_rt715  	not loaded
SKIP	snd_soc_sof_sdw  	not loaded
SKIP	snd_soc_ehl_rt5660  	not loaded
RMMOD	snd_soc_intel_hda_dsp_common
SKIP	snd_soc_intel_sof_maxim_common  	not loaded
SKIP	snd_sof_ipc_test  	not loaded
SKIP	snd_sof_probes  	not loaded
SKIP	snd_sof_intel_client  	not loaded
SKIP	snd_sof_client  	not loaded
SKIP	snd_sof_nocodec  	not loaded
RMMOD	snd_sof
SKIP	snd_sof_nocodec  	not loaded
RMMOD	snd_soc_da7213
RMMOD	snd_soc_da7219
RMMOD	snd_soc_pcm512x_i2c
RMMOD	snd_soc_pcm512x
SKIP	snd_soc_rt274  	not loaded
RMMOD	snd_soc_rt286
RMMOD	snd_soc_rt298
RMMOD	snd_soc_rt700
RMMOD	snd_soc_rt711
RMMOD	snd_soc_rt1308
RMMOD	snd_soc_rt1308_sdw
RMMOD	snd_soc_rt715
SKIP	snd_soc_rt711_sdca  	not loaded
SKIP	snd_soc_rt1316_sdw  	not loaded
SKIP	snd_soc_rt715_sdca  	not loaded
SKIP	snd_soc_sdw_mockup  	not loaded
RMMOD	snd_soc_rt1011
RMMOD	snd_soc_rt5640
RMMOD	snd_soc_rt5645
RMMOD	snd_soc_rt5651
RMMOD	snd_soc_rt5660
RMMOD	snd_soc_rt5670
RMMOD	snd_soc_rt5677
RMMOD	snd_soc_rt5677_spi
RMMOD	snd_soc_rt5682_sdw
RMMOD	snd_soc_rt5682_i2c
/home/ubuntu/sof-test/tools/kmod/sof_remove.sh: line 5: 80995 Killed                  sudo rmmod "$MODULE"
snd_soc_hdac_hdmi      45056  0
snd_hda_codec_hdmi     73728  0
snd_soc_dmic           16384  0
soundwire_intel        40960  0
soundwire_generic_allocation    16384  1 soundwire_intel
soundwire_cadence      36864  1 soundwire_intel
snd_soc_hdac_hda       24576  0
snd_hda_ext_core       36864  2 snd_soc_hdac_hdmi,snd_soc_hdac_hda
snd_hda_codec         167936  2 snd_hda_codec_hdmi,snd_soc_hdac_hda
snd_hwdep              16384  1 snd_hda_codec
snd_hda_core          106496  5 snd_hda_codec_hdmi,snd_hda_ext_core,snd_hda_codec,snd_soc_hdac_hdmi,snd_soc_hdac_hda
snd_soc_acpi           16384  0
snd_intel_dspcfg       28672  0
snd_intel_sdw_acpi     20480  1 snd_intel_dspcfg
snd_soc_max98373_sdw    20480  0
snd_soc_max98373_i2c    20480  0
snd_soc_max98373       40960  2 snd_soc_max98373_i2c,snd_soc_max98373_sdw
snd_soc_max98090      114688  0
snd_soc_max98357a      20480  0
snd_soc_wm8804_i2c     16384  0
snd_soc_wm8804         20480  1 snd_soc_wm8804_i2c
regmap_sdw             16384  1 snd_soc_max98373_sdw
soundwire_bus          94208  5 regmap_sdw,soundwire_intel,soundwire_generic_allocation,soundwire_cadence,snd_soc_max98373_sdw
snd_soc_rt5682_i2c     16384  -1
snd_soc_rt5682         94208  1 snd_soc_rt5682_i2c
snd_soc_rl6231         20480  1 snd_soc_rt5682
snd_soc_rl6347a        16384  0
snd_soc_core          323584  12 snd_soc_max98373,snd_soc_wm8804,snd_soc_rt5682,soundwire_intel,snd_soc_hdac_hdmi,snd_soc_hdac_hda,snd_soc_max98373_i2c,snd_soc_max98090,snd_soc_max98357a,snd_soc_max98373_sdw,snd_soc_dmic,snd_soc_rt5682_i2c
snd_compress           32768  1 snd_soc_core
snd_pcm               155648  12 snd_hda_codec_hdmi,snd_soc_wm8804,snd_soc_rt5682,snd_hda_codec,soundwire_intel,snd_soc_hdac_hdmi,snd_compress,snd_soc_max98373_i2c,snd_soc_max98090,snd_soc_core,snd_soc_max98373_sdw,snd_hda_core
snd_usbmidi_lib        40960  0
drm_kms_helper        278528  1 i915
syscopyarea            16384  1 drm_kms_helper
sysfillrect            16384  1 drm_kms_helper
sysimgblt              16384  1 drm_kms_helper
fb_sys_fops            16384  1 drm_kms_helper
regmap_i2c             16384  4 snd_soc_wm8804_i2c,snd_soc_max98373_i2c,snd_soc_max98090,snd_soc_rt5682_i2c
snd_seq_midi           20480  0
snd_seq_midi_event     16384  1 snd_seq_midi
snd_rawmidi            45056  2 snd_seq_midi,snd_usbmidi_lib
snd_seq                86016  2 snd_seq_midi,snd_seq_midi_event
snd_seq_device         16384  3 snd_seq,snd_seq_midi,snd_rawmidi
snd_timer              40960  2 snd_seq,snd_pcm
snd                   102400  12 snd_seq,snd_seq_device,snd_hda_codec_hdmi,snd_hwdep,snd_usbmidi_lib,snd_hda_codec,snd_timer,snd_soc_hdac_hdmi,snd_compress,snd_soc_core,snd_pcm,snd_rawmidi
soundcore              16384  1 snd
drm                   552960  5 drm_kms_helper,i915
/home/ubuntu/sof-test/tools/kmod/sof_remove.sh FAILED

plbossart avatar Jun 23 '21 15:06 plbossart

@plbossart
[ 5994.688772] kernel: Hardware name: Google Helios/Helios, BIOS 01/21/2020 [ 5994.688775] kernel: RIP: 0010:clk_unregister+0x189/0x240 It seems the resource was released before when the snd_soc_rt5682_i2c unloads.

Could you enable devres_log to get more information before the script test?

# cd /sys/module/devres/parameters/
# echo 1 > log

shumingfan avatar Jun 24 '21 07:06 shumingfan

good suggestion @shumingfan. We've only seen this issue once so it's unlikely to reproduce immediately.

@fredoh9 can we add this to sof-dyndbg.conf

options devres log=1

It seems we already have CONFIG_DEBUG_DEVRES=y

plbossart avatar Jun 24 '21 16:06 plbossart

Sure, i will add that and collect more logs

fredoh9 avatar Jun 24 '21 17:06 fredoh9

Easy to reproduce with latest builds.

    "version" : {
        "kernelBranch" : "topic/sof-dev",
        "kernelCommit" : "695d444a",
        "sofBranch" : "main",
        "sofCommit" : "86c9a037e212",
        "firmwareType" : "xcc"
    },

cmdline.txt 0624_hel_dmesg.log

@plbossart do we want this devres log=1 for all SOF CI devices?

fredoh9 avatar Jun 24 '21 17:06 fredoh9

wait... i thought I reproduced this, but it's NOT. Hum, devres log print in error level? Then I need to ignore all the errors from devres. Let me play with it more and update again.

fredoh9 avatar Jun 24 '21 17:06 fredoh9

devres errors are fatal IMHO so should be show-stoppers, not filtered.

plbossart avatar Jun 24 '21 18:06 plbossart

@fredoh9 @plbossart I checked 0624_hel_dmesg.log, but it doesn't have oops messages. Could we change the devres message to dev_info level temporarily?

--- a/drivers/base/devres.c
+++ b/drivers/base/devres.c
@@ -58,7 +58,7 @@ static void devres_log(struct device *dev, struct devres_node *node,
                       const char *op)
 {
        if (unlikely(log_devres))
-               dev_err(dev, "DEVRES %3s %p %s (%zu bytes)\n",
+               dev_info(dev, "DEVRES %3s %p %s (%zu bytes)\n",

I suspect the release processes while unloading the snd_soc_rt5682_sdw module. Although the oops messages happened while unloading the snd_soc_rt5682_i2c module, the snd_soc_rt5682_sdw module was done before.

shumingfan avatar Jun 25 '21 07:06 shumingfan

This could be rather verbose. @fredoh9 maybe we can enable devres dynamic debug by default, but only change the kernel parameter before doing the kmod-load-unload tests (using sysfs instead of the modprobe.d options).

Maybe @andy-shev knows why a dev_err() is used, it's been that way since the beginning but it doesn't seem like a error at all.

plbossart avatar Jun 25 '21 19:06 plbossart

Without changing anything, I couldn't reproduce the problem. Even sof ci also find once with above, no more report after that. I tried same with above test, linux pr 2989 + sof pr 4386. I even tried with same build binaries with the test. No luck yet. I tried to test with -l 500 also.

I can submit a PR to enable devres log before starting kmod test, but we need to ignore "DEVRES*". When kernel oops happens, then we will get all the DEVRES logs from dmesg.

BTW, we have 4 different REVRES op

  • ADD
  • REP : replace
  • REM : remove
  • REL : release

Very verbose, when the issue happens, we need to still go through each line.

fredoh9 avatar Jun 28 '21 14:06 fredoh9

I agree with @fredoh9 this can be a painful problem to solve.

@shumingfan @oder-chiou @bardliao I would bet the problem came from the fact that devm_ is used for the clock management inside the component .probe callback. This is in theory not very good. devm_ associates the resources to a device, so should be used in the I2C probe, not the component probe. I was looking at rt5682_probe and could not figure out why in the SoundWire case we need to call the wait_for_completion_timeout() - it's not done in any other codec driver - and in the I2C case is there any merit in getting/registering the clocks when the card is created, instead of when the I2C driver is probed?

Wondering if this code could be moved in the regular I2C probe and SoundWire init_io. Comments welcome.

plbossart avatar Jun 28 '21 15:06 plbossart

To enable DEVRES for kmod test, https://github.com/thesofproject/sof-test/pull/719

fredoh9 avatar Jun 28 '21 19:06 fredoh9

I was looking at rt5682_probe and could not figure out why in the SoundWire case we need to call the wait_for_completion_timeout() - it's not done in any other codec driver - and in the I2C case is there any merit in getting/registering the clocks when the card is created, instead of when the I2C driver is probed?

@plbossart Sometimes we encounter that the component probe came earlier than io_init, so the driver waits for SdW slave initialization.

Wondering if this code could be moved in the regular I2C probe and SoundWire init_io. Comments welcome.

I understand your point. That's why I think the DAI clk is released by snd_soc_rt5682_sdw. But we don't have the DEVRES logs to confirm that. I feel ok move DAI clock registration to the I2C probe and SdW io_init doesn't need DAI clock.

shumingfan avatar Jun 29 '21 03:06 shumingfan

@plbossart [ 5994.688772] kernel: Hardware name: Google Helios/Helios, BIOS 01/21/2020 [ 5994.688775] kernel: RIP: 0010:clk_unregister+0x189/0x240 It seems the resource was released before when the snd_soc_rt5682_i2c unloads.

Could you enable devres_log to get more information before the script test?

# cd /sys/module/devres/parameters/
# echo 1 > log

Please, start using devres trace events instead. Relevant change: https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git/commit/?h=driver-core-next&id=09705dcb63d269000595284b5dd7f5c938d647b9

andy-shev avatar Jun 30 '21 08:06 andy-shev

@andy-shev not easy, all the CI is based on the dmesg log, that's where we saw the kernel oops. Adding a trace mechanism and correlating it with an oops isn't something I know how to do. Pointers welcome.

plbossart avatar Jun 30 '21 13:06 plbossart

@andy-shev not easy, all the CI is based on the dmesg log, that's where we saw the kernel oops. Adding a trace mechanism and correlating it with an oops isn't something I know how to do. Pointers welcome.

There is a knob to print traces in dmesg in case of Oops (https://elixir.bootlin.com/linux/latest/source/kernel/trace/trace.c#L119). Also you may enable trace events in the kernel command line (https://www.kernel.org/doc/html/latest/trace/events.html#boot-option).

(A bit outdated, e.g. tracefs vs. debugfs) overview https://events.static.linuxfound.org/sites/events/files/slides/praesentation_0.pdf. But better to see the official in-kernel documentation.

andy-shev avatar Jun 30 '21 14:06 andy-shev

Kernel oops on Helios Chromebook in inner daily 6090 too, when check-kmod-load-unload [dmesg]

[ 7171.670587] kernel: BUG: kernel NULL pointer dereference, address: 000000000000000a
[ 7171.670610] kernel: #PF: supervisor write access in kernel mode
[ 7171.670626] kernel: #PF: error_code(0x0002) - not-present page
[ 7171.670643] kernel: PGD 0 P4D 0 
[ 7171.670671] kernel: Oops: 0002 [#1] SMP NOPTI
[ 7171.670695] kernel: CPU: 1 PID: 83006 Comm: systemd-udevd Not tainted 5.14.0-rc6-daily-default-20210825 #1
[ 7171.670722] kernel: Hardware name: Google Helios/Helios, BIOS  01/21/2020
[ 7171.670735] kernel: RIP: 0010:__clk_register+0x490/0x7e0
[ 7171.670772] kernel: Code: 89 c4 49 89 47 30 49 8d 87 b8 00 00 00 4d 85 e4 0f 84 7a 02 00 00 49 8b 94 24 b0 00 00 00 49 89 97 b8 00 00 00 48 85 d2 74 04 <48> 89 42 08 49 89 84 24 b0 00 00 00 49 8d 84 24 b0 00 00 00 49 89
[ 7171.670796] kernel: RSP: 0018:ffffa1d08158f958 EFLAGS: 00010202
[ 7171.670815] kernel: RAX: ffff9de2d68a94b8 RBX: ffff9de2f0f30bc8 RCX: 0000000071426b9b
[ 7171.670831] kernel: RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9de2d68a9400
[ 7171.670836] kernel: RBP: 0000000000000001 R08: 00000000ffffffff R09: 0000000000000000
[ 7171.670836] kernel: R10: 0000000000000000 R11: ffff9de3e705786f R12: ffff9de2e8aec1a8
[ 7171.670836] kernel: R13: ffff9de2e7057a20 R14: ffff9de2ec0f29e8 R15: ffff9de2d68a9400
[ 7171.670836] kernel: FS:  00007f28a2c37880(0000) GS:ffff9de416000000(0000) knlGS:0000000000000000
[ 7171.670836] kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7171.670836] kernel: CR2: 000000000000000a CR3: 0000000116b10004 CR4: 00000000003706e0
[ 7171.670836] kernel: Call Trace:
[ 7171.670836] kernel:  ? clk_hw_unregister+0x10/0x10
[ 7171.670836] kernel:  clk_hw_register+0x19/0x40
[ 7171.670836] kernel:  devm_clk_hw_register+0x41/0x80
[ 7171.670836] kernel:  rt5682_probe+0x148/0x1f0 [snd_soc_rt5682]
[ 7171.670836] kernel:  snd_soc_component_probe+0x19/0x40 [snd_soc_core]
[ 7171.670836] kernel:  soc_probe_component+0x1cb/0x300 [snd_soc_core]
[ 7171.670836] kernel:  snd_soc_bind_card+0x506/0xcf0 [snd_soc_core]
[ 7171.670836] kernel:  ? is_module_address+0xc/0x20
[ 7171.670836] kernel:  ? lockdep_init_map_type+0x51/0x210
[ 7171.670836] kernel:  ? __raw_spin_lock_init+0x36/0x60
[ 7171.670836] kernel:  devm_snd_soc_register_card+0x3e/0x80 [snd_soc_core]
[ 7171.670836] kernel:  platform_probe+0x4f/0xb0
[ 7171.670836] kernel:  really_probe+0x1a6/0x3a0
[ 7171.670836] kernel:  __driver_probe_device+0xf9/0x170
[ 7171.670836] kernel:  driver_probe_device+0x19/0x90
[ 7171.670836] kernel:  __driver_attach+0x99/0x170
[ 7171.670836] kernel:  ? __device_attach_driver+0xd0/0xd0
[ 7171.670836] kernel:  ? __device_attach_driver+0xd0/0xd0
[ 7171.670836] kernel:  bus_for_each_dev+0x73/0xc0
[ 7171.670836] kernel:  bus_add_driver+0x14b/0x1f0
[ 7171.670836] kernel:  driver_register+0x67/0xb0
[ 7171.670836] kernel:  ? 0xffffffffc02e6000
[ 7171.670836] kernel:  do_one_initcall+0x64/0x220
[ 7171.670836] kernel:  ? __cond_resched+0x10/0x20
[ 7171.670836] kernel:  ? kmem_cache_alloc_trace+0x4a/0x1f0
[ 7171.670836] kernel:  do_init_module+0x56/0x200
[ 7171.670836] kernel:  load_module+0x258e/0x28f0
[ 7171.670836] kernel:  ? __do_sys_finit_module+0xae/0x110
[ 7171.670836] kernel:  __do_sys_finit_module+0xae/0x110
[ 7171.670836] kernel:  do_syscall_64+0x38/0x90
[ 7171.670836] kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 7171.670836] kernel: RIP: 0033:0x7f28a31b789d

XiaoyunWu6666 avatar Aug 26 '21 03:08 XiaoyunWu6666

@XiaoyunWu6666 Do you have the method to duplicate this issue again? I submitted PR #3130 that tries to fix this issue. Could you test with this PR?

shumingfan avatar Aug 31 '21 11:08 shumingfan

There is no magic test command to reproduce this for 100%. We can try with high iteration but even test with 100 loop may not reproduce the problem.

TPLG=sof-cml-rt1011-rt5682.tplg ~/sof-test/test-case/check-kmod-load-unload.sh -l 100

fredoh9 avatar Aug 31 '21 19:08 fredoh9

@XiaoyunWu6666 @fredoh9 I updated PR #3130 which passed the CI test with the CML_HEL_RT5682 machine. https://sof-ci.01.org/linuxpr/PR3130/build6398/devicetest/

Could you try to test with this PR?

shumingfan avatar Sep 23 '21 08:09 shumingfan

@XiaoyunWu6666 @fredoh9 I updated PR #3130 which passed the CI test with the CML_HEL_RT5682 machine. https://sof-ci.01.org/linuxpr/PR3130/build6398/devicetest/

Could you try to test with this PR?

Ran it in inner result 6926 , with kmod-load-unload for 50 rounds on multiple CML platforms. The test showed all PASS ... but not sure whether PR 3130 does fix it or not , because this is an issue with rather low reproduction rate

XiaoyunWu6666 avatar Sep 29 '21 02:09 XiaoyunWu6666

@keqiaozhang can you check if we've seen those issues on Helios Chromebook recently? it's almost been a year.

plbossart avatar Aug 26 '22 09:08 plbossart

@plbossart I don't think this is still reproducible, I don't even see this issue during the year of 2022 till now. I am OK to close it.

aiChaoSONG avatar Nov 17 '22 06:11 aiChaoSONG

I haven't gotten any bugs my way recently on this either

cujomalainey avatar Nov 29 '22 22:11 cujomalainey

let's close. There's still an issue with DA7219 but separate problem

plbossart avatar Nov 29 '22 23:11 plbossart