linux icon indicating copy to clipboard operation
linux copied to clipboard

[SoundWire] Invalid register access for RT1308

Open plbossart opened this issue 3 years ago • 17 comments

Internal Intel test result/planresultdetail/16234?model=TGLH_SKU0A5E_SDW_ZEPHYR&testcase=verify-kernel-boot-log

[    5.539486] kernel: SDW: Invalid device for paging :0
[    5.539490] kernel: rt1308 sdw:2:025d:1308:00: ASoC: error at snd_soc_component_update_bits on sdw:2:025d:1308:00 for register: [0x0000c070] -22

Dmesg: dmesg.txt

@bardliao FYI

plbossart avatar Oct 13 '22 23:10 plbossart

Could be a hardware issue. I see an 'enumeration_complete' and 'initialization_complete' only for one of the two amplifiers

[    4.346579] kernel: soundwire_bus:sdw_modify_slave_status: rt1308 sdw:1:025d:1308:00: signaling enumeration completion for Slave 4
[    4.346671] kernel: soundwire_bus:sdw_modify_slave_status: rt711 sdw:0:025d:0711:00: signaling enumeration completion for Slave 5
[    4.346802] kernel: soundwire_bus:sdw_modify_slave_status: rt715 sdw:3:025d:0715:00: signaling enumeration completion for Slave 6
[    4.351043] kernel: soundwire_bus:sdw_handle_slave_status: rt715 sdw:3:025d:0715:00: signaling initialization completion for Slave 6
[    4.353732] kernel: soundwire_bus:sdw_handle_slave_status: rt711 sdw:0:025d:0711:00: signaling initialization completion for Slave 5
[    4.355242] kernel: soundwire_bus:sdw_handle_slave_status: rt1308 sdw:1:025d:1308:00: signaling initialization completion for Slave 4

and all we get on link2 is:

[    7.959891] kernel: soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.2: Msg ignored for Slave 15

So I would think the device never showed up on the bus.

That said if that was the case there shouldn't be any register access. That's likely a corner case we didn't see before. Oh well, just what we needed, huh?

plbossart avatar Oct 13 '22 23:10 plbossart

@shumingfan any idea what this c070 register is?

it's part of the regmap apparently:

 	regmap_write(rt1308->regmap, 0xc070, 0x00);

plbossart avatar Oct 13 '22 23:10 plbossart

Last known good daily run: 16107

Start Time: 2022-10-11 21:36:57 UTC

Kernel Branch: topic/sof-dev Kernel Commit: f048bf5646cc SOF Branch: main SOF Commit: 5594ad351283 Zephyr Commit: v3.2.0-rc1-2-g0956647aaf6b

marc-hb avatar Oct 13 '22 23:10 marc-hb

I don't see any PRs merged in the last few days that could explain this error.

Can we run a test with plain vanilla 6.0 kernel on this device to see how this goes?

plbossart avatar Oct 13 '22 23:10 plbossart

@plbossart I checked the kernel log. I think the system wants to restore the control settings. The "RX Channel Select" control will access the 0xc070 register. However, the sdw:2:025d:1308:00 device doesn't be enumerated yet. I will try to add the default settings for 0xc070 register to avoid this issue.

shumingfan avatar Oct 14 '22 01:10 shumingfan

This is a hardware issue. I've tried yesterday's daily kernel on this device and it can be still reproduced, and after a long power-off, this issue disappeared.

Anyway, it would be better if the system can restore the settings automatically.

miRoox avatar Oct 14 '22 03:10 miRoox

@miRoox @plbossart Could you please apply PR #3927 to check if the issue will disappear?

shumingfan avatar Oct 14 '22 06:10 shumingfan

@shumingfan thanks, but sorry I don't know the way to reproduce it after the register got recovered, so I can only check the PR won't introduce other issues, but it's hard to say whether it really fixes this issue.

miRoox avatar Oct 14 '22 08:10 miRoox

Humm, what would adding the default value change @shumingfan? If userspace wants to restore a non-default value, then there will be an access generated.

I am wondering if we don't have a conceptual issue here. On boot, the card is registered when all required components are registered. That registration is done when we probe a driver, and that is done based on ACPI information. So it can very well happen that if the physical device is late in its enumeration we end-up exposing a non-functional device.

Moving the component registration after the device initialization would not help. The deferred_probe mechanism only works when there's a new probe.

if we did a driver probe when the physical device enumerates, that would create other issues, because we would need to wait for the probe to be finished to be able to call the initialization callback provided precisely by the driver.

That gives me a migraine.

@bardliao thoughts?

plbossart avatar Oct 14 '22 12:10 plbossart

@plbossart Adding the default value can ensure that driver will not try to access the register and can get the cached value properly when the regmap is in cache only mode. Currently, we set cache only = true in suspend and cache only = false in resume. My idea is to add cache only = true in the probe callback, and move cache only = false in the update_status callback when codec driver get the SDW_SLAVE_ATTACHED status. That way if userspace wants to restore a non-default value, then driver will cache those values and sync them once the codec is attached.

bardliao avatar Oct 17 '22 06:10 bardliao

That sounds like a good move @bardliao, we do need to prevent access to the registers until the initialization completes. The suspend-resume case is good but it doesn't cover the boot case.

plbossart avatar Oct 17 '22 14:10 plbossart

This issue happened again in our daily test. Test ID:16577

keqiaozhang avatar Oct 21 '22 01:10 keqiaozhang

@shumingfan I just tested #3927 because this issue happened again, but looks like the issue is still there even with #3927.

dmesg.txt

miRoox avatar Oct 21 '22 02:10 miRoox

@miRoox PR #3927 might solve a certain case. The solution for this issue is PR #3941.

shumingfan avatar Oct 21 '22 02:10 shumingfan

Another instance of the issue, just for the record: 16815?model=TGLH_SKU0A5E_SDW_ZEPHYR&testcase=check-sof-logger 16815?model=TGLH_SKU0A5E_SDW_ZEPHYR&testcase=verify-kernel-boot-log

Also this causes failures in a handful of the following tests.

greg-intel avatar Oct 25 '22 21:10 greg-intel

Still happening in today's daily 17437?model=TGLH_SKU0A5E_SDW_ZEPHYR&testcase=verify-kernel-boot-log

Draft PR https://github.com/thesofproject/linux/pull/3941 not merged.

I spotted some BIOS error too, dunno whether it's related

[    4.593755] kernel: snd_soc_rt711:rt711_sdw_read: rt711 sdw:0:025d:0711:00: [rt711_sdw_read] 7558 85d8 9c58 acd8 => 0000f888
[    4.604760] kernel: snd_soc_rt711:rt711_sdw_read: rt711 sdw:0:025d:0711:00: [rt711_sdw_read] 7558 85d8 9c58 acd8 => 0000f888
[    4.615774] kernel: snd_soc_rt711:rt711_sdw_read: rt711 sdw:0:025d:0711:00: [rt711_sdw_read] 7558 85d8 9c58 acd8 => 0000f888
[    4.626752] kernel: snd_soc_rt711:rt711_sdw_read: rt711 sdw:0:025d:0711:00: [rt711_sdw_read] 7558 85d8 9c58 acd8 => 0000f888
[    4.635199] kernel: ACPI BIOS Error (bug): Could not resolve symbol [\_TZ.ETMD], AE_NOT_FOUND (20220331/psargs-330)
[    4.635254] kernel: ACPI Error: Aborting method \_SB.IETM._OSC due to previous error (AE_NOT_FOUND) (20220331/psparse-529)
[    4.637780] kernel: snd_soc_rt711:rt711_sdw_read: rt711 sdw:0:025d:0711:00: [rt711_sdw_read] 7558 85d8 9c58 acd8 => 0000f888
[    4.648755] kernel: snd_soc_rt711:rt711_sdw_read: rt711 sdw:0:025d:0711:00: [rt711_sdw_read] 7558 85d8 9c58 acd8 => 0000f888
[    4.659756] kernel: snd_soc_rt711:rt711_sdw_read: rt711 sdw:0:025d:0711:00: [rt711_sdw_read] 7558 85d8 9c58 acd8 => 0000f888
[    4.670760] kernel: snd_soc_rt711:rt711_sdw_read: rt711 sdw:0:025d:0711:00: [rt711_sdw_read] 7558 85d8 9c58 acd8 => 0000f888
[    4.681801] kernel: snd_soc_rt711:rt711_sdw_read: rt711 sdw:0:025d:0711:00: [rt711_sdw_read] 7558 85d8 9c58 acd8 => 0000f888
[    4.693317] kernel: snd_soc_rt711:rt711_sdw_read: rt711 sdw:0:025d:0711:00: [rt711_sdw_read] 7558 85d8 9c58 acd8 => 0000f888


[    5.546254] kernel: snd_soc_rt711:rt711_sdw_write: rt711 sdw:0:025d:0711:00: [rt711_sdw_write] 7313 8393 <= 7003
[    5.546372] kernel: snd_soc_rt711:rt711_sdw_write: rt711 sdw:0:025d:0711:00: [rt711_sdw_write] 8393 7313 <= 7003
[    5.546434] kernel: snd_soc_rt711:rt711_sdw_write: rt711 sdw:0:025d:0711:00: [rt711_sdw_write] 3501 <= 0003
[    5.546676] kernel: SDW: Invalid device for paging :0
[    5.546681] kernel: rt1308 sdw:2:025d:1308:00: ASoC: error at snd_soc_component_update_bits on sdw:2:025d:1308:00 for register: [0x0000c070] -22
[    5.546772] kernel: snd_soc_rt715:rt715_sdw_write: rt715 sdw:3:025d:0715:00: [rt715_sdw_write] 3501 <= 0000
[    5.546896] kernel: snd_soc_rt715:rt715_sdw_write: rt715 sdw:3:025d:0715:00: [rt715_sdw_write] 7307 8387 <= 6080
[    5.547021] kernel: snd_soc_rt715:rt715_sdw_write: rt715 sdw:3:025d:0715:00: [rt715_sdw_write] 8387 7307 <= 5017
[    5.547146] kernel: snd_soc_rt715:rt715_sdw_write: rt715 sdw:3:025d:0715:00: [rt715_sdw_write] 7307 8387 <= 6080

marc-hb avatar Nov 07 '22 22:11 marc-hb

fwiw, I have a Dell Latitude 9520 (SKU: 0A3E) which have both rt1308, rt715 and rt711 (loads sof-tgl-rt715-rt711-rt1308-mono.tplg). I can not reproduce any of the issues seen in CI with these codecs (on sh-tglh-0a5e-sdw-01).

[   64.980723] kernel: rt1308 sdw:2:025d:1308:00: DPN_PortCtrl register write failed for port 1
[   64.980727] kernel: soundwire sdw-master-2: Program transport params failed: -61
[   64.980732] kernel: soundwire sdw-master-2: Program params failed: -61
[   64.980735] kernel:  SDW1-Playback: ASoC: error at snd_soc_link_prepare on SDW1-Playback: -61
[   64.980739] kernel:  SDW1-Playback: ASoC: error at __soc_pcm_prepare on SDW1-Playback: -61
[   64.980742] kernel:  Speaker: ASoC: error at dpcm_be_dai_prepare on Speaker: -61
[   64.980746] kernel:  Speaker: ASoC: error at dpcm_fe_dai_prepare on Speaker: -61

or

[ 1032.176065] kernel: soundwire sdw-master-1: sdw_compute_bus_params: could not find frame configuration for bus dr_freq 9600000
[ 1032.176068] kernel: soundwire sdw-master-1: Compute params failed: -22
[ 1032.176070] kernel:  SDW1-Playback: ASoC: error at snd_soc_link_prepare on SDW1-Playback: -22
[ 1032.176072] kernel:  SDW1-Playback: ASoC: error at __soc_pcm_prepare on SDW1-Playback: -22
[ 1032.176074] kernel:  Speaker: ASoC: error at dpcm_be_dai_prepare on Speaker: -22
[ 1032.176076] kernel:  Speaker: ASoC: error at dpcm_fe_dai_prepare on Speaker: -22

or the ones mentioned in this issue.

ujfalusi avatar Nov 08 '22 10:11 ujfalusi

@marc-hb do you know if we still test this platform? It's been a while and I've lost track of whether this was a hardware issue or something we need to look into.

plbossart avatar Dec 13 '22 00:12 plbossart

I don't know, @keqiaozhang , @greg-intel ?

marc-hb avatar Dec 13 '22 21:12 marc-hb

Yes, the daily test uses TGLH_SKU0A5E_SDW_ZEPHYR (Also TGLH_SKU0A70_HDA_ZEPHYR, of less importance)

greg-intel avatar Dec 14 '22 00:12 greg-intel

On the last three daily tests for TGLH_SKU0A5E_SDW_ZEPHYR, two were not run (not available) and one passed. This makes it hard to figure out if this issue is still present.

plbossart avatar Dec 14 '22 00:12 plbossart

@plbossart This issue is not present in CI. I did a cold reboot recently(AC off, wait about 1~2 mins for the device to discharge fully, then power on). This issue has occurred several times before, and we all use the same method to restore it.

keqiaozhang avatar Dec 14 '22 01:12 keqiaozhang