linux icon indicating copy to clipboard operation
linux copied to clipboard

Very frequent rt711-sdca "parity errors" on tglu-sku0a32-sdca / 6.10-rc5 regression?

Open marc-hb opened this issue 1 year ago • 19 comments

Starting from daily test run 43136, most tests fail on tglu-sku0a32-sdca most of the time, always because a parity error.

rt711-sdca sdw:0:0:025d:0711:01: Parity error detected

Last successful daily test run: 43016 on June 23rd with kernel commit e9eea140a1d9

First failing daily test run 43136 on June 25th with kernel commit c75a95d34885.

There are more than 1000 commits between the two, most of them apparently coming from 6.10-rc5

  • #5081
[  113.727660] kernel: soundwire_bus:sdw_program_device_num: soundwire sdw-master-0-0: No more devices to enumerate
[  113.727684] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x2000001
[  113.727698] kernel: soundwire_bus:sdw_modify_slave_status: rt711-sdca sdw:0:0:025d:0711:01: signaling enumeration completion for Slave 6
[  113.727881] kernel: soundwire_bus:sdw_slave_set_frequency: rt711-sdca sdw:0:0:025d:0711:01: Configured bus base 1, scale 3, mclk 19200000, curr_freq 4800000
[  113.829317] kernel: soundwire_cadence:cdns_check_attached_status_dwork: soundwire_intel soundwire_intel.link.0: Peripheral 6 status: 2
[  113.910742] kernel: snd_soc_rt711_sdca:rt711_sdca_calibration: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_calibration calibration complete, ret=0
[  113.915449] kernel: snd_soc_rt711_sdca:rt711_sdca_jack_init: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_init enable
[  113.915458] kernel: snd_soc_rt711_sdca:rt711_sdca_io_init: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_io_init hw_init complete
[  113.915463] kernel: soundwire_bus:sdw_handle_slave_status: rt711-sdca sdw:0:0:025d:0711:01: signaling initialization completion for Slave 6
[  113.915517] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x4000000
[  113.924650] kernel: snd_sof:sof_pcm_open: sof-audio-pci-intel-tgl 0000:00:1f.3: pcm: open stream 0 dir 0
[  113.924652] kernel: snd_sof:sof_pcm_open: sof-audio-pci-intel-tgl 0000:00:1f.3: period min 192 max 16384 bytes
[  113.924654] kernel: snd_sof:sof_pcm_open: sof-audio-pci-intel-tgl 0000:00:1f.3: period count 2 max 16
[  113.924655] kernel: snd_sof:sof_pcm_open: sof-audio-pci-intel-tgl 0000:00:1f.3: buffer max 65536 bytes
[  113.924803] kernel: snd_soc_rt711_sdca:rt711_sdca_pcm_hw_params: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_pcm_hw_params rt711-sdca-aif1
[  113.925461] kernel: rt711-sdca sdw:0:0:025d:0711:01: Parity error detected
[  113.925869] kernel: snd_sof:ipc3_log_header: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0x80010000: GLB_DAI_MSG: CONFIG
[  113.926110] kernel: snd_sof:sof_pcm_hw_params: sof-audio-pci-intel-tgl 0000:00:1f.3: pcm: hw params stream 0 dir 0
[  113.926117] kernel: snd_sof_intel_hda_common:hda_dsp_stream_hw_params: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x160]=0x40000 successful
[  113.926131] kernel: snd_sof_intel_hda_common:hda_dsp_stream_hw_params: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x160]=0x40000 successful
[  113.926134] kernel: snd_sof_intel_hda_common:hda_dsp_stream_setup_bdl: sof-audio-pci-intel-tgl 0000:00:1f.3: period_bytes:0x4000
[  113.926135] kernel: snd_sof_intel_hda_common:hda_dsp_stream_setup_bdl: sof-audio-pci-intel-tgl 0000:00:1f.3: periods:4
[  113.926137] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x2000000
[  113.926158] kernel: snd_sof:sof_ipc3_pcm_hw_params: sof-audio-pci-intel-tgl 0000:00:1f.3: stream_tag 1

cc: @bardliao , @plbossart , @ssavati , @shumingfan

marc-hb avatar Jul 02 '24 21:07 marc-hb

@bardliao is this something you can try to bisect with your TGL device? I don't have one available so can't do it myself...

plbossart avatar Jul 03 '24 09:07 plbossart

@bardliao is this something you can try to bisect with your TGL device? I don't have one available so can't do it myself...

Sure, I will try to bisect with my TGL device.

bardliao avatar Jul 03 '24 14:07 bardliao

FWIW, jf-cml-sku0983-sdw-1 uses RT711 too and has no failure.

marc-hb avatar Jul 03 '24 17:07 marc-hb

not the same version of RT711, CML uses the 'old' version and TGL the SDCA version. the -sdca suffix matters!

plbossart avatar Jul 04 '24 05:07 plbossart

This seems to be related to https://github.com/thesofproject/linux/pull/5074. I did some test today and confirmed that no issue on topic/sof-dev-rebase and broonie/for-next branch. And can reproduce the issue after applying https://github.com/thesofproject/linux/pull/5074 on top of the topic/sof-dev-rebase branch.

bardliao avatar Jul 04 '24 08:07 bardliao

Humm very interesting.

the first two commits of PR #5074 are ace-2.x specific so that would point to the last patch with the deferred workqueue.

I can't quite figure out what the deferred enumeration check would have to do with the parity error.

Can you try with this that last commit @bardliao ?

plbossart avatar Jul 04 '24 08:07 plbossart

Humm very interesting.

the first two commits of PR #5074 are ace-2.x specific so that would point to the last patch with the deferred workqueue.

I can't quite figure out what the deferred enumeration check would have to do with the parity error.

Can you try with this that last commit @bardliao ?

Sure, I am guessing it is related to slave->first_interrupt_done.

bardliao avatar Jul 04 '24 08:07 bardliao

@plbossart Apply the "soundwire: cadence: re-check Peripheral status with delayed_work" commit can reproduce the issue.

bardliao avatar Jul 04 '24 08:07 bardliao

can you check the logs @bardliao, in theory if you see a parity error then it should be reported in this delayed work part

	for (i = 0; i <= SDW_MAX_DEVICES; i++) {
		status[i] = val & 0x3;
		if (status[i])
			dev_dbg(cdns->dev, "Peripheral %d status: %d\n", i, status[i]);
		val >>= 2;
	}

If there's really a parity issue this would be status 0x2 (alert).

plbossart avatar Jul 04 '24 09:07 plbossart

@plbossart Yes, [ 33.214013] soundwire_cadence:cdns_check_attached_status_dwork: soundwire_intel soundwire_intel.link.0: Peripheral 6 status: 2 And sdw_handle_slave_alerts is called twice.

[   33.214031] rt711-sdca sdw:0:0:025d:0711:01: bard: sdw_handle_slave_alerts
[   33.294601] rt711-sdca sdw:0:0:025d:0711:01: bard: sdw_handle_slave_alerts

Then

[   33.302491] rt711-sdca sdw:0:0:025d:0711:01: bard: buf 0x1
[   33.302494] rt711-sdca sdw:0:0:025d:0711:01: bard: buf & SDW_SCP_INT1_PARITY, slave->first_interrupt_done 0 slave->prop.scp_int1_mask 0x3 slave->prop.quirks 0x1
[   33.302496] rt711-sdca sdw:0:0:025d:0711:01: bard: sdw_handle_slave_alerts clear 0x1
[   33.302553] rt711-sdca sdw:0:0:025d:0711:01: bard: sdw_handle_slave_alerts set first_interrupt_done = true
[   33.302616] rt711-sdca sdw:0:0:025d:0711:01: bard: buf 0x1
[   33.302617] rt711-sdca sdw:0:0:025d:0711:01: bard: buf & SDW_SCP_INT1_PARITY, slave->first_interrupt_done 1 slave->prop.scp_int1_mask 0x3 slave->prop.quirks 0x1
[   33.302618] rt711-sdca sdw:0:0:025d:0711:01: Parity error detected
[   33.302621] rt711-sdca sdw:0:0:025d:0711:01: bard: sdw_handle_slave_alerts clear 0x1
[   33.302678] rt711-sdca sdw:0:0:025d:0711:01: bard: sdw_handle_slave_alerts set first_interrupt_done = true
diff --git a/drivers/soundwire/bus.c b/drivers/soundwire/bus.c
index 263ca32f0c5c..53df723fe258 100644
--- a/drivers/soundwire/bus.c
+++ b/drivers/soundwire/bus.c
@@ -1618,6 +1618,7 @@ static int sdw_handle_slave_alerts(struct sdw_slave *slave)
        bool parity_check;
        bool parity_quirk;

+       dev_info(&slave->dev, "bard: %s\n", __func__);
        sdw_modify_slave_status(slave, SDW_SLAVE_ALERT);

        ret = pm_runtime_get_sync(&slave->dev);
@@ -1654,6 +1655,7 @@ static int sdw_handle_slave_alerts(struct sdw_slave *slave)
        }

        do {
+               dev_info(&slave->dev, "bard: buf %#x\n", buf);
                slave_notify = false;

                /*
@@ -1665,6 +1667,8 @@ static int sdw_handle_slave_alerts(struct sdw_slave *slave)
                        parity_quirk = !slave->first_interrupt_done &&
                                (slave->prop.quirks & SDW_SLAVE_QUIRKS_INVALID_INITIAL_PARITY);

+                       dev_info(&slave->dev, "bard: buf & SDW_SCP_INT1_PARITY, slave->first_interrupt_done %d slave->prop.scp_int1_mask %#x slave->prop.quirks %#x\n",
+                               slave->first_interrupt_done, slave->prop.scp_int1_mask, slave->prop.quirks);
                        if (parity_check && !parity_quirk)
                                dev_err(&slave->dev, "Parity error detected\n");
                        clear |= SDW_SCP_INT1_PARITY;
@@ -1754,6 +1758,7 @@ static int sdw_handle_slave_alerts(struct sdw_slave *slave)
                }

                /* Ack interrupt */
+               dev_info(&slave->dev, "bard: %s clear %#x\n", __func__, clear);
                ret = sdw_write_no_pm(slave, SDW_SCP_INT1, clear);
                if (ret < 0) {
                        dev_err(&slave->dev,
@@ -1763,6 +1768,7 @@ static int sdw_handle_slave_alerts(struct sdw_slave *slave)

                /* at this point all initial interrupt sources were handled */
                slave->first_interrupt_done = true;
+               dev_info(&slave->dev, "bard: %s set first_interrupt_done = true\n", __func__);

                /*
                 * Read status again to ensure no new interrupts arrived

My theory is that we got the alert interrupt and right after that cdns_check_attached_status_dwork is called. Full log: dmesg_parity.txt

bardliao avatar Jul 05 '24 02:07 bardliao

@plbossart Potential fix: https://github.com/thesofproject/linux/pull/5095

bardliao avatar Jul 05 '24 06:07 bardliao

cancel_work_sync() does two things a) it cancels the work if it didn't start already or, b) it waits for the work to be finished.

the b) part makes sense here, but what about a)?

If that if we cancel a scheduled workqueue before it's run, we will keep the interrupts disabled forever, no?

How would we deal with the alert if we first deal with the attachment dwork?

plbossart avatar Jul 05 '24 06:07 plbossart

the scenario that worries me is

a) an alert occurs b) sdw_cdns_irq() detects the alert and schedules the workqueue c) the dwork is scheduled and it cancels the workqueue before it's executed

-> cdns_update_slave_status_work() will not be executed and as a result the interrupt will not be cleared with a write to MCP_INTSTAT

the open is whether would be handled a second time. Because the device did not change status, the INTSTATn bit will not change so the interrupt will not be handled.

Looks like we need a different way of preventing the two workqueues from executing concurrently, maybe the only solution is a mutex?

plbossart avatar Jul 05 '24 08:07 plbossart

Alternate mutex-based solution in https://github.com/thesofproject/linux/pull/5096 @bardliao - completely untested, just to show what I had in mind.

plbossart avatar Jul 05 '24 08:07 plbossart

the scenario that worries me is

a) an alert occurs b) sdw_cdns_irq() detects the alert and schedules the workqueue c) the dwork is scheduled and it cancels the workqueue before it's executed

-> cdns_update_slave_status_work() will not be executed and as a result the interrupt will not be cleared with a write to MCP_INTSTAT

the open is whether would be handled a second time. Because the device did not change status, the INTSTATn bit will not change so the interrupt will not be handled.

Looks like we need a different way of preventing the two workqueues from executing concurrently, maybe the only solution is a mutex?

We will check Peripheral status after the interrupts are disabled in cdns_check_attached_status_dwork, right? And we can get the alert. Am I missing something?

bardliao avatar Jul 05 '24 08:07 bardliao

yes we'll get the peripheral status, but we will not deal with the interrupt registers, e.g. MCP_INTSTAT will not be cleared. I think we will create a mess if we deal with interrupt registers in two places. All we wanted was to read and update the status, this can be done in a simpler way I think

plbossart avatar Jul 05 '24 09:07 plbossart

Just spotted a couple of parity errors today on ba-adlp-rvp-sdw-02 in the daily test run.

43530?model=ADLP_RVP_SDW-ipc4&testcase=check-playback-50rounds

They are nowhere near as frequent though.

Still very frequent on ba-adlp-rvp-sdw-01. Today's 43592?model=ADLP_RVP_SDW-ipc4&testcase=multiple-pause-resume-50

marc-hb avatar Jul 05 '24 21:07 marc-hb

These parity errors have been noising test results for 2 weeks now. As usual, this will cause other, unrelated regressions to go unnoticed. Could there be some temporary workaround/revert in the short term?

marc-hb avatar Jul 11 '24 20:07 marc-hb

These parity errors have been noising test results for 2 weeks now. As usual, this will cause other, unrelated regressions to go unnoticed. Could there be some temporary workaround/revert in the short term?

@marc-hb https://github.com/thesofproject/linux/pull/5096 is expected to fix the issue.

bardliao avatar Jul 13 '24 13:07 bardliao

@marc-hb https://github.com/thesofproject/linux/pull/5096 is expected to fix the issue.

Today's daily run 43863 is not finished yet but I'm already seeing a lot of green!

marc-hb avatar Jul 15 '24 17:07 marc-hb

Today's daily run 43863 is not finished yet but I'm already seeing a lot of green!

All green; from my perspective this can be closed.

marc-hb avatar Jul 15 '24 20:07 marc-hb