sof icon indicating copy to clipboard operation
sof copied to clipboard

[BUG] [ZEPHYR] sof-logger dead: Invalid filename length 1650680879 or ldc file does not match firmware

Open marc-hb opened this issue 3 years ago • 15 comments

Rarely, the sof-logger dies like this at the very start of a test: https://sof-ci.01.org/sofpr/PR5340/build11976/devicetest/?model=TGLU_UP_HDA_ZEPHYR&testcase=test-speaker

error: Invalid filename length 1650680879 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x25007) returned -22
error: fetch_entry() failed with: -22, aborting

This happens in about 1 out 50 test runs.

The dictionary is valid because other tests immediately before and after are fine in the exact same configuration.

While the "sof-logger was already dead" error message is in the end the same as #5120 (which was hiding this bug), the symptoms are extremely different.

This caused a number of failures in recent daily runs 10079, 10105 and 10146. It seems Zephyr specific. It's not clear why it's even more rare in PR testing.

10079?model=TGLU_UP_HDA_ZEPHYR&testcase=volume-basic-test-50 10079?model=WHL_UPEXT_HDA_ZEPHYR&testcase=multiple-pipeline-capture-50

10105?model=APL_UP2_NOCODEC_ZEPHYR&testcase=multiple-pipeline-playback-50

10146?model=APL_UP2_NOCODEC_ZEPHYR&testcase=check-capture-50rounds 10146?model=APL_UP2_NOCODEC_ZEPHYR&testcase=multiple-pause-resume-50 10146?model=TGLH_RVP_NOCODEC_ZEPHYR&testcase=check-xrun-injection-playback-10

Daily run 10079:

Start Time: 2022-02-09 22:27:36 UTC Kernel Branch: topic/sof-dev Kernel Commit: 8bf95cb0 SOF Branch: main SOF Commit: 9983ab67a9fe Zephyr Commit: 41a63f48fdc8

DMA alignment fix c11562b00a29 for #5120 was between 10079 and 10105

Daily run 10105 was:

Start Time: 2022-02-10 22:27:29 UTC Kernel Branch: topic/sof-dev Kernel Commit: 936898a6 SOF Branch: main SOF Commit: 443b21de4b3f Zephyr Commit: 5094a6e08cee

Daily run 10146 was:

Kernel Branch: topic/sof-dev Kernel Commit: 98119478 SOF Branch: main SOF Commit: b8954754f055 Zephyr Commit: v3.0.0-rc3

marc-hb avatar Feb 12 '22 01:02 marc-hb

P1 because it affects PR and daily testing.

@kv2019i could this be related to https://github.com/thesofproject/linux/issues/3275

@jsarha any clue now that you're an expert after fixing #5120?

marc-hb avatar Feb 12 '22 01:02 marc-hb

1650680879 == 0x6263642F == "bcd/" So it looks like we are reading a header? as length.

lgirdwood avatar Feb 14 '22 10:02 lgirdwood

I don't think so, the values seem fairly random. Here's a sample of a few different failures:

error: Invalid filename length 12320 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x35008) returned -22
error: fetch_entry() failed with: -22, aborting

error: Invalid filename length 1650680879 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x25007) returned -22
error: fetch_entry() failed with: -22, aborting

error: Invalid filename length 1601203553 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x22006) returned -22
error: fetch_entry() failed with: -22, aborting
Skipped 0 bytes after the last statement.

error: Invalid filename length 1601463667 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x22006) returned -22
error: fetch_entry() failed with: -22, aborting

error: Invalid filename length 7471104 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x23006) returned -22
error: fetch_entry() failed with: -22, aborting

marc-hb avatar Feb 14 '22 20:02 marc-hb

Something strange happened in the middle of a test on WHL: https://sof-ci.01.org/softestpr/PR862/build985/devicetest/?model=WHL_UPEXT_HDA_ZEPHYR&testcase=check-suspend-resume-with-playback-5

[     2728877.704064] (          14.114583) c0 hda-dma            ..../intel/hda/hda-dma.c:936  INFO hda-dmac :7 -> remove
[     2728973.954060] (          96.249992) c0 dma                           src/lib/dma.c:141  INFO dma_put(), dma = 0x9e06e570, sref = 0
[     2728986.089477] (          12.135416) c0 hda-dai      3.6   /drivers/intel/hda/hda.c:90   INFO hda_remove()
[     2729002.756143] (          16.666666) c0 dai                           src/lib/dai.c:188  INFO dai_put type 3 index 6 new sref 0
[     2729116.558221] (         113.802078) c0 ipc                  src/ipc/ipc3/handler.c:1579 INFO ipc: new cmd 0x30110000
[     2729129.422804] (          12.864583) c0 ipc                  src/ipc/ipc3/handler.c:1361 INFO ipc: comp 62 -> free
[     2729141.089470] (          11.666666) c0 pipe         9.62  ......./pipeline-graph.c:195  INFO pipeline_free()
error: log_entry_address    0x204b0 is not in dictionary range!
warn: Seeking forward 4 bytes at a time until re-synchronize.

Found valid LDC address after skipping 4 bytes (one line uses 20 + 0 to 16 bytes)
error: Invalid filename length 1952539747 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x29007) returned -22
error: fetch_entry() failed with: -22, aborting
Skipped 0 bytes after the last statement.

marc-hb avatar Feb 19 '22 04:02 marc-hb

This does not appear to happen on my TGL board. I have now run test-speaker.sh 730 times (had it running in a loop over night) on Zephyr build of the latest https://github.com/thesofproject/sof main.

jsarha avatar Feb 21 '22 08:02 jsarha

TGL does not seem to reproduce the most frequently and test-speaker neither. Do you have an APL board?

I would also recommend reloading the firmware regularly, just an intuition based on staring at a lot of failures recently. https://github.com/thesofproject/sof-test/tree/main/tools/kmod works great.

EDIT: seen again in daily 10429

Start Time: 2022-02-22 22:27:21 UTC Kernel Branch: topic/sof-dev Kernel Commit: 5b26e39d SOF Branch: main SOF Commit: 58ce6e68608f Zephyr Commit: 8239aea29c16

10429?model=APL_UP2_NOCODEC_ZEPHYR&testcase=multiple-pipeline-all-50 10429?model=TGLH_RVP_NOCODEC_ZEPHYR&testcase=check-xrun-injection-playback-10 10429?model=TGLH_RVP_NOCODEC_ZEPHYR&testcase=multiple-pipeline-capture-50

marc-hb avatar Feb 22 '22 02:02 marc-hb

@jsarha,this issue is still valid in CI, we see this issue almost every day in "CI daily test(Today's test ID:10429)". For TGL platform, maybe you can test with check-xrun-injection.sh.

keqiaozhang avatar Feb 23 '22 03:02 keqiaozhang

Lower priority to P2, as we're going to abandon DMA-based SOF trace and use Zephyr native LOG implementation. Zephyr native LOG implementation will probably fix this problem.

mengdonglin avatar Feb 28 '22 02:02 mengdonglin

Unlike bug severity, priority is never "objectively" defined with specified metrics. Priority can be defined in many different, project-specific ways based on various inputs and preferences. However I don't think I've seen the implementation details of the fix make any priority difference before.

marc-hb avatar Feb 28 '22 17:02 marc-hb

Slightly different failure in daily 10797?model=WHL_UPEXT_HDA_ZEPHYR&testcase=check-capture-50rounds

Still happening at the very start of the test.

 TIMESTAMP         (us)              DELTA  C# COMPONENT          LOCATION                      CONTENT	ktime=1980.913s  @  2022-03-08 23:03:17 +0000 UTC

Found valid LDC address after skipping 8 bytes (one line uses 20 + 0 to 16 bytes)
error: Invalid number of parameters.
error: fetch_entry() failed with: -22, aborting
Skipped 0 bytes after the last statement.

marc-hb avatar Mar 09 '22 18:03 marc-hb

New failure in daily 11126?modelFirmwareType=SOF-Zephyr&model=ADLP_RVP_SDW_ZEPHYR&testcase=multiple-pipeline-capture-50, happening at the very start of the test.

error: Invalid text length.
error: read_entry_from_ldc_file(0x26007) returned -22
error: fetch_entry() failed with: -22, aborting
Skipped 0 bytes after the last statement.

keqiaozhang avatar Mar 17 '22 00:03 keqiaozhang

More DMA trace corruption with Zephyr, this time not at the very start: https://sof-ci.01.org/sofpr/PR5631/build12594/devicetest/

marc-hb avatar Apr 01 '22 06:04 marc-hb

Downgrading this failure to a SKIP because it's too frequent; too much pollution

  • thesofproject/sof-test/pull/895

marc-hb avatar Apr 28 '22 01:04 marc-hb

Downgrading this failure to a SKIP because it's too frequent; too much pollution

This made a huge difference but there are still some failures, typically like this: https://sof-ci.01.org/sofpr/PR5796/build327/devicetest/?model=CML_RVP_SDW_ZEPHYR&testcase=check-sof-logger

error: log_entry_address          0 is not in dictionary range!

There is only so much DMA corruption you can ignore.

Same in https://sof-ci.01.org/sofpr/PR5266/build399/devicetest/?model=CML_RVP_SDW_ZEPHYR&testcase=check-sof-logger

marc-hb avatar May 11 '22 16:05 marc-hb

Still happening: https://sof-ci.01.org/softestpr/PR943/build110/devicetest/?model=ADLP_RVP_SDW_ZEPHYR&testcase=test-speaker

marc-hb avatar Jul 26 '22 04:07 marc-hb

Won't fix cavs tool issue. Will switch to new logging tool.

mengdonglin avatar Sep 06 '22 01:09 mengdonglin