sof
sof copied to clipboard
[BUG] [ZEPHYR] sof-logger dead: Invalid filename length 1650680879 or ldc file does not match firmware
Rarely, the sof-logger dies like this at the very start of a test: https://sof-ci.01.org/sofpr/PR5340/build11976/devicetest/?model=TGLU_UP_HDA_ZEPHYR&testcase=test-speaker
error: Invalid filename length 1650680879 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x25007) returned -22
error: fetch_entry() failed with: -22, aborting
This happens in about 1 out 50 test runs.
The dictionary is valid because other tests immediately before and after are fine in the exact same configuration.
While the "sof-logger was already dead" error message is in the end the same as #5120 (which was hiding this bug), the symptoms are extremely different.
This caused a number of failures in recent daily runs 10079, 10105 and 10146. It seems Zephyr specific. It's not clear why it's even more rare in PR testing.
10079?model=TGLU_UP_HDA_ZEPHYR&testcase=volume-basic-test-50 10079?model=WHL_UPEXT_HDA_ZEPHYR&testcase=multiple-pipeline-capture-50
10105?model=APL_UP2_NOCODEC_ZEPHYR&testcase=multiple-pipeline-playback-50
10146?model=APL_UP2_NOCODEC_ZEPHYR&testcase=check-capture-50rounds 10146?model=APL_UP2_NOCODEC_ZEPHYR&testcase=multiple-pause-resume-50 10146?model=TGLH_RVP_NOCODEC_ZEPHYR&testcase=check-xrun-injection-playback-10
Daily run 10079:
Start Time: 2022-02-09 22:27:36 UTC Kernel Branch: topic/sof-dev Kernel Commit: 8bf95cb0 SOF Branch: main SOF Commit: 9983ab67a9fe Zephyr Commit: 41a63f48fdc8
DMA alignment fix c11562b00a29 for #5120 was between 10079 and 10105
Daily run 10105 was:
Start Time: 2022-02-10 22:27:29 UTC Kernel Branch: topic/sof-dev Kernel Commit: 936898a6 SOF Branch: main SOF Commit: 443b21de4b3f Zephyr Commit: 5094a6e08cee
Daily run 10146 was:
Kernel Branch: topic/sof-dev Kernel Commit: 98119478 SOF Branch: main SOF Commit: b8954754f055 Zephyr Commit: v3.0.0-rc3
P1 because it affects PR and daily testing.
@kv2019i could this be related to https://github.com/thesofproject/linux/issues/3275
@jsarha any clue now that you're an expert after fixing #5120?
1650680879 == 0x6263642F == "bcd/" So it looks like we are reading a header? as length.
I don't think so, the values seem fairly random. Here's a sample of a few different failures:
error: Invalid filename length 12320 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x35008) returned -22
error: fetch_entry() failed with: -22, aborting
error: Invalid filename length 1650680879 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x25007) returned -22
error: fetch_entry() failed with: -22, aborting
error: Invalid filename length 1601203553 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x22006) returned -22
error: fetch_entry() failed with: -22, aborting
Skipped 0 bytes after the last statement.
error: Invalid filename length 1601463667 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x22006) returned -22
error: fetch_entry() failed with: -22, aborting
error: Invalid filename length 7471104 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x23006) returned -22
error: fetch_entry() failed with: -22, aborting
Something strange happened in the middle of a test on WHL: https://sof-ci.01.org/softestpr/PR862/build985/devicetest/?model=WHL_UPEXT_HDA_ZEPHYR&testcase=check-suspend-resume-with-playback-5
[ 2728877.704064] ( 14.114583) c0 hda-dma ..../intel/hda/hda-dma.c:936 INFO hda-dmac :7 -> remove
[ 2728973.954060] ( 96.249992) c0 dma src/lib/dma.c:141 INFO dma_put(), dma = 0x9e06e570, sref = 0
[ 2728986.089477] ( 12.135416) c0 hda-dai 3.6 /drivers/intel/hda/hda.c:90 INFO hda_remove()
[ 2729002.756143] ( 16.666666) c0 dai src/lib/dai.c:188 INFO dai_put type 3 index 6 new sref 0
[ 2729116.558221] ( 113.802078) c0 ipc src/ipc/ipc3/handler.c:1579 INFO ipc: new cmd 0x30110000
[ 2729129.422804] ( 12.864583) c0 ipc src/ipc/ipc3/handler.c:1361 INFO ipc: comp 62 -> free
[ 2729141.089470] ( 11.666666) c0 pipe 9.62 ......./pipeline-graph.c:195 INFO pipeline_free()
error: log_entry_address 0x204b0 is not in dictionary range!
warn: Seeking forward 4 bytes at a time until re-synchronize.
Found valid LDC address after skipping 4 bytes (one line uses 20 + 0 to 16 bytes)
error: Invalid filename length 1952539747 or ldc file does not match firmware
error: read_entry_from_ldc_file(0x29007) returned -22
error: fetch_entry() failed with: -22, aborting
Skipped 0 bytes after the last statement.
This does not appear to happen on my TGL board. I have now run test-speaker.sh 730 times (had it running in a loop over night) on Zephyr build of the latest https://github.com/thesofproject/sof main.
TGL does not seem to reproduce the most frequently and test-speaker neither. Do you have an APL board?
I would also recommend reloading the firmware regularly, just an intuition based on staring at a lot of failures recently. https://github.com/thesofproject/sof-test/tree/main/tools/kmod works great.
EDIT: seen again in daily 10429
Start Time: 2022-02-22 22:27:21 UTC Kernel Branch: topic/sof-dev Kernel Commit: 5b26e39d SOF Branch: main SOF Commit: 58ce6e68608f Zephyr Commit: 8239aea29c16
10429?model=APL_UP2_NOCODEC_ZEPHYR&testcase=multiple-pipeline-all-50 10429?model=TGLH_RVP_NOCODEC_ZEPHYR&testcase=check-xrun-injection-playback-10 10429?model=TGLH_RVP_NOCODEC_ZEPHYR&testcase=multiple-pipeline-capture-50
@jsarha,this issue is still valid in CI, we see this issue almost every day in "CI daily test(Today's test ID:10429)". For TGL platform, maybe you can test with check-xrun-injection.sh.
Lower priority to P2, as we're going to abandon DMA-based SOF trace and use Zephyr native LOG implementation. Zephyr native LOG implementation will probably fix this problem.
Unlike bug severity, priority is never "objectively" defined with specified metrics. Priority can be defined in many different, project-specific ways based on various inputs and preferences. However I don't think I've seen the implementation details of the fix make any priority difference before.
Slightly different failure in daily 10797?model=WHL_UPEXT_HDA_ZEPHYR&testcase=check-capture-50rounds
Still happening at the very start of the test.
TIMESTAMP (us) DELTA C# COMPONENT LOCATION CONTENT ktime=1980.913s @ 2022-03-08 23:03:17 +0000 UTC
Found valid LDC address after skipping 8 bytes (one line uses 20 + 0 to 16 bytes)
error: Invalid number of parameters.
error: fetch_entry() failed with: -22, aborting
Skipped 0 bytes after the last statement.
New failure in daily 11126?modelFirmwareType=SOF-Zephyr&model=ADLP_RVP_SDW_ZEPHYR&testcase=multiple-pipeline-capture-50, happening at the very start of the test.
error: Invalid text length.
error: read_entry_from_ldc_file(0x26007) returned -22
error: fetch_entry() failed with: -22, aborting
Skipped 0 bytes after the last statement.
More DMA trace corruption with Zephyr, this time not at the very start: https://sof-ci.01.org/sofpr/PR5631/build12594/devicetest/
Downgrading this failure to a SKIP because it's too frequent; too much pollution
- thesofproject/sof-test/pull/895
Downgrading this failure to a SKIP because it's too frequent; too much pollution
This made a huge difference but there are still some failures, typically like this: https://sof-ci.01.org/sofpr/PR5796/build327/devicetest/?model=CML_RVP_SDW_ZEPHYR&testcase=check-sof-logger
error: log_entry_address 0 is not in dictionary range!
There is only so much DMA corruption you can ignore.
Same in https://sof-ci.01.org/sofpr/PR5266/build399/devicetest/?model=CML_RVP_SDW_ZEPHYR&testcase=check-sof-logger
Still happening: https://sof-ci.01.org/softestpr/PR943/build110/devicetest/?model=ADLP_RVP_SDW_ZEPHYR&testcase=test-speaker
Won't fix cavs tool issue. Will switch to new logging tool.