
LLEXT: fix CI failures and make DRC an LLEXT module by default on MTL

Open lyakh opened this issue 1 year ago • 23 comments

Build all the code, supporting LLEXT, modular.

UPDATE: changed to only build DRC as LLEXT on MTL

lyakh avatar May 10 '24 14:05 lyakh

It has been suggested that the CI might just work if we switch any module to llext and build a deployable layout.

@marc-hb ok, no, a very clear indication that it unfortunately doesn't "just work:" https://sof-ci.01.org/sofpr/PR9116/build4627/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=verify-kernel-boot-log so supposedly some work needs to be done...

lyakh avatar May 13 '24 07:05 lyakh

SOFCI TEST

lyakh avatar Jun 05 '24 11:06 lyakh

https://github.com/thesofproject/sof/actions/runs/9387214836/job/25849589149?pr=9116

FAILED: zephyr/smart_amp_test_llext/smart_amp_test.llext D:/a/sof/sof/workspace/build-mtl/zephyr/smart_amp_test_llext/smart_amp_test.llext 
C:\Windows\system32\cmd.exe /C "cd /D D:\a\sof\sof\workspace\build-mtl\zephyr\smart_amp_test_llext && D:\a\sof\sof\zephyr-sdk-0.16.4_windows-x86_64\zephyr-sdk-0.16.4\xtensa-intel_ace15_mtpm_zephyr-elf\bin\xtensa-intel_ace15_mtpm_zephyr-elf-strip.exe  -R.xt.* D:/a/sof/sof/workspace/build-mtl/zephyr/smart_amp_test.llext.pkg_input -oD:/a/sof/sof/workspace/build-mtl/zephyr/smart_amp_test_llext/smart_amp_test.llext  && "C:\Program Files\CMake\bin\cmake.exe" -E true"
D:\a\sof\sof\zephyr-sdk-0.16.4_windows-x86_64\zephyr-sdk-0.16.4\xtensa-intel_ace15_mtpm_zephyr-elf\bin\xtensa-intel_ace15_mtpm_zephyr-elf-strip.exe: 'D:/a/sof/sof/workspace/build-mtl/zephyr/smart_amp_test.llext.pkg_input': No such file

https://sof-ci.01.org/sofpr/PR9116/build5189/build also failed.

marc-hb avatar Jun 05 '24 22:06 marc-hb

https://github.com/thesofproject/sof/actions/runs/9387214836/job/25849589149?pr=9116

FAILED: zephyr/smart_amp_test_llext/smart_amp_test.llext D:/a/sof/sof/workspace/build-mtl/zephyr/smart_amp_test_llext/smart_amp_test.llext 
C:\Windows\system32\cmd.exe /C "cd /D D:\a\sof\sof\workspace\build-mtl\zephyr\smart_amp_test_llext && D:\a\sof\sof\zephyr-sdk-0.16.4_windows-x86_64\zephyr-sdk-0.16.4\xtensa-intel_ace15_mtpm_zephyr-elf\bin\xtensa-intel_ace15_mtpm_zephyr-elf-strip.exe  -R.xt.* D:/a/sof/sof/workspace/build-mtl/zephyr/smart_amp_test.llext.pkg_input -oD:/a/sof/sof/workspace/build-mtl/zephyr/smart_amp_test_llext/smart_amp_test.llext  && "C:\Program Files\CMake\bin\cmake.exe" -E true"
D:\a\sof\sof\zephyr-sdk-0.16.4_windows-x86_64\zephyr-sdk-0.16.4\xtensa-intel_ace15_mtpm_zephyr-elf\bin\xtensa-intel_ace15_mtpm_zephyr-elf-strip.exe: 'D:/a/sof/sof/workspace/build-mtl/zephyr/smart_amp_test.llext.pkg_input': No such file

https://sof-ci.01.org/sofpr/PR9116/build5189/build also failed.

@marc-hb fixed that. Could you or @fredoh9 help check why that's the case? We'd need to compare intermediate results. EDIT: sorry, I meant to ask why the Windows/Linux build comparison is failing now.

lyakh avatar Jun 07 '24 08:06 lyakh

https://github.com/thesofproject/sof/actions/runs/9418758155/job/25947397562?pr=9116

Files linux-build mtl/build-sof-staging/sof/intel/sof-ipc4-lib/mtl/community/smart_amp_test.llext and windows-build mtl/build-sof-staging/sof/intel/sof-ipc4-lib/mtl/community/smart_amp_test.llext differ

Do such files have debug symbols? If yes then they shouldn't be compared, not until they're stripped (TODO)

marc-hb avatar Jun 07 '24 15:06 marc-hb

Do such files have debug symbols? If yes then they shouldn't be compared, not until they're stripped (TODO)

Actually, there's a better temporary solution: turn off modules when testing reproducible builds. Otherwise code in modules is not tested.

marc-hb avatar Jun 08 '24 01:06 marc-hb

Do such files have debug symbols? If yes then they shouldn't be compared, not until they're stripped (TODO)

@marc-hb makes sense, thanks! I'll try adding stripping .comment to Zephyr LLEXT cmake code

lyakh avatar Jun 12 '24 08:06 lyakh

Actually, there's a better temporary solution: turn off modules when testing reproducible builds. Otherwise code in modules is not tested.

@marc-hb not sure I understand - doesn't the failing test mean that modules do get tested?

lyakh avatar Jun 12 '24 09:06 lyakh

@wszypelt QB stuck?

lyakh avatar Jun 18 '24 08:06 lyakh

@lyakh Unfortunately, I'm still trying to solve this issue because more PRs are stuck. I have manually added this one to the queue, so the results should be available within an hour.

wszypelt avatar Jun 18 '24 09:06 wszypelt

@lyakh can you rebase and re-push. Thanks !

lgirdwood avatar Jun 25 '24 15:06 lgirdwood

@lyakh can you rebase and re-push. Thanks !

@lgirdwood It isn't just about rebasing: we're waiting for 2 things to happen: (1) QB support for LLEXT modules @wszypelt , and (2) a solution on how to resolve the failing Linux-Windows comparison @marc-hb https://github.com/thesofproject/sof/pull/9116#issuecomment-2162490916

lyakh avatar Jun 26 '24 06:06 lyakh

@lyakh can you rebase and re-push. Thanks !

@lgirdwood It isn't just about rebasing: we're waiting for 2 things to happen: (1) QB support for LLEXT modules @wszypelt , and (2) a solution on how to resolve the failing Linux-Windows comparison @marc-hb #9116 (comment)

Ok, let's disable the Windows/Linux comparison here as we know the toolchain has some opens around building shared objects/libraries.

@wszypelt is there an ETA for when internal CI could support this build target ? Fwiw, @mwasko and I were discussing today. My preference would be to have this build option testable by all CIs for best coverage.

lgirdwood avatar Jun 26 '24 13:06 lgirdwood

Ok, let's disable the Windows/Linux comparison here as we know the toolchain has some opens around building shared objects/libraries.

@lgirdwood @marc-hb we could diff --exclude=*.llext "for now"
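For illustration only, the exclusion idea can be sketched in Python as a rough equivalent of `diff -r --exclude='*.llext'`. This is not the script the CI uses; the function name `differing_files` and its parameters are hypothetical, and unlike `diff` the pattern is matched against file names only, not directory names.

```python
import filecmp
import fnmatch
import os


def differing_files(dir_a, dir_b, exclude=("*.llext",)):
    """Return sorted relative paths that differ between two trees.

    Files whose base name matches a pattern in `exclude` are ignored
    (hypothetical helper, loosely modelled on `diff -r --exclude`).
    """
    def listing(root):
        found = set()
        for current, _dirs, files in os.walk(root):
            for name in files:
                if any(fnmatch.fnmatch(name, pat) for pat in exclude):
                    continue
                full = os.path.join(current, name)
                found.add(os.path.relpath(full, root))
        return found

    files_a, files_b = listing(dir_a), listing(dir_b)
    # A file present on only one side counts as a difference.
    different = files_a ^ files_b
    for rel in files_a & files_b:
        # shallow=False forces a byte-by-byte content comparison.
        if not filecmp.cmp(os.path.join(dir_a, rel),
                           os.path.join(dir_b, rel), shallow=False):
            different.add(rel)
    return sorted(different)
```

Calling it on the Linux and Windows staging directories would report an empty list when only the `.llext` files differ, which is the "for now" behaviour proposed here. The trade-off is that differences inside the modules go unnoticed until they are compared again.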

lyakh avatar Jun 26 '24 15:06 lyakh

Ok, let's disable the Windows/Linux comparison here as we know the toolchain has some opens around building shared objects/libraries.

@lgirdwood @marc-hb we could diff --exclude=*.llext "for now"

Yep, whatever is least effort.

lgirdwood avatar Jun 26 '24 15:06 lgirdwood

@lyakh please try to add CONFIG_LIBRARY_DEFAULT_MODULAR=n to repro-build.conf after thesofproject/sof#9264 is merged.

marc-hb avatar Jun 26 '24 22:06 marc-hb

@lyakh can you rebase and re-push. Thanks !

@lgirdwood It isn't just about rebasing: we're waiting for 2 things to happen: (1) QB support for LLEXT modules @wszypelt , and (2) a solution on how to resolve the failing Linux-Windows comparison @marc-hb #9116 (comment)

Ok, let's disable the Windows/Linux comparison here as we know the toolchain has some opens around building shared objects/libraries.

@wszypelt is there an ETA for when internal CI could support this build target ? Fwiw, @mwasko and I were discussing today. My preference would be to have this build option testable by all CIs for best coverage.

@lgirdwood I talked to the developer: there is already a solution, but we still have some problems with it. I honestly believe that everything will work by Monday.

wszypelt avatar Jun 27 '24 08:06 wszypelt

The Linux-Windows comparison is fixed now. Next, we are waiting for the MTL regression to be fixed and for a QB integration.

lyakh avatar Jul 02 '24 09:07 lyakh

The LIBRARY_DEFAULT_MODULAR opt-in program looks like a complex Kconfig hack. It adds multiple levels of defaults, it is pretty verbose (need to edit the Kconfig of each component) and the only thing it seems to achieve is to avoid a .conf overlay with a list of modules. Why not just do such an overlay? Keep it simple.

@marc-hb we already have such overlays, but I thought that overlays in default build configurations were frowned upon?

lyakh avatar Jul 03 '24 06:07 lyakh

sof-ipc4-lib/ is empty in https://github.com/thesofproject/sof/actions/runs/9776233023/job/26988310949?pr=9116, is that expected?

marc-hb avatar Jul 03 '24 21:07 marc-hb

sof-ipc4-lib/ is empty in https://github.com/thesofproject/sof/actions/runs/9776233023/job/26988310949?pr=9116, is that expected?

not sure why that one is empty, but I see the DRC module being loaded on MTL HDA https://sof-ci.01.org/sofpr/PR9116/build6159/devicetest/index.html?model=MTLP_RVP_HDA&testcase=verify-sof-firmware-load:

[    5.010647] kernel: snd_sof:sof_ipc4_fw_parse_ext_man: sof-audio-pci-intel-mtl 0000:00:1f.3: module DRC: UUID B36EE4DA-006F-47F9-A06D-FECBE2D8B6CE cfg_count: 1, bss_size: 0x1000
[    5.010672] kernel: snd_sof_intel_hda_common:hda_dsp_stream_hw_params: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x1c0]=0x40000 successful
[    5.010704] kernel: snd_sof_intel_hda_common:hda_dsp_stream_hw_params: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x1c0]=0x40000 successful
[    5.010711] kernel: snd_sof_intel_hda_common:hda_dsp_stream_setup_bdl: sof-audio-pci-intel-mtl 0000:00:1f.3: period_bytes:0x0
[    5.010713] kernel: snd_sof_intel_hda_common:hda_dsp_stream_setup_bdl: sof-audio-pci-intel-mtl 0000:00:1f.3: periods:1
[    5.010725] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx      : 0x19000000|0x0: GLB_LOAD_LIBRARY_PREPARE
[    5.011798] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx reply: 0x39000000|0x0: GLB_LOAD_LIBRARY_PREPARE
[    5.011822] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx done : 0x19000000|0x0: GLB_LOAD_LIBRARY_PREPARE
[    5.011830] kernel: snd_sof_intel_hda_common:hda_dsp_ipc4_load_library: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x1d0]=0x409800 successful
[    5.011838] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx      : 0x18010000|0x0: GLB_LOAD_LIBRARY
[    5.031498] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx reply: 0x38000000|0x0: GLB_LOAD_LIBRARY
[    5.031510] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx done : 0x18010000|0x0: GLB_LOAD_LIBRARY

but yes, strange that it wasn't built in that test
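As a rough aid for reading logs like the one above, the following Python sketch pulls out the `GLB_LOAD_LIBRARY*` IPC lines printed by `sof_ipc4_log_header`. The regular expression is an assumption based only on the lines quoted here, and the names `IPC_LINE` and `library_load_events` are made up for this example.

```python
import re

# Matches lines such as:
# "[    5.011838] kernel: ... ipc tx      : 0x18010000|0x0: GLB_LOAD_LIBRARY"
# The layout is inferred from the log excerpt in this thread only.
IPC_LINE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\].*?ipc (?P<kind>tx reply|tx done|tx)\s*:"
    r"\s*(?P<header>0x[0-9a-f]+\|0x[0-9a-f]+): (?P<name>\w+)$"
)


def library_load_events(log_text):
    """Return (timestamp, kind, message name) for library-load IPCs."""
    events = []
    for line in log_text.splitlines():
        match = IPC_LINE.match(line.strip())
        if match and match.group("name").startswith("GLB_LOAD_LIBRARY"):
            events.append((float(match.group("ts")),
                           match.group("kind"),
                           match.group("name")))
    return events
```

If the list contains a `tx`, `tx reply` and `tx done` entry for `GLB_LOAD_LIBRARY`, the host at least completed the load request; it says nothing about whether the module was built by a particular CI job.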

lyakh avatar Jul 04 '24 08:07 lyakh

The rimage.path hack needs at least a minimum of explanation somewhere. Best is probably a new bug. Right now the hack has zero comment and is not even mentioned in any commit message.

@marc-hb here you go: https://github.com/thesofproject/sof/actions/runs/9791262388/job/27034644220?pr=9281

In dir: D:\a\sof\sof\workspace; running command:
    ''"'"'D:\a\sof\sof\workspace\build-rimage\rimage.EXE'"'"'' -o 'D:\a\sof\sof\workspace\build-mtl\zephyr\eq_iir_llext\eq_iir.llext.ri' -e -c 'D:\a\sof\sof\workspace\build-mtl\zephyr\eq_iir_llext\rimage_config.toml' -k 'D:\a\sof\sof\workspace\sof\keys\otc_private_key_3k.pem' -l -r 'D:\a\sof\sof\workspace\build-mtl\zephyr\eq_iir_llext\eq_iir.llext'

but I'd rather just fix it than file a bug only to have a reference for it.

lyakh avatar Jul 04 '24 08:07 lyakh

sof-ipc4-lib/ is empty in https://github.com/thesofproject/sof/actions/runs/9776233023/job/26988310949?pr=9116, is that expected?

@marc-hb I know why - all those builds use --overlay=sof/app/overlays/repro-build.conf and that one disables CONFIG_MODULES
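A quick way to check what an overlay switches off is to parse it as a plain Kconfig fragment. The Python sketch below does that; `parse_conf` and `is_disabled` are hypothetical helpers, and the exact contents of `repro-build.conf` are not reproduced in this thread, so any sample text fed to it is an assumption.

```python
def parse_conf(text):
    """Parse Kconfig fragment text into a {symbol: value} dict."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # "# CONFIG_FOO is not set" is the Kconfig spelling of FOO=n.
        if line.startswith("# CONFIG_") and line.endswith(" is not set"):
            settings[line[2:-len(" is not set")]] = "n"
        elif line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def is_disabled(overlay_text, symbol):
    """True if the fragment explicitly sets `symbol` to n."""
    return parse_conf(overlay_text).get(symbol) == "n"
```

For example, `is_disabled(open("sof/app/overlays/repro-build.conf").read(), "CONFIG_MODULES")` would be expected to return `True` if the overlay disables modules as described above. A symbol that is simply absent from the overlay is reported as not disabled, since its value then comes from elsewhere in the configuration.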

lyakh avatar Jul 04 '24 12:07 lyakh

CI:

  1. coding style: false positives for missing Kconfig "help" (it's present) and requiring parentheses in a UUID macro definition (@andyross) which would break it
  2. QB: need to clarify @wszypelt
  3. main-ace jenkins: this is the important one. And I think it's good now. The failures:
     3.1. https://sof-ci.01.org/sofpr/PR9116/build6351/devicetest/index.html?model=MTLP_RVP_HDA&testcase=multiple-pause-resume-50 seems to be https://github.com/thesofproject/linux/issues/5048 although on MTL
     3.2. sof-logger failed on all 3 platforms (HDA, SDW, nocodec), e.g. https://sof-ci.01.org/sofpr/PR9116/build6351/devicetest/index.html?model=MTLP_RVP_HDA&testcase=check-sof-logger is thesofproject/sof-test#1216 - also failed on LNL and TGL
  4. multiple LNL SDW failures https://sof-ci.01.org/sofpr/PR9116/build6350/devicetest/index.html must be unrelated, also seen e.g. in thesofproject/sof#9287 https://sof-ci.01.org/sofpr/PR9287/build6345/devicetest/index.html

lyakh avatar Jul 08 '24 15:07 lyakh

+1, all comments addressed. Based on DRC changes it looks nice to use, just make sure CI works

@abonislawski thanks, yes, we're looking into QB failures ATM

lyakh avatar Jul 09 '24 08:07 lyakh

@lyakh @abonislawski QB Internal CI now works correctly, DRC on MTL is checked, and all our tests in Internal Intel CI are green

wszypelt avatar Jul 10 '24 08:07 wszypelt

@lyakh @abonislawski QB Internal CI now works correctly, DRC on MTL is checked, and all our tests in Internal Intel CI are green

great! Thanks a lot @wszypelt !

lyakh avatar Jul 10 '24 08:07 lyakh

Hmm, @lyakh can you check this https://sof-ci.01.org/sofpr/PR9116/build6415/devicetest/index.html?model=MTLP_RVP_HDA&testcase=multiple-pause-resume-50

[ 1032.371199] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc timed out for 0x13010004|0x0
[ 1032.371224] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump start ]------------
[ 1032.371236] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: Host IPC initiator: 0x93010004|0x0|0x0, target: 0x33000000|0x0|0x0, ctl: 0x3

kv2019i avatar Jul 10 '24 13:07 kv2019i

Hmm, @lyakh can you check this https://sof-ci.01.org/sofpr/PR9116/build6415/devicetest/index.html?model=MTLP_RVP_HDA&testcase=multiple-pause-resume-50

[ 1032.371199] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc timed out for 0x13010004|0x0
[ 1032.371224] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump start ]------------
[ 1032.371236] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: Host IPC initiator: 0x93010004|0x0|0x0, target: 0x33000000|0x0|0x0, ctl: 0x3

@kv2019i I thought it was the same as https://github.com/thesofproject/linux/issues/5048 but (1) this one is on MTL, the other one is on LNL, and (2) this one seems to happen consistently with this PR while the LNL bug is rather rare? Is my understanding correct?

lyakh avatar Jul 10 '24 13:07 lyakh

@lyakh wrote:

@kv2019i I thought it was the same as thesofproject/linux#5048 but (1) this one is on MTL, the other one is on LNL, and (2) this one seems to happen consistently with this PR while the LNL bug is rather rare? Is my understanding correct?

It's the same test but at least the most recent failure case for this PR seems to have an IPC timeout. The known LNL fail looks like this https://sof-ci.01.org/sofpr/PR9235/build5580/devicetest/index.html?model=LNLM_RVP_HDA&testcase=multiple-pause-resume-50 -- user-space getting error status but no errors really in kernel/fw logs.

kv2019i avatar Jul 10 '24 14:07 kv2019i