sof icon indicating copy to clipboard operation
sof copied to clipboard

pipeline: register KCPS consumption and enable dynamic clock control

Open abonislawski opened this issue 2 years ago • 30 comments

This will enable dynamic clock control based on KCPS budget.

Mostly cherry-picked from mtl-005-drop-stable + kconfig

pipeline: Register and unregister pipelines CPS consumption on run/pause +Squashed commits: https://github.com/thesofproject/sof/commit/924f0d26db2cc68c819185dd42cd491de72f8b5d pipeline: print warning if 0 KCPS received

https://github.com/thesofproject/sof/commit/7bb6e7ef3ba93d6406226a5fbe8476f0dc1098c7 pipeline: register CPS consumption for a proper core

https://github.com/thesofproject/sof/commit/5f0c4e4c3160252d3794feec7d1ac1303f5f9cd1 pipeline: make CPC data "opt-in" with fallback

platform: register basefw CPS consumption on boot

+Squashed commit: https://github.com/thesofproject/sof/commit/38b4cb1475fb3f3828cbd7a117ec020ae2f57a12 platform: mtl: lowest clock as default

boards: mtl: enable KCPS dynamic clock

abonislawski avatar Aug 09 '23 07:08 abonislawski

I think this fixes some reported issues.

Unfortunately not, I will post more info in issue thread

abonislawski avatar Aug 17 '23 09:08 abonislawski

@abonislawski , could you list here how you get each component cpc data? by below ways?

  1. cpc = avg.
  2. cpc = avg * 110%
  3. cpc = avg * 150%?
  4. cpc = (avg+peak) / 2

btian1 avatar Aug 31 '23 01:08 btian1

I think this fixes some reported issues.

Unfortunately not, I will post more info in issue thread

Good now ?

lgirdwood avatar Sep 14 '23 15:09 lgirdwood

@abonislawski should we upstream with Kconfig = n until we can give full validation cycle. i.e. that would unblock the code merging and would allow more testing without impacting other users.

lgirdwood avatar Sep 25 '23 15:09 lgirdwood

@lgirdwood @abonislawski , I have a comment on clock adjust: it should only need be enabled on release build, and keep debug build as always MAX clock.

Debug build no need to perform clock adjust!

Also, Our CI performance and CPC data is based on release build, please take care on this.

No, debug build should also use CPC if selected in Kconfig (but we may need a Kconfig debug option to increase teh CPC calculation since we will be printing more logs). i.e. Kconfig CONFIG_DEBUG_CPC_ADJUST which will add X to each CPC to factor in logging costs.

lgirdwood avatar Oct 17 '23 15:10 lgirdwood

@lgirdwood @abonislawski , I have a comment on clock adjust: it should only need be enabled on release build, and keep debug build as always MAX clock. Debug build no need to perform clock adjust! Also, Our CI performance and CPC data is based on release build, please take care on this.

No, debug build should also use CPC if selected in Kconfig (but we may need a Kconfig debug option to increase teh CPC calculation since we will be printing more logs). i.e. Kconfig CONFIG_DEBUG_CPC_ADJUST which will add X to each CPC to factor in logging costs.

debug normally = release * 120%, question is: what's the point to perform clock adjust in debug build?

btian1 avatar Oct 18 '23 02:10 btian1

Im back to this and fixing review comments now

abonislawski avatar Oct 18 '23 07:10 abonislawski

Another benefit for remove clock control for debug build: whenever glitch or discontinue issue happened, we can have a try on debug build directly, if no issue, it is CPC caused clock underrun issue, no need to manually disable clock control for check.

btian1 avatar Oct 25 '23 02:10 btian1

Lot of RED on the CI, will rerun to confirm. Could be we are missing updated CPC values for some modules though.

lgirdwood avatar Oct 25 '23 15:10 lgirdwood

SOFCI TEST

lgirdwood avatar Oct 25 '23 15:10 lgirdwood

My comments have been addressed, so marking this as good to go.

@btian1 comment w.r.t. how we handle debug build with asserts is an interesting one. There is no question asserts will slow down the code, so the CPC budget should be adjusted. I think one option would be to disable budget based DSP speed adjustements in debug builds (or some new mechanism to add a +10% to budgets if doing a debug build). Both options have their own pro/cons.

The reason we do clock adjust is we want to save some power for customer build, now customer build only use release build, if debug build also need apply clock adjust, it is hard to control the CPC, 10% margin 20% margin? we don't know this exactly, and for some corner cases, it maybe 50%, it may bring extra issues that we can't expect before.

btian1 avatar Oct 27 '23 02:10 btian1

@btian1 wrote:

@kv2019i wrote:

@btian1 comment w.r.t. how we handle debug build with asserts is an interesting one. There is no question asserts will slow down the code, so the CPC budget should be adjusted. I think one option would be to disable budget based DSP speed adjustements in debug builds (or some new mechanism to add a +10% to budgets if doing a debug build). Both options have their own pro/cons. The reason we do clock adjust is we want to save some power for customer build, now customer build only use release build, if debug build also need apply clock adjust, it is hard to control the CPC, 10% margin 20% margin? we don't know this exactly, and for some corner cases, it maybe 50%, it may bring extra issues that we can't expect before.

That leaves then only possibility to disable CPC based adjustments in debug builds. This is not ideal as the delta between debug and release builds grow -- we want to use debug builds in CI as e.g. asserts are very useful to find programming mistakes in submitted PRs. Not sure if we have capacity to test both release and debug builds (FYI @keqiaozhang @mengdonglin @marc-hb @fredoh9 and @lyakh @ujfalusi with whom this was discussed recently ).

kv2019i avatar Oct 27 '23 12:10 kv2019i

@abonislawski I do think there is something else wrong still with this PR. We have a lot of fails in the MTL tests. The logs are not published to 01.org now, but errors look like:

[  202.698774] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx      : 0x13000003|0x1: GLB_SET_PIPELINE_STATE [data size: 12]
[  203.201276] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc timed out for 0x13000003|0x1

FW side:

[   10.146323] <inf> ipc: ipc_cmd: rx	: 0x13000003|0x1
[   10.146340] <inf> pipe: pipeline_trigger: pipe:0 0x0 pipe trigger cmd 2
[   10.146346] <inf> pipe: remove_pipeline_cps_consumption: Unregistering module: 0x4 KCPS consumption: 7116, core: 0
[   10.146351] <inf> pipe: remove_pipeline_cps_consumption: Unregistering module: 0x10 KCPS consumption: 20663, core: 0
[   10.146358] <inf> pipe: remove_pipeline_cps_consumption: Unregistering module: 0x10004 KCPS consumption: 7116, core: 0
[   10.146381] <inf> clock: clock_set_freq: clock 0 set freq 23024000Hz freq_idx 0
[   10.146426] <inf> clock: clock_set_freq: clock 1 set freq 23024000Hz freq_idx 0
[   10.146510] <inf> clock: clock_set_freq: clock 2 set freq 23024000Hz freq_idx 0
[   10.373946] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 9796, max 42745, overruns 39```

So clocks are reduced quite a lot and this leads to serious amount of overruns (39 above -> during one 1sec window).

One option to try is to disable the CPC based adjustments in app/debug_overlay.conf . This at least tells whether the problem is CPC budgets not matching actual performance when debug is enabled (and it is enabled in PR tests!) or is this caused by something else.

kv2019i avatar Oct 27 '23 12:10 kv2019i

if debug build also need apply clock adjust, it is hard to control the CPC, 10% margin 20% margin? we don't know this exactly, and for some corner cases, it maybe 50%, it may bring extra issues that we can't expect before.

Good point but I think running the same code paths by default is much more important. The ONLY difference between debug and release builds should be asserts, logging, timings and performance.

If 50% is not enough then there is too much logging or you just found a performance bug in logging. Both are good to find.

(Of course you should be able to easily turn off clock control when debugging interactively; this discussion is only about the defaults used in validation)

You could argue that the dynamic clock control code makes only timing differences and does not affect the rest of the code more than logging. I disagree for two reasons:

  • Clock control is much more complex than logging hence much more likely to have bugs that affect the rest of the code
  • Clock control code needs to be TESTED in debug builds too!

Signing is a similar problem: every "developer mode" in every product I have ever seen is implemented using a special insecure keys. All of them keep running the SAME CODE in developer mode. Because you do not want to wait release time to start exercising signing and authentication code; that's too late (and even when running the same signing code we STILL find last minute bugs like #8281 and https://github.com/thesofproject/rimage/pull/123)

marc-hb avatar Oct 27 '23 16:10 marc-hb

@abonislawski should we upstream with Kconfig = n until we can give full validation cycle. i.e. that would unblock the code merging and would allow more testing without impacting other users.

For the usual continuous integration reasons, @abonislawski please turn the feature OFF at the last minute once you get a mostly clean run, all review comments addressed and the PR approved.

After a few days submit another, one-line PR that just turns the feature ON by default.

These two separate steps will give us the necessary perspective in daily test results (which stress the code much more than PR tests)

marc-hb avatar Oct 27 '23 16:10 marc-hb

@btian1 wrote:

@kv2019i wrote:

@btian1 comment w.r.t. how we handle debug build with asserts is an interesting one. There is no question asserts will slow down the code, so the CPC budget should be adjusted. I think one option would be to disable budget based DSP speed adjustements in debug builds (or some new mechanism to add a +10% to budgets if doing a debug build). Both options have their own pro/cons. The reason we do clock adjust is we want to save some power for customer build, now customer build only use release build, if debug build also need apply clock adjust, it is hard to control the CPC, 10% margin 20% margin? we don't know this exactly, and for some corner cases, it maybe 50%, it may bring extra issues that we can't expect before.

That leaves then only possibility to disable CPC based adjustments in debug builds. This is not ideal as the delta between debug and release builds grow -- we want to use debug builds in CI as e.g. asserts are very useful to find programming mistakes in submitted PRs. Not sure if we have capacity to test both release and debug builds (FYI @keqiaozhang @mengdonglin @marc-hb @fredoh9 and @lyakh @ujfalusi with whom this was discussed recently ).

yes, this is a middle way, @abonislawski , could you add a patch to disable clock control in performance build? Add CONFIG_KCPS_DYNAMIC_CLOCK_CONTROL=n to app/perf_overlay.conf

btian1 avatar Oct 31 '23 03:10 btian1

@abonislawski should we upstream with Kconfig = n until we can give full validation cycle. i.e. that would unblock the code merging and would allow more testing without impacting other users.

For the usual continuous integration reasons, @abonislawski please turn the feature OFF at the last minute once you get a mostly clean run, all review comments addressed and the PR approved.

After a few days submit another, one-line PR that just turns the feature ON by default.

These two separate steps will give us the necessary perspective in daily test results (which stress the code much more than PR tests)

Ack - this will help mimize impact or help bisect.

lgirdwood avatar Oct 31 '23 15:10 lgirdwood

@marcinszkudlinski btw, any thoughts on measuring DP workloads in terms of MCPS ? It seems like we need to add timer logic to schedule() and yield() in Zephyr.

lgirdwood avatar Nov 10 '23 16:11 lgirdwood

@abonislawski will the latest update align with @marcinszkudlinski DP work ?

lgirdwood avatar Nov 23 '23 16:11 lgirdwood

@lgirdwood aligned with DP work in the latest push

abonislawski avatar Dec 12 '23 09:12 abonislawski

@lgirdwood aligned with DP work in the latest push

Great - @marcinszkudlinski good for you ? if so, pls approve.

lgirdwood avatar Dec 18 '23 16:12 lgirdwood

Build is failing:

/srv/home/jenkins/workspace/sof_config_build@2/sof/src/audio/pipeline/pipeline-stream.c:415:17: error: use of undeclared identifier 'data'

https://sof-ci.01.org/sofpr/PR8019/build1715/build/firmware_community/mtl.log

kv2019i avatar Jan 10 '24 11:01 kv2019i

@marcinszkudlinski can you re-review and approve if good now for you . Thanks !

lgirdwood avatar Jan 10 '24 12:01 lgirdwood

Build failures in https://github.com/thesofproject/sof/actions/runs/7667629748/job/20897861756?pr=8019 and https://github.com/thesofproject/sof/actions/runs/7667629757/job/20897863643?pr=8019

I suspect you are just missing some #ifdef.

The errors are difficult to find until .github fixes #8807 and #8799 are merged: please help review.

marc-hb avatar Jan 27 '24 01:01 marc-hb

MTL hardware tests are almost all failed in https://sof-ci.01.org/sofpr/PR8019/build2360/devicetest/index.html

While test results have not been great recently. I've never seen so much red recently so it is most likely caused by this PR (currently at 989bda8bcbaa9de66c60ea82be8ceabf1b1bbe03)

marc-hb avatar Jan 27 '24 01:01 marc-hb

All MTL tests still fail: https://sof-ci.01.org/sofpr/PR8019/build2392/devicetest/index.html

There also a lot of build failures: https://github.com/thesofproject/sof/actions/runs/7693205483/job/20961497200?pr=8019 https://github.com/thesofproject/sof/actions/runs/7693205482/job/20961494915?pr=8019 https://github.com/thesofproject/sof/actions/runs/7693205487/job/20961494936?pr=8019

marc-hb avatar Jan 30 '24 17:01 marc-hb

@abonislawski re the CI test failures, are we missing anything on the kernel side. i.e. does kernel need to do any IP programming flow around DSP clock change ? Fwiw, I know we have some loops that spin around a clock register bit change, could some be timing out or spinning forever ?

[  407.529254] kernel: snd_sof:sof_pcm_trigger: sof-audio-pci-intel-mtl 0000:00:1f.3: pcm: trigger stream 6 dir 1 cmd 0
[  407.529276] kernel: snd_sof:sof_ipc4_trigger_pipelines: sof-audio-pci-intel-mtl 0000:00:1f.3: trigger cmd: 0 state: 3
[  407.529297] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx      : 0x13000003|0x1: GLB_SET_PIPELINE_STATE [data size: 12]
[  408.031809] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc timed out for 0x13000003|0x1
[  408.031838] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: Attempting to prevent DSP from entering D3 state to preserve context
[  408.031847] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump start ]------------
[  408.031879] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: Host IPC initiator: 0x93000003|0x1|0x0, target: 0x1b060000|0x0|0x0, ctl: 0x3
[  408.031887] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump end ]------------
[  408.031894] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump start ]------------
[  408.031899] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: IPC timeout
[  408.031904] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state: SOF_FW_BOOT_COMPLETE (7)
[  408.031930] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: 0x50000005: module: ROM_EXT, state: FW_ENTERED, running
[  408.031942] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: Firmware state: 0x0, status/error code: 0x0
[  408.031970] kernel: snd_sof:sof_ipc4_find_debug_slot_offset_by_type: sof-audio-pci-intel-mtl 0000:00:1f.3: Slot type 0x4c455400 is not available in debug window
[  408.031975] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump end ]------------
[  408.032005] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: failed to pause all pipelines
[  408.032017] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ASoC: error at soc_component_trigger on 0000:00:1f.3: -110
[  408.032032] kernel:  DMIC Raw: ASoC: trigger FE cmd: 0 failed: -110

lgirdwood avatar Jan 31 '24 14:01 lgirdwood

@lgirdwood We shouldnt need anything from kernel side as this is already used on MTL branches. It seems that everything crashes on debug build (default for PR testing in linux CI) and works on release build (default for PR testing on second internal CI), I need to investigate why

Fwiw, I know we have some loops that spin around a clock register bit change, could some be timing out or spinning forever ?

I will check that part

abonislawski avatar Jan 31 '24 18:01 abonislawski

@abonislawski wrote:

@lgirdwood We shouldnt need anything from kernel side as this is already used on MTL branches. It seems that everything crashes on debug build (default for PR testing in linux CI) and works on release build (default for PR testing on second internal CI), I need to investigate why

At least in some cases, there are overruns recorded. This can be due to CPC being too low for debug build, leading to too low DSP freq used for a use-case: https://sof-ci.01.org/sofpr/PR8019/build2392/devicetest/index.html?model=MTLP_RVP_HDA&testcase=check-capture-3times

[  407.529335] <inf> pipe: pipeline_remove_cps_consumption: Unregistering module: 0x10004 CPC: 7116, period: 1000, core: 0
[  407.529360] <inf> clock: clock_set_freq: clock 0 set freq 23024000Hz freq_idx 0
[  407.529383] <inf> clock: clock_set_freq: clock 1 set freq 23024000Hz freq_idx 0
[  407.529460] <inf> clock: clock_set_freq: clock 2 set freq 23024000Hz freq_idx 0
[  407.587696] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 4893, max 37788, overruns 2
...
[  408.045146] <err> ipc: ipc_comp_free: ipc_comp_free(): comp id: 4 state is 4 cannot be freed
[  408.045195] <err> ipc: ipc_pipeline_free: ipc_pipeline_free(): module free () failed

kv2019i avatar Jan 31 '24 19:01 kv2019i

Still failure here on MTL nocodec https://sof-ci.01.org/sofpr/PR8019/build3645/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=check-capture-10sec

kv2019i avatar Mar 26 '24 18:03 kv2019i