[FEATURE][ChromeOS] Expose DSP load tracker info for visibility to ChromeOS
Is your feature request related to a problem? Please describe. Currently we don't have much DSP information in the system logs attached to feedback reports for audio issues. To debug such issues we need to rebuild the kernel with the trace feature enabled and capture logs with sof-logger. It would be super helpful for debugging if DSP load tracker info could be obtained without those debugging steps. Moreover, we could set up an autotest item to track DSP load performance for all supported products.
Describe the solution you'd like I don't have a clear idea of the most appropriate/simple way to do this. It could either be actively reported by the DSP, or provided on demand at the host's request. For the active approach, it should be enough to report critical info only (e.g. DSP load is over a threshold), in order not to add more overhead.
Describe alternatives you've considered None
Additional context None
@plbossart @ujfalusi @ranj063 any thoughts ?
When this is implemented, feature needs to be documented to https://thesofproject.github.io/latest/developer_guides/debugability/perf-counters/index.html?highlight=performance
I'd say an IPC to request this information from FW on demand would be the most suitable solution. So we could e.g. have a debugfs interface (exposed by the SOF Linux driver) and when CRAS queries it (at the rate it chooses), the SOF driver sends an IPC and FW replies with the current DSP load tracker data.
@kv2019i we have mtrace and Zephyr oops now, anything else pending ?
There is one log line that may cover this topic:
[ 65.782163]
Simply checking overruns tells whether the DSP is overloaded.
@lgirdwood @btian1 This is not done. The request is to have this data in user-space without enabling the FW logs (disabled in Chromebooks by default). So this requires new DSP+kernel infra. In IPC4, we have messages to report load measurements, but these are not implemented yet (neither on the FW nor on the kernel side). So this is still a valid ask and I'm not aware of any ongoing work to implement these (we discussed the IPC4 messages in question with the cycles-budget (CPC) topic around the manifest).
The request is to have this data in user-space without enabling the FW logs (disabled in Chromebooks by default). So this requires new DSP+kernel infra.
Why is this not just a log level configuration problem? https://docs.zephyrproject.org/latest/services/logging/index.html
Logging API is highly configurable at compile time as well as at run time. [...] During compilation logs can be filtered out on module basis and severity level. [...] Logs can also be compiled in but filtered on run time using dedicate API. Run time filtering is independent for each backend and each source of log messages. Source of log messages can be a module or specific instance of the module.
The old sof-logger has similar capabilities.
So the idea to start another, new and entirely different logging framework sounds weird to me. @johnylin76, @andyross, @cujomalainey why are Chromebooks "throwing the baby out with the bathwater" and disabling logs entirely, as opposed to selectively keeping the logging they need/want?
@marc-hb at least on pre-ace systems I have been warned that enabling logging hijacks a DMA (which PNP is not happy about when it comes to power) and also it can block power states. Trust me I would love to have even just ERR/WARN on.
The old sof-logger uses either shared memory for high levels or DMA for all levels or both at the same time. It's configurable.
I'm not familiar with all Zephyr logging possibilities yet but I know for sure that the way it's configured does NOT use any DMA right now in our CI, it relies on shared memory "mtrace" only (which means it can overrun and lose statements; "you get what you paid for")
@cujomalainey @marc-hb That's a good question. The "mtrace" Zephyr backend is friendlier to system flows (it doesn't take a DMA channel), but it's still not zero cost. There are writes to the SRAM that take DSP cycles, there's a logging thread that takes some memory and there's an impact on code size (although we already have dictionary formats in Zephyr up and running, limiting this). But in the end it will depend on the specific product. Our current assumption is still that, at least, not all release firmwares will have logging enabled (like has been the case with SOF previously). I.e. key debugging tools (like the ability to extract a FW coredump) will be available even if you don't have logging enabled. Having a way to access DSP load information without logging seems like a reasonable feature, similar to the coredump. I think we are running out of runway for v2.6 though, so this probably needs to be pushed out to v2.7.
There are writes to the SRAM that take DSP cycles, there's a logging thread that takes some memory and there's impact to code size (although we already have dictionary formats in Zephyr up and running limiting this).
Logging never comes for free; OK. Something Heisenberg something?
But how would a new technique focused on logging the DSP load only avoid all these costs? By being uber-specialized and optimized for reporting DSP load only? Then someone else realizes that "oh, wait, we also need the ability to log this other critical thing too in production builds... and that other one..."
Maybe I'm missing something, but until fine-tuning the configuration of the existing logging framework has been attempted and measured, starting a brand new technique seems to fall in the "premature optimization" category.
I.e. key debugging tools (like ability to extract a FW coredump), will be available even if you don't have logging enabled
If that's already available without logging then it's different; it really does come "for free". Coredumps feel quite different from regular logging too; for instance you don't need a long-running thread streaming a series of coredumps :-)
There are writes to the SRAM that take DSP cycles, there's a logging thread that takes some memory and there's impact to code size
Will happily pay that price if it means debugging time is cut because we are not left blind to errors in the field.
Our current assumption is still that at least not all release firmwares will have logging enabled (like has been the case with SOF previously)
I think the bigger issue was that we didn't enable it in the kernel, which meant the resources were never allocated. So even though the firmware was ready, the kernel side was never enabled.
I think the load data could even be host triggered for our purposes into the log, but we need to be able to run the log in production without PNP getting annoyed.
@kv2019i , agree with your comment on running out of runway for v2.6. Moving to v2.7.
@kv2019i can we get the overall LL load per core in SRAM window ? @ujfalusi will probably need a kernel debugFS to read it.
SOF drivers sends an IPC and FW replies with current DSP load tracker data.
@ujfalusi will probably need a kernel debugFS to read it.
These seem to imply a "poll" style of user interface where a process must run and regularly poll the firmware.
The description and some other comments seem to lean more towards a "push" style of interface where the firmware regularly logs the data somehow and somewhere (a different question). In the latter case, the interval is controlled by the firmware so it is more predictable.
@cujomalainey (who ~~filed this issue~~ has been requesting this feature) do you have a preference?
SRAM is FW push. DebugFS can be blocking read() for an update every 1sec using existing FW MCPS infra.
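For illustration, a minimal sketch of what such a poll-style consumer could look like; the debugfs node path and the one-snapshot-per-read() behaviour are assumptions here, not an agreed interface:

```python
#!/usr/bin/env python3
# Minimal sketch of a poll-style consumer of a DSP load debugfs node.
# The path below is hypothetical; the real name/location is TBD.
import time

DEBUGFS_NODE = "/sys/kernel/debug/sof/dsp_load"  # hypothetical path

def watch_dsp_load():
    # Assuming the kernel implements a blocking read() that returns one
    # complete snapshot per ~1 s update, no client-side sleep is needed;
    # otherwise add a time.sleep() between iterations.
    while True:
        with open(DEBUGFS_NODE, "rb") as f:
            snapshot = f.read()
        print(f"{time.strftime('%H:%M:%S')}: {len(snapshot)} bytes of load data")

if __name__ == "__main__":
    watch_dsp_load()
```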
No preference, @johnylin76 thoughts?
The DebugFS approach @lgirdwood mentioned above looks good to me. I suppose that the feature doesn't affect the DSP idle state, i.e. loads are calculated and reported only when the DSP core is active. Is that correct?
Correct, this data is already tracked today within FW. Reporting it every 1 second via debugFS should not be too difficult.
Long term it would be great to have a reporting mechanism where we could see all threads and system loads per core, number of IRQs, etc., but this is something that Zephyr will need to export. @andyross maybe something you can look at long term?
We have had lot of focus on the per-module measurements infra in v2.8 cycle, so this has not progressed. The ask is still valid, so let's move it to v2.9 and see if we can finally do the final bit of publishing this to the host side.
Let's target LL first, then DP second. I think we have all the data collected now in FW for LL, as CI is able to parse the logs per module.
@jsarha for v2.9 we need something initially that can provide data in an unused RO SRAM window (for Intel - other vendors will choose their memory locations) that can be read by debugFS as binary data with userspace parsing, i.e. a Python script would read and parse the debugFS data and display the contents. Later on it could use other transports like probes.
- The SRAM Windows and FW data collection will be disabled by default - it will be enabled at runtime via a new IPC message, i.e. the IPC to enable can be sent by the kernel when debugFS is open() and the IPC to disable when debugFS is close().
- There will be a second debugFS file that allows the FW data collection to be tuned to the needs of the developer. Each data payload will have an ID (uint32_t) and if the host echoes the ID to debugFS then that sends an IPC (set large config) to enable that feature, i.e.
echo 123, 789, -456 > /sys/debug/blah/sof_blah
This will enable data items with IDs 123 and 789 and disable ID 456 (negative).
- The FW data won't be parsed by the kernel (except for the total size & magic). It will just be passed through. It should have a generic header that is easy to parse, e.g.
```c
struct payload_hdr {
	uint32_t total_size;	// total size of the whole payload in bytes
	uint32_t hdr_size;	// size of this header only in bytes
	uint32_t magic;		// used to identify valid data
	uint32_t abi;		// ABI version, can be used by tool to warn users if too old
	uint64_t tstamp;	// aligned with host epoch
};
```
Following this will be the data structures/containers. They should all lead with a standard header that the tool can use e.g.
```c
struct elem_hdr {
	uint32_t size;	// size of this elem in bytes
	uint32_t id;	// ID of this elem type
};

/* elem body follows here */
```
This way the Python tool can read and parse the data, and it should ignore any IDs it does not know about.
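To make the tool-side flow concrete, here is a rough Python sketch of how such a script could walk the payload. Only the two headers above come from this comment; the debugfs path, the little-endian packing and the way elements are skipped are assumptions for illustration:

```python
#!/usr/bin/env python3
# Rough sketch of the userspace parser outlined above. The struct layouts
# mirror payload_hdr / elem_hdr; the node path and endianness are
# illustrative assumptions only.
import struct

TELEMETRY_NODE = "/sys/kernel/debug/sof/telemetry"  # hypothetical path
PAYLOAD_HDR = struct.Struct("<4IQ")  # total_size, hdr_size, magic, abi, tstamp
ELEM_HDR = struct.Struct("<2I")      # size, id

def parse_payload(data: bytes):
    total_size, hdr_size, magic, abi, tstamp = PAYLOAD_HDR.unpack_from(data, 0)
    # The kernel only validates total size & magic; all real parsing happens here.
    elems = []
    offset = hdr_size
    while offset + ELEM_HDR.size <= total_size:
        size, elem_id = ELEM_HDR.unpack_from(data, offset)
        if size < ELEM_HDR.size:
            break  # malformed element, stop rather than loop forever
        # Keep the raw body; a real tool would dispatch on elem_id and
        # silently ignore IDs it does not know about.
        elems.append((elem_id, data[offset + ELEM_HDR.size:offset + size]))
        offset += size
    return magic, abi, tstamp, elems

if __name__ == "__main__":
    with open(TELEMETRY_NODE, "rb") as f:
        magic, abi, tstamp, elems = parse_payload(f.read())
    print(f"magic=0x{magic:08x} abi=0x{abi:08x} tstamp={tstamp}")
    for elem_id, body in elems:
        print(f"elem id={elem_id} len={len(body)}")
```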
Stable-v2.9 branched, this didn't make the cut, bumping to 2.10. There is progress with both Zephyr thread analyzer and with a more generic telemetry interface, but neither is ready for merge for SOF2.9.
@jsarha can you give an update of any PRs and output to date. Thanks !
I have a couple of PRs to address this issue. First, there are Intel ADSP fixes for the Zephyr Thread Analyzer:
- https://github.com/zephyrproject-rtos/zephyr/pull/71408
And then two other PRs, depending on the first one, that still need some polishing but are functional and working at the moment:
The telemetry2 and thread info FW code, and the Python client to decode the associated debugfs data.
- https://github.com/thesofproject/sof/pull/9084
Simple debugfs code for the Linux SOF driver to map the telemetry2 window to a debugfs file.
- https://github.com/thesofproject/linux/pull/4956
Here is example output of the current implementation:
```
python3 thread-analyzer.py
Open /sys/kernel/debug/sof/telemetry2 failed '[Errno 14] Bad address'
core_count: 3
Core 0 load 0%
0x401184a8 cpu 0.0% stack 2.0%
0x401183e0 cpu 0.0% stack 2.0%
1 thread info cpu 0.0% stack 8.2%
0 thread info cpu 0.0% stack 47.1%
edf_workq cpu 0.0% stack 8.2%
sysworkq cpu 0.0% stack 31.8%
logging cpu 0.0% stack 19.6%
idle 00 cpu 99.6% stack 33.7%
Core 2 load 0%
idle 02 cpu 99.6% stack 27.1%
0x40118570 cpu 0.0% stack 47.1%
2 thread info cpu 0.0% stack 47.1%
Core 0 load 0%
0x401184a8 cpu 0.0% stack 2.0%
0x401183e0 cpu 0.0% stack 2.0%
1 thread info cpu 0.0% stack 8.2%
0 thread info cpu 0.0% stack 47.1%
edf_workq cpu 0.0% stack 8.2%
sysworkq cpu 0.0% stack 31.8%
logging cpu 0.0% stack 19.6%
idle 00 cpu 99.6% stack 33.7%
Core 2 load 6%
ll_thread2 cpu 0.0% stack 18.8%
idle 02 cpu 93.3% stack 28.6%
0x40118570 cpu 0.4% stack 47.1%
2 thread info cpu 0.0% stack 47.1%
Core 0 load 0%
0x401184a8 cpu 0.0% stack 2.0%
0x401183e0 cpu 0.0% stack 2.0%
1 thread info cpu 0.0% stack 8.2%
0 thread info cpu 0.0% stack 47.1%
edf_workq cpu 0.0% stack 8.2%
sysworkq cpu 0.0% stack 31.8%
logging cpu 0.0% stack 19.6%
idle 00 cpu 99.6% stack 33.7%
Core 2 load 5%
ll_thread2 cpu 5.5% stack 18.8%
idle 02 cpu 94.1% stack 28.6%
0x40118570 cpu 0.0% stack 47.1%
2 thread info cpu 0.0% stack 47.1%
Core 0 load 0%
0x401184a8 cpu 0.0% stack 2.0%
0x401183e0 cpu 0.0% stack 2.0%
1 thread info cpu 0.0% stack 8.2%
0 thread info cpu 0.0% stack 47.1%
edf_workq cpu 0.0% stack 8.2%
sysworkq cpu 0.0% stack 31.8%
logging cpu 0.0% stack 19.6%
idle 00 cpu 99.6% stack 33.7%
Core 2 load 5%
ll_thread2 cpu 5.5% stack 18.8%
idle 02 cpu 94.1% stack 28.6%
0x40118570 cpu 0.0% stack 47.1%
2 thread info cpu 0.0% stack 47.1%
Core 0 load 0%
0x401184a8 cpu 0.0% stack 2.0%
0x401183e0 cpu 0.0% stack 2.0%
1 thread info cpu 0.0% stack 8.2%
0 thread info cpu 0.0% stack 47.1%
edf_workq cpu 0.0% stack 8.2%
sysworkq cpu 0.0% stack 31.8%
logging cpu 0.0% stack 19.6%
idle 00 cpu 99.6% stack 33.7%
Core 2 load 5%
ll_thread2 cpu 5.5% stack 18.8%
idle 02 cpu 94.1% stack 28.6%
0x40118570 cpu 0.0% stack 47.1%
2 thread info cpu 0.0% stack 47.1%
Core 0 load 1%
0x401184a8 cpu 0.0% stack 2.0%
0x401183e0 cpu 0.0% stack 2.0%
1 thread info cpu 0.0% stack 8.2%
0 thread info cpu 0.0% stack 47.1%
edf_workq cpu 0.0% stack 8.2%
sysworkq cpu 0.0% stack 31.8%
logging cpu 0.8% stack 19.6%
idle 00 cpu 98.4% stack 33.7%
```
- The SRAM Windows and FW data collection will be disabled by default - it will be enabled at runtime via a new IPC message. i.e. the IPC to enable can be sent by kernel when debugFS is open() and IPC to disable when debugFS is close().
@lgirdwood What if you want this FW data collection to start immediately without waiting for user-space to enable this with an IPC?
Take the example of the trace: it's on immediately. Userspace can select to read it or not.
Kconfig initially, but there could be later updates that allow developers to switch it on/off via other methods.
@lgirdwood if it's a Kconfig, then why would you need an IPC to turn the data collection on?
I don't know about this case specifically, but you can have a default value in Kconfig for boot time and the ability to change it again at runtime; it's not unheard of.
(you can also have the Kconfig ability to strip the code and save space but that's a different Kconfig)