sof Add native_posix-based Zephyr emulator

First cut at a rig to run SOF as a Zephyr native_posix application, allowing ASAN/MSAN/fuzz testing of the full OS, including (mocked) driver layers. Think of it as CONFIG_LIBRARY on steroids. (It's also been a very useful exercise for me to get my hands on and head around the SOF platform/arch layers.)

This is just the framework for early review right now, so no need to merge as it has no drivers and doesn't do anything yet.

Build SOF with west build -b native_posix -- -DCONFIG_ZEPHYR_POSIX=y -DCONFIG_ZEPHYR_NATIVE_DRIVERS=y. It also needs a few fixups on the Zephyr side, which I'll push for review and link below.

Aug 29 '22 19:08 andyross

Needed Zephyr fixups at: https://github.com/zephyrproject-rtos/zephyr/pull/49641 (neither is a build dependency, they can merge in either order)

Heh, looks like it already has a collision. Will update for the next iteration as I start wiring in dma & ipc.

Aug 29 '22 19:08 andyross

@marc-hb how can we work this into CI? It should be buildable and runnable as a host executable. @cujomalainey @andyross what tests could we run in CI? It should be possible to enable valgrind, but do we have any UTs that could be run?

Aug 31 '22 10:08 lgirdwood

> @marc-hb how can we work this into CI? It should be buildable and runnable as a host executable. @cujomalainey @andyross what tests could we run in CI? It should be possible to enable valgrind, but do we have any UTs that could be run?

Sorry for the late reply, this work is expanding the fuzzer.

@andyross can you rebase this series?

Nov 22 '22 20:11 cujomalainey

> @marc-hb how can we work this into CI? It should be buildable and runnable as a host executable.

If it can easily be run with a couple of west commands, then it should be easy enough for @andyross to add a new file in .github/workflows/ :-) Plenty of existing examples there and of course I can help.

Nov 22 '22 21:11 marc-hb

@andyross, will be forking the stable-v2.4 branch this week. If you can rebase, we can take this for v2.4.

Nov 23 '22 11:11 lgirdwood

Dusted this off and made it do stuff. This is now a reasonably complete fuzzing rig; I'm building it locally via:

ZEPHYR_TOOLCHAIN_VARIANT=llvm west build -p -b native_posix ../modules/audio/sof/app -- -DCONFIG_ASSERT=y -DCONFIG_SYS_HEAP_BIG_ONLY=y -DCONFIG_ZEPHYR_NATIVE_DRIVERS=y -DCONFIG_ARCH_POSIX_LIBFUZZER=y -DCONFIG_ARCH_POSIX_FUZZ_TICKS=100 -DCONFIG_ASAN=y

(You do have to use clang, as gcc lacks libfuzzer even though it can do ASAN. Frustratingly MSAN isn't an option even though it's supported by Zephyr, as it doesn't work for 32 bit executables and I haven't put in the effort to make SOF build for native_posix_64 yet).

Once built, just run build/zephyr/zephyr.exe with a single argument specifying a corpus directory for it to write test cases (you don't technically need this, but it works much better if it knows what it's done in the past), and let it run until something fails. There are 2-3 undiagnosed crashes in the tree for sure that I see regularly. I fixed two fairly obvious ones already (and will submit those in separate PRs if needed).
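For anyone chasing those crashes, here is a minimal sketch of a replay helper. It assumes zephyr.exe keeps stock libFuzzer behavior: a failing input is saved as a crash-<hash> file in the working directory, and passing a file (rather than a directory) as the argument runs just that one input. The function name replay_crashes is hypothetical and not part of this PR.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): replay saved crash inputs one by one.
# Assumption: the fuzz binary treats a file argument as a single input to run,
# as a stock libFuzzer executable does.
replay_crashes() {
    bin="$1"          # path to the fuzz executable
    dir="${2:-.}"     # directory holding crash-* artifacts (default: cwd)
    failures=0
    for f in "$dir"/crash-*; do
        [ -e "$f" ] || continue   # glob matched nothing: no artifacts present
        if "$bin" "$f"; then
            echo "no longer reproduces: $f"
        else
            echo "still fails: $f"
            failures=$((failures + 1))
        fi
    done
    # Succeed only if every saved input now runs cleanly.
    [ "$failures" -eq 0 ]
}
```

Something like replay_crashes build/zephyr/zephyr.exe . after a fix would then show which of the saved inputs still trip ASAN or an assert.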

This isn't "done" done, really. There will always be more work to push coverage up. In particular the DMA layer has had the timer I wrote removed because it wasn't being hit in practice (I'll get this fixed ASAP, we need that), and the DAI stubs need to be smarter in order to exercise all the variant code paths.

But it's at a level now where it's finding real bugs (though to be fair: some of them are bugs in the rig!), so it's worth having I think.

Dec 04 '22 16:12 andyross

Cleaned up and fixed the checkpatch errors (needed licenses in particular).

Dec 06 '22 17:12 andyross

looks like checkpatch still has some legitimate complaints

Dec 06 '22 18:12 cujomalainey

This is definitely producing results. If anyone wants to join in, there's a pretty reproducible case this is able to hit where the "pipeline" pointer in the struct comp_dev pointed to by the "cd" union field of a struct ipc_comp_dev is turning up garbage at runtime given a little time in the fuzzer. It's not a heap pointer for sure.

The typing is a little tangled to my eyes, and I think this could use some expert attention. I'm currently thinking this is a type mixup (this pointer path involves a union in ipc_comp_dev) where the IPC protocol can fool the firmware into using an object of the wrong type at runtime? Seems plausible...

Dec 06 '22 18:12 andyross

Yeah, that's exactly what it was. The IPC3 "stream" command handlers weren't checking the type of the component object, but the object ID is part of the external command, so it was possible to pull out incompatible objects and follow the resulting garbage pointer. Fixed in this PR (though the same bug likely exists in IPC4 too!).

(And yeah, I'd dropped a patch during rebasing. Checkpatch is now error free and mostly happy; the remaining warnings are false positives from conforming to API usage this PR can't change.)

Dec 06 '22 19:12 andyross

This is now running without problems for ~hour long single-core fuzz runs. It seems to be getting through most of the topology and component initialization paths. That's probably worth merging as-is just for regression coverage. I'm working right now on figuring out why my DMA callback paths aren't being hit, which should get us most of the way to the finish line I hope.

Dec 06 '22 19:12 andyross

> This is definitely producing results. If anyone wants to join in, there's a pretty reproducible case this is able to hit where the "pipeline" pointer in the struct comp_dev pointed to by the "cd" union field of a struct ipc_comp_dev is turning up garbage at runtime given a little time in the fuzzer. It's not a heap pointer for sure.
>
> The typing is a little tangled to my eyes, and I think this could use some expert attention. I'm currently thinking this is a type mixup (this pointer path involves a union in ipc_comp_dev) where the IPC protocol can fool the firmware into using an object of the wrong type at runtime? Seems plausible...

Wouldn't be the first time where there is a bug due to trusting IPC type information about an object.

Nice find :)

Dec 06 '22 20:12 cujomalainey

The https://sof-ci.01.org/sofpr/PR6210/build2683/devicetest/index.html?model=TGLU_RVP_NOCODEC_IPC4ZPH&testcase=check-kmod-load-unload-after-playback failure is a bit worrying. Re-running that test.

Dec 07 '22 05:12 marc-hb

SOFCI TEST

Dec 07 '22 05:12 marc-hb

https://sof-ci.01.org/sofpr/PR6210/build2697/devicetest/index.html failed the same way and I found it elsewhere too. It's a recent, unrelated regression; I just filed #6735.

Dec 07 '22 07:12 marc-hb

SOFCI TEST

Dec 07 '22 13:12 lgirdwood

> This is now running without problems for ~hour long single-core fuzz runs. It seems to be getting through most of the topology and component initialization paths. That's probably worth merging as-is just for regression coverage. I'm working right now on figuring out why my DMA callback paths aren't being hit, which should get us most of the way to the finish line I hope.

@andyross seeing unrelated container errors https://github.com/thesofproject/sof/actions/runs/3633086599/jobs/6129697892 so will rerun, unlikely to be this PR, looks more like GH container infra.

Dec 07 '22 13:12 lgirdwood

Same ModuleNotFoundError: No module named 'google.cloud.datastore_v1.services' also spotted in unrelated PR #6732

Dec 07 '22 15:12 marc-hb

> Same ModuleNotFoundError: No module named 'google.cloud.datastore_v1.services' also spotted in unrelated PR #6732

@andyross @cujomalainey any idea if the service is down ?

Dec 07 '22 15:12 lgirdwood

SOFCI TEST

Dec 07 '22 15:12 miRoox

@andyross can we have a paragraph with some instruction on how to use this mode for sof-docs ? @marc-hb once instructions are known, can we add build and runtime smoke test for host Zephyr to CI ?

Dec 07 '22 17:12 lgirdwood

Better than documentation, can we have a script that not just CI but anyone can run? I much prefer a quick and dirty but functional script than documentation as a starting point.

Is this still using oss-fuzz as mentioned last year in https://github.com/thesofproject/sof/pull/4132#issuecomment-843465052 ? Links to documentation there. That documentation mentions some Docker images, that should help with scripting?

Something like https://github.com/thesofproject/sof/blob/main/scripts/docker-run.sh , https://github.com/thesofproject/sof/blob/main/zephyr/docker-run.sh etc.

Dec 07 '22 17:12 marc-hb

> > Same ModuleNotFoundError: No module named 'google.cloud.datastore_v1.services' also spotted in unrelated PR #6732
>
> @andyross @cujomalainey any idea if the service is down ?

Looks like latest pushes have been fixed, I checked upstream yesterday and there was no mention of it

Dec 07 '22 17:12 cujomalainey

@andyross I think it would be best to submit the actual code fixes in separate PRs.

If any of these fixes needs some actual review, then review will be impossible here: we can't review totally different things in the same PR (no threading).
The trivial ones that don't need any review can be lumped together and merged in a day.

For both, having a PR name and description that relates to them will provide a much better record and more visibility.

Also best not to merge all fixes at the exact same time for continuous integration purposes.

Sorry for requesting this (small) hassle. BTW I almost never submit HEAD to Github, always HEAD~n instead. I'm too lazy to have many branches and I "rotate" commits with a (good) git rebase client instead.

Dec 07 '22 17:12 marc-hb

> Better than documentation, can we have a script that not just CI but anyone can run? I much prefer a quick and dirty but functional script than documentation as a starting point.

Yeah, me too. :) FWIW, there's no complexity here beyond building SOF as a Zephyr application for the "native_posix" board with clang, and specifying the needed kconfigs. I pasted it above too, but this is all I'm using locally:

export ZEPHYR_TOOLCHAIN_VARIANT=llvm
west build -b native_posix $SOF_DIR/modules/audio/sof/app -- \
    -DCONFIG_ARCH_POSIX_LIBFUZZER=y \
    -DCONFIG_ASAN=y \
    -DCONFIG_ASSERT=y \
    -DCONFIG_SYS_HEAP_BIG_ONLY=y \
    -DCONFIG_ZEPHYR_NATIVE_DRIVERS=y \
    -DCONFIG_ARCH_POSIX_FUZZ_TICKS=100

mkdir -p ./fuzz_corpus
build/zephyr/zephyr.exe ./fuzz_corpus

Having typed that, I should probably just add it to the next PR. :)

Quick explanations:

ARCH_POSIX_LIBFUZZER causes the resulting executable to be a fuzz test. It iteratively throws cases at the Zephyr app (they get received as a special "fuzz interrupt"), which then turns them into IPC commands.
ARCH_POSIX_FUZZ_TICKS is the interval between simulated fuzz inputs. It should be long enough that scheduled timers and DMA have a chance to fire; it probably doesn't matter much as long as it's not too short. I haven't tried tuning it.
ASAN enables clang's AddressSanitizer. It's optional, but obviously very useful to have on to detect errors. (Ideally we'd try variant runs with MSAN too, which is similar but also has heap poisoning, but alas that only works with 64 bit builds. Zephyr supports native_posix_64, but SOF will need some surgery it seems.)
ASSERT is on for the same reason: to detect logic errors too.

The others are just getting the device config to conform to what I perceive the mainline SOF environment to be. The stubs in this PR implement a Zephyr DMA device/driver and not a SOF one, for example.

> Is this still using oss-fuzz

Other way around: oss-fuzz is a service that will use this. This is based on "libfuzzer" and "asan", which are clang plugins supported by Zephyr.

Dec 07 '22 18:12 andyross

> Having typed that, I should probably just add it to the next PR. :)

Thank you! I hope all this returns a "smart" and usable exit code (I'm still working through a huge disappointment with sparse 0b757a594f67ee)

Last but not least: for how long does (should?) this run? Depends on how fast the runner is?

Dec 07 '22 22:12 marc-hb

Oh sorry, I should back up. No, it runs forever until it finds a failure condition. :)

Basically the fuzz engine keeps feeding new semi-random input to the software under test, which was built with special instrumentation that can track coverage. When it finds a chunk of bytes that causes the program to do something different, it puts that into its notebook and starts trying variants looking for yet more new coverage. The idea is that this algorithm can explore all the available code states of a program in linear(ish) time, no matter how complicated or deeply nested.

So the idea is that you/we/someone (in practice "oss-fuzz") keeps running this regularly looking for cases as the software evolves.

But it's usable as a smoke test too, obviously. Just be prepared to kill it after 10 seconds or whatever.

Dec 07 '22 23:12 andyross

> Just be prepared to kill it after 10 seconds or whatever.

20 min is considered a "sweet spot" for CI and that's about what our longest tests take. So we could let it run for that long. I just wondered how much coverage a standard Github runner can achieve in that duration. Much more than in 10 seconds I hope? :-)

Thanks to @aborisovich we now know how to schedule some daily tests, so we could let it run for much longer there: https://github.com/thesofproject/sof/actions/workflows/daily-tests.yml

In any case we need an exit code somehow so it can be converted to green or red. Interrupting a process tends not to return any useful exit code, does it? So, some grep-fu required again? (d70ac67032ffd1bc561454d) sigh
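One way to get a usable exit code without grep is to bound the run with coreutils timeout, which exits with status 124 when it has to stop the command. Below is a minimal sketch under that assumption; the helper name run_bounded is hypothetical, and it assumes the fuzz executable exits non-zero when it hits a crash or assertion.

```shell
#!/bin/sh
# Hypothetical CI helper (not part of this PR): run a command for at most
# LIMIT seconds and turn the result into a pass/fail exit status.
run_bounded() {
    limit="$1"
    shift
    status=0
    timeout "$limit" "$@" || status=$?
    case "$status" in
        0|124)
            # 0: the command finished on its own; 124: timeout(1) stopped it,
            # meaning no failure was found within the time budget.
            return 0
            ;;
        *)
            echo "run failed with status $status" >&2
            return "$status"
            ;;
    esac
}
```

A 20-minute CI step could then be run_bounded 1200 build/zephyr/zephyr.exe ./fuzz_corpus, with the daily job simply passing a larger limit.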

Dec 07 '22 23:12 marc-hb

For sure we'll need to experiment with this. Note that there are features still to arrive to push up coverage (e.g. I realized just this morning that IPC commands create components by specifying their UUID, and a 128 bit value is probably too much to search even for Magic Fuzzing, so I'll need to either provide some seed values or whitebox the command format).

And there's the question of "how long to run" too, which is sort of a mismatch with the fuzzing metaphor. Fuzzing more or less presumes that the same input will produce the same output, but a lot of firmware work involves state. So right now it just leaves the firmware running and accepts new IPC commands to the same OS instance. But THAT means that you tend to run out of heap memory quickly, so a lot of command handling ends up stubbed out in allocation error handlers and can't reach all the states. So... I guess I'm thinking I need to do a reboot after some amount of simulated runtime? But how to decide on how long? There's a bunch of arbitrary decisions to be made I suspect.

Dec 08 '22 00:12 andyross

> But THAT means that you tend to run out of heap memory quickly,

@andyross sorry, why? Are there memory leaks?

Dec 08 '22 07:12 lyakh