[RFC] pkg/cmsis: use unique package for CMSIS headers, DSP and NN modules
Contribution description
The `cpu/cortexm_common` directory contains the Core CMSIS headers in a vendor subdirectory. The CMSIS-DSP and CMSIS-NN packages are both fetching code from the same ARM-Software/CMSIS_5 repository.
This situation leads to:
- out-of-sync code: I don't know which version of the Core headers is used in cortexm_common, and in master CMSIS-DSP and CMSIS-NN don't even use the same version of the code (5.4 for the former, 5.6 for the latter)
- hard-to-maintain code: an update on one side (CMSIS-NN bumped to 5.9.0, for example) might result in a failed build unless the vendor code is also updated.
This PR proposes to move all of that into a single package and make the cortexm_common module depend on it. cmsis-dsp and cmsis-nn become two extra modules of the cmsis package.
The latest version, 5.9.0, is used by the cmsis package. Some adaptations were required in a few i2c implementations that define a _start function, since the CMSIS headers also contain a similar definition. Some changes were also required in the DSP code (mostly build system changes).
Also note that the cmsis package is only downloaded once (the very first time an application is built for a Cortex-M target) into build/pkg/cmsis; after that the same code is reused, so there won't be any build overhead in the long run for a normal user.
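For illustration, here is a minimal sketch of how this looks from a user's perspective; the board and application names are only examples, and the usage assumes the `cmsis-dsp`/`cmsis-nn` module naming described above:

```sh
# First build of any Cortex-M application fetches the cmsis package once ...
make BOARD=nucleo-f429zi -C examples/hello-world all
# ... into build/pkg/cmsis, where it is reused by all subsequent builds.
ls build/pkg/cmsis
# An application that needs the DSP or NN code would additionally select the
# extra module in its Makefile, e.g.:  USEMODULE += cmsis-dsp
```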
Using a package for such a central component, one that touches a very large part of the hardware supported by RIOT, might have some impact on CI performance. Even if I consider this problem secondary, it must be taken into account, which is why I marked this PR as RFC.
Testing procedure
- Green Murdock
- Tests on various ARM boards should be performed
Issues/PRs references
None
Hi! What kind of test or tests do you have in mind? I have just cloned your PR, compiled the blinky program and flashed it to l031k6 (M0, nucleo-32), l073rz (M0, nucleo-64) and f428zi (M4, nucleo-144). Everything works perfectly. Tomorrow I could test it in the same manner on l552ze-q, f334r8 and the stm32f469i-disco discovery board.
What kind of test or tests do you have in mind?
I think that running `./dist/tools/compile_and_test_for_board/compile_and_test_for_board.py . <board> --with-test-only --jobs=4` would be useful, but it will take some time (a couple of hours per board).
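For reference, a concrete invocation from the RIOT root would look like this (the board name is just an example taken from this thread):

```sh
# Compile and run the test suites for one board; the board must be connected,
# since each test firmware is flashed and its output checked. Expect a couple
# of hours per board.
./dist/tools/compile_and_test_for_board/compile_and_test_for_board.py . nucleo-l552ze-q \
    --with-test-only --jobs=4
```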
Thanks for the clarification. I will try to run the mentioned tests tomorrow.
This seems to almost pass the CI now, but I don't understand the Rust failure: `riot_wrappers.0bcd1c26-cgu.11:(.text.rust_begin_unwind+0xc): undefined reference to 'llvm_asm_is_not_supported_any_more'`. Any idea @chrysn?
Hi, during the weekend I ran the tests on nucleo-l552ze-q. 218 tests ended with success and 20 with failure - in most cases due to missing Python modules or additional programs. However, there are some strange timeouts. Below is the output from the script summary, with my comments concerning the failure reasons.
ERROR:nucleo-l552ze-q:Tests failed: 20
Failures during compilation:
- [tests/pkg_emlearn](tests/pkg_emlearn/compilation.failed)
"No module named 'emlearn'
- [tests/pkg_nanopb](tests/pkg_nanopb/compilation.failed)
"protoc: Command not found"
- [tests/suit_manifest](tests/suit_manifest/compilation.failed)
"No module named 'cbor2'
Failures during test:
- [examples/micropython](examples/micropython/test.failed)
- [tests/congure_reno](tests/congure_reno/test.failed)
"No module named 'riotctrl'"
- [tests/congure_test](tests/congure_test/test.failed)
"No module named 'riotctrl'
- [tests/cpp11_condition_variable](tests/cpp11_condition_variable/test.failed)
pexpect.exceptions.TIMEOUT: Timeout exceeded
- [tests/gnrc_rpl](tests/gnrc_rpl/test.failed)
"No module named 'riotctrl'
- [tests/malloc_thread_safety](tests/malloc_thread_safety/test.failed)
"Timeout in expect script at "child.expect("TEST PASSED")"
- [tests/pbkdf2](tests/pbkdf2/test.failed)
socat program not installed
- [tests/periph_timer_short_relative_set](tests/periph_timer_short_relative_set/test.failed)
"too long delay, aborted after 1001 ... TEST FAILED"
- [tests/periph_wdt](tests/periph_wdt/test.failed)
"Timeout in expect script" -> line 54
- [tests/pthread_rwlock](tests/pthread_rwlock/test.failed)
"Timeout in expect script" -> line 13 + stack pointer corrupted
- [tests/riotboot](tests/riotboot/test.failed)
"pexcpect.exceptions.TIMEOUT: Timeout exceeed."
- [tests/riotboot_hdr](tests/riotboot_hdr/test.failed)
"pexcpect.exceptions.TIMEOUT: Timeout exceeed."
- [tests/sched_change_priority](tests/sched_change_priority/test.failed)
socat program not installed
- [tests/shell](tests/shell/test.failed)
socat program not installed
- [tests/sys_sema_inv](tests/sys_sema_inv/test.failed)
"Timeout in expect script" -> line 12 + stack pointer corrupted
- [tests/turo](tests/turo/test.failed)
"No module named 'riotctrl'
- [tests/unittests](tests/unittests/test.failed)
"pexcpect.exceptions.TIMEOUT: Timeout exceeed."
If any output files are needed I could place them in the forum.
I could run these tests on other boards - are you interested? Should I first install all remaining software?
I could run these tests on other boards - are you interested?
That would be nice
Should I first install all remaining software?
You can use `BUILD_IN_DOCKER=1` and it will fix all the build failures, since the Docker image provides all the tools. For running the tests, you have to install socat (apt-get) and riotctrl (pip install) locally, and then they should work.
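Roughly, the setup would look like this (a sketch, assuming a Debian/Ubuntu host and that `BUILD_IN_DOCKER` is picked up from the environment just as for a plain make call):

```sh
# Host-side tools needed by several test scripts
sudo apt-get install socat
pip install --user riotctrl

# Build inside the Docker image so that protoc, emlearn, cbor2 and the other
# build-time dependencies are available
BUILD_IN_DOCKER=1 ./dist/tools/compile_and_test_for_board/compile_and_test_for_board.py . <board> \
    --with-test-only --jobs=4
```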
@chrysn, any idea about this Rust-related failure reported by Murdock?
`riot_wrappers.0bcd1c26-cgu.11:(.text.rust_begin_unwind+0xc): undefined reference to 'llvm_asm_is_not_supported_any_more'`
Yes: due to mismatched versions (a permanent fix is expected within this week), assembly called from Rust currently needs manual translation in an explicit table that accounts for syntax differences.
The new CMSIS probably changed something about how particular assembly instructions ("interrupts on", "interrupts off" and "check interrupt status" are the ones I know from there off the top of my head) are expressed, and they need a new translation. If this is urgent, I can update the translations (or point you to where to look, to save you the time, if you want to do it yourself); otherwise, this will resolve itself with the CI update that introduces the newer C2Rust.
(As for the definition of 'urgent' ... their tracking issue https://github.com/immunant/c2rust/issues/388 for the release went from "4 things we want to get done before we push a release" to "all 4 checked" in one week, has been at "we just need to do the release" for less than a day now, and I expect they'll just reap some cheap merges before pushing it; then it should be a matter of a rather straightforward CI update.)
Thanks for the answer @chrysn. This PR is still RFC so it's not really urgent. For now the changes to the actual RIOT code base are quite minimal (mostly code removal and minimal updates in tests), so it should be quite easy to rebase from time to time. The main question is the CI build time, which seems to increase.
Test results for nucleo-l073rz: 234 tests succeeded, 7 failed - more details below (from failuressummary.md, with my comments).
Failures during compilation:
- [tests/pkg_emlearn](tests/pkg_emlearn/compilation.failed)
- [tests/rust_minimal](tests/rust_minimal/compilation.failed)
Failures during test:
- [examples/micropython](examples/micropython/test.failed)
Unexpected end of file in expect script -> 01-run.py line 15
- [tests/gnrc_rpl](tests/gnrc_rpl/test.failed)
pexpect.exceptions.EOF -> 01-run.py line 175, 168, 93
- [tests/pbkdf2](tests/pbkdf2/test.failed)
pexpect.exceptions.TIMEOUT -> 02-random.py line 80
- [tests/periph_timer_short_relative_set](tests/periph_timer_short_relative_set/test.failed)
Timeout in expect script -> 01-run.py line 14
- [tests/riotboot](tests/riotboot/test.failed)
Timeout in expect script -> 01-run.py line 34
Test results for nucleo-l432kc: 233 tests succeeded, 10 failed - more details below (from failuressummary.md, with my comments).
Failures during compilation:
- [tests/pkg_emlearn](tests/pkg_emlearn/compilation.failed)
- [tests/pkg_tflite-micro](tests/pkg_tflite-micro/compilation.failed)
This test is 'too big' for this board - I opened PR 18101
- [tests/rust_minimal](tests/rust_minimal/compilation.failed)
Failures during test:
- [examples/micropython](examples/micropython/test.failed)
Unexpected end of file in expect script -> 01-run.py line 15
- [tests/gnrc_rpl](tests/gnrc_rpl/test.failed)
pexpect.exceptions.EOF -> 01-run.py line 175, 168, 93
- [tests/pbkdf2](tests/pbkdf2/test.failed)
pexpect.exceptions.TIMEOUT -> 02-random.py line 80
- [tests/periph_timer_short_relative_set](tests/periph_timer_short_relative_set/test.failed)
Timeout in expect script -> 01-run.py line 14
- [tests/pkg_utensor](tests/pkg_utensor/test.failed)
Timeout in expect script -> 01-run.py line 8
- [tests/pkg_wolfssl](tests/pkg_wolfssl/test.failed)
pexpect.exceptions.TIMEOUT -> 01-run.py line 41
- [tests/riotboot](tests/riotboot/test.failed)
Timeout in expect script -> 01-run.py line 34
Funny how, just as this started still showing assembler errors with the latest CI update, I got a comment from the C2Rust authors on a fix I did for RISC-V saying that it needs to be generic -- that would have caught this already. Spooling this for another round of CI updates, sorry for the delay.
@chrysn, what is the situation on the Rust side regarding this PR?
It just needs https://github.com/RIOT-OS/riotdocker/pull/196 (or the update to the C2Rust version without the build system changes, at https://github.com/RIOT-OS/riotdocker/pull/199) to be ACKed. If you prefer to review just the version bump, I can also do this on the older build infrastructure (the one I'm largely removing in 196/199), but then it's an interleaved-step thing with Docker uploads, whereas with the new version it's just: let bors merge it, let GitHub Actions push it, and restart the worker.
Awesome, it's all green! ... but took 3h23m...
Awesome, it's all green! ... but took 3h23m...
It's not the PR, it's the CI. Even #18208 took 3 hours, and that was a docs-only change (which, just due to some bug in detecting those, was fully built).
Even #18208 took 3 hours
Well, it took exactly 2h47m, which is still 36 minutes less. So this PR adds ~20% total CPU time, which is a lot TBH.
Well, it took exactly 2h47m, which is still 36 minutes less. So this PR adds ~20% total CPU time, which is a lot TBH.
Times vary too widely to tell. #17680 is a local change (it should not affect most of the builds) and it just took 5 hours.
If there is more reason to worry than the CI times, you could run a local test between the branches on any representative board; 1 minute of build time locally probably tells more than several hours on CI.
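A quick local comparison could look like this (a sketch; the branch name is hypothetical and the board/application are only examples):

```sh
# Time a clean build of the same application on both branches
git checkout master
make BOARD=nucleo-f429zi -C examples/hello-world clean
time make BOARD=nucleo-f429zi -C examples/hello-world all

git checkout pr/pkg-cmsis   # hypothetical name of a local checkout of this PR
make BOARD=nucleo-f429zi -C examples/hello-world clean
time make BOARD=nucleo-f429zi -C examples/hello-world all
```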
Murdock results
:heavy_check_mark: PASSED
eac9cdfd4a7f44ac3a3b45e047c5ae28cc37b6b6 pkg/gecko_sdk: disable cast-align globally
Success | Failures | Total | Runtime |
---|---|---|---|
6931 | 0 | 6931 | 09m:49s |
Artifacts
Before merging this I'd like to agree on the strategy and check the impact on build times.
bors merge-
Canceled.
Here are some Murdock stats:
- this PR job: 2005 builds took 3h 56m 15s in total CPU time
- this job: 2007 builds took 2h 51m 53s
- this job: 2007 builds took 2h 57m 49s
- this job: 2005 builds took 2h 55m 47s
So it seems this PR adds about 1h in total, which is a ~30% increase, on a subset of boards (2 ARM-based out of 4).
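For transparency, here is the back-of-the-envelope arithmetic behind that estimate (plain shell arithmetic on the numbers above; the exact percentage depends on which baseline job you compare against):

```sh
# Total CPU times from the Murdock jobs above, converted to seconds
pr=$((3*3600 + 56*60 + 15))                                                   # 3h 56m 15s (this PR)
base=$(( ((2*3600+51*60+53) + (2*3600+57*60+49) + (2*3600+55*60+47)) / 3 ))   # average of the three baseline jobs
echo "overhead: $((pr - base)) s, i.e. ~$(( (pr - base) * 100 / base )) %"    # ~1h, roughly +30-35%
```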
Urgh, that's bad. Is that not just an effect of the cache being cold on this one? I would have expected the package to only be checked out once per worker, and from then on the build times should be the same.
I would have expected the package to only be checked out once per worker, and from then on the build times should be the same.
It's supposed to be fetched once by gitcache, but it seems the copy from one local location to another (in RAM) adds a little time to each build, which in total becomes significant. But I'm not sure about this analysis. Maybe @kaspar030 can confirm?
To rule out caching effects we could just merge this and if build times don’t recover after a few runs, revert this again.
What do you think?
bors try
I rebased this PR and fixed Murdock, but I still don't really know the reason for the ~30% build time increase, nor whether that's acceptable. I do believe that this PR is very useful (less vendor code in the code base, consistent CMSIS versions used, etc.), but it might have a significant CI build cost.
Hm, so now every Cortex-M build depends on that package instead of the in-tree version. That at least trashes all of ccache. Let's re-build this a couple of times.
so now every Cortex-M build depends on that package instead of the in-tree version. That at least trashes all of ccache. Let's re-build this a couple of times.
After a few runs the CI build time is reduced, so cache warmup indeed helps.