RIOT icon indicating copy to clipboard operation
RIOT copied to clipboard

[RFC] pkg/cmsis: use unique package for CMSIS headers, DSP and NN modules

Open aabadie opened this issue 2 years ago • 18 comments

Contribution description

The cpu/cortexm_common directory contains the Core CMSIS headers in a vendor subdirectory. The CMSIS-DSP and CMSIS-NN packages are both fetching code from the same ARM-Software/CMSIS_5 repository. This situation leads to:

  • not in sync code: I don't know what version of the Core headers is used in cortexm_common, in master CMSIS-DSP and CMSIS-NN don't use the same version of the code (5.4 for the former, 5.6 for the latter)
  • hard to maintain: a code update on one side (CMSIS-NN updated to 5.9.0 for example), might result in a failed build unless the vendor code is also updated.

This PR proposes to move all that in a single package and make the cortexm_common module to depend on it. cmsis-dsp and cmsis-nn becomes 2 extra modules of the cmsis package.

The last version, 5.9.0, is used by the cmsis package. Some adaptation were required in some i2c implementations that are defining a _start function. The CMSIS headers also contains a similar definition. Some changes were also required in the DSP code (mostly build system changes).

Also note that the cmsis package is only downloaded once (the very first time an application is built for a cortexm target) in build/pkg/cmsis, after that the same code is reused so there won't be any build overhead in the long run for a normal user.

Using a package for a such central component that is touching a very large part of the hardware supported by RIOT might have some impact on CI performance. Even if I find this problem secondary, it must be taken into account and that's why I marked this PR as RFC.

Testing procedure

  • Green Murdock
  • Test on various ARM boards should be performed

Issues/PRs references

None

aabadie avatar May 04 '22 19:05 aabadie

Hi! What kind of test or tests you think about? I have just clone your PR, compile blinky program and flash it to l031k6 (M0, nucleo-32), l073rz (M0, nucleo-64) and f428zi (M4, nucleo-144). Everything works perfect. I could tomorrow test it in such manner on l552ze-q, f334r8 and discovery stm32f469i-disco.

krzysztof-cabaj avatar May 05 '22 07:05 krzysztof-cabaj

What kind of test or tests you think about?

I think that running ./dist/tools/compile_and_test_for_board/compile_and_test_for_board.py . <board> --with-test-only --jobs=4 would be useful but will take some time (a couple of hours per board).

aabadie avatar May 05 '22 07:05 aabadie

Thanks for clarification. I try run mentioned tests tomorrow.

krzysztof-cabaj avatar May 05 '22 08:05 krzysztof-cabaj

This seems to almost pass the CI now but I don't understand the Rust failure: riot_wrappers.0bcd1c26-cgu.11:(.text.rust_begin_unwind+0xc): undefined reference to llvm_asm_is_not_supported_any_more'`. Any idea @chrysn ?

aabadie avatar May 06 '22 19:05 aabadie

Hi, during the weekend I run test on nucleo-l552ze-q. 218 test ended with success, 20 with failure - but in most cases due to lack of some python modules or additional programs. However, there are some strange timeouts. Below output from script summary - with my comments concerning failure reason.

ERROR:nucleo-l552ze-q:Tests failed: 20
Failures during compilation:
- [tests/pkg_emlearn](tests/pkg_emlearn/compilation.failed)
     "No module named 'emlearn'
- [tests/pkg_nanopb](tests/pkg_nanopb/compilation.failed)
    "protoc: Command not found"
- [tests/suit_manifest](tests/suit_manifest/compilation.failed)
    "No module named 'cbor2'

Failures during test:
- [examples/micropython](examples/micropython/test.failed)
    ests/congure_reno](tests/congure_reno/test.failed)
    "No module named 'riotctrl'
- [tests/congure_test](tests/congure_test/test.failed)
    "No module named 'riotctrl'
- [tests/cpp11_condition_variable](tests/cpp11_condition_variable/test.failed)
     pexpect.exceptions.TIMEOUT: Timeout exceeded
- [tests/gnrc_rpl](tests/gnrc_rpl/test.failed)
    "No module named 'riotctrl'
- [tests/malloc_thread_safety](tests/malloc_thread_safety/test.failed)
  "Timeout in expect script at "child.expect("TEST PASSED")"
- [tests/pbkdf2](tests/pbkdf2/test.failed)
   lack of socat program
- [tests/periph_timer_short_relative_set](tests/periph_timer_short_relative_set/test.failed)
    "too long delay, aborted after 1001 ... TEST FAILED"
- [tests/periph_wdt](tests/periph_wdt/test.failed)
    "Timeout in expect script" -> line 54
- [tests/pthread_rwlock](tests/pthread_rwlock/test.failed)
    "Timeout in expect script" -> line 13  + stack pointer corrupted
- [tests/riotboot](tests/riotboot/test.failed)
    "pexcpect.exceptions.TIMEOUT: Timeout exceeed."
- [tests/riotboot_hdr](tests/riotboot_hdr/test.failed)
    "pexcpect.exceptions.TIMEOUT: Timeout exceeed."
- [tests/sched_change_priority](tests/sched_change_priority/test.failed)
    lack of socat program
- [tests/shell](tests/shell/test.failed)
    lack of socat program
- [tests/sys_sema_inv](tests/sys_sema_inv/test.failed)
    "Timeout in expect script" -> line 12  + stack pointer corrupted
- [tests/turo](tests/turo/test.failed)
    "No module named 'riotctrl'
- [tests/unittests](tests/unittests/test.failed)
    "pexcpect.exceptions.TIMEOUT: Timeout exceeed."

If any output files are needed I could place them in forum.

I could run these tests at other boards - are you interested? Should I first install all remaining software?

krzysztof-cabaj avatar May 09 '22 12:05 krzysztof-cabaj

I could run these tests at other boards - are you interested?

That would be nice

Should I first install all remaining software?

You can use BUILD_IN_DOCKER=1 and it will fix all build failures since the Docker image provides all tools. For running the test, you have to install socat (apt-get) and riotctrl (pip install) locally and they should workl.

aabadie avatar May 10 '22 07:05 aabadie

@chrysn, any idea about this Rust related failure reported by Murdock ?

riot_wrappers.0bcd1c26-cgu.11:(.text.rust_begin_unwind+0xc): undefined reference to llvm_asm_is_not_supported_any_more'

aabadie avatar May 10 '22 13:05 aabadie

Yes: due to mismatched versions (expecting a permament fix within this week), assembly called from Rust currently needs manual translation in an explicit table that accounts for syntax differences.

The new CMSIS probably changed something about how a particular assembly instruction ("interrupts on", "interrupts off" and "check interrupt status" are the ones I know from there off my head) are expressed, and they need a new translation. If this is urgent, I can update the translations (or point you to where to look to save you the time if you want to do it yourself); otherwise, this will resolve itself with the CI update that introduces the newer C2Rust.

(As for the definition of 'urgent' ... their tracking issue https://github.com/immunant/c2rust/issues/388 for the release went from "4 things we want to get done before we push a release" to "all 4 checked" in one week, has been at "we just need to do the release" for less than a day now, and I expect they just reap some cheap merges before pushing it; then it should be a matter of a rather straightforward CI update).

chrysn avatar May 10 '22 14:05 chrysn

Thanks for the answer @chrysn. This PR is still RFC so not really urgent for now on the actual RIOT code base are quite minimal (mostly code removal and minimal updates in tests) so should be quite easy to rebase from time to time. The main question is the CI build time that seems to increase.

aabadie avatar May 10 '22 15:05 aabadie

Test results for nucleo-l073rz. 234 tests with success 7 tests with failure - more details below (failuressummary.md with my comments)

Failures during compilation:
- [tests/pkg_emlearn](tests/pkg_emlearn/compilation.failed)
- [tests/rust_minimal](tests/rust_minimal/compilation.failed)

Failures during test:
- [examples/micropython](examples/micropython/test.failed)
    Unexpected end of file in expect script -> 01-run.py line 15
- [tests/gnrc_rpl](tests/gnrc_rpl/test.failed)
    pexpect.exceptions.EOF -> 01-run.py line 175, 168, 93
- [tests/pbkdf2](tests/pbkdf2/test.failed)
    pexpect.exceptions.TIMEOUT -> 02-random.py line 80
- [tests/periph_timer_short_relative_set](tests/periph_timer_short_relative_set/test.failed)
    Timeout in expect script -> 01-run.py line 14
- [tests/riotboot](tests/riotboot/test.failed)
   Timeout in expect script -> 01-run.py line 34

krzysztof-cabaj avatar May 12 '22 07:05 krzysztof-cabaj

Test results for nucleo-l432kc. 233 tests with success 10 tests with failure - more details below (failuressummary.md with my comments)

Failures during compilation:
- [tests/pkg_emlearn](tests/pkg_emlearn/compilation.failed)
- [tests/pkg_tflite-micro](tests/pkg_tflite-micro/compilation.failed)
   This test is 'too big' for this board - I add PR 18101
- [tests/rust_minimal](tests/rust_minimal/compilation.failed)

Failures during test:
- [examples/micropython](examples/micropython/test.failed)
   Unexpected end of file in expect script -> 01-run.py line 15
- [tests/gnrc_rpl](tests/gnrc_rpl/test.failed)
   pexepect.exceptions.EOF -> 01-run.py line 175, 168, 93
- [tests/pbkdf2](tests/pbkdf2/test.failed)
   pexepect.exceptios.TIMEOUT -> 02-random.py line 80
- [tests/periph_timer_short_relative_set](tests/periph_timer_short_relative_set/test.failed)    
   Timeout in expect script -> 01-run.py line 14
- [tests/pkg_utensor](tests/pkg_utensor/test.failed)
   Timeout in expect script -> 01-run.py line 8
- [tests/pkg_wolfssl](tests/pkg_wolfssl/test.failed)
   pexpect.exceptions.TIMEOUT -> 01-run.py line 41
- [tests/riotboot](tests/riotboot/test.failed)
   Timeout in expect script -> 01-run.py line 34

krzysztof-cabaj avatar May 13 '22 11:05 krzysztof-cabaj

Funny how just as this started still showing assembler errors with the latest CI update, I got a comment from the C2Rust authors on a fix I did for RISCV that it needs to be generic -- then it'd have caught this already. Spooling this for another round of CI updates, sorry for the delay.

chrysn avatar May 13 '22 17:05 chrysn

@chrysn what is the situation on the Rust side regarding this PR ?

aabadie avatar Jun 14 '22 14:06 aabadie

It just needs https://github.com/RIOT-OS/riotdocker/pull/196 (or the update to the C2Rust version without the build system changes at https://github.com/RIOT-OS/riotdocker/pull/199) to be ACKed. If you prefer to review just the bump in version, I can also do this on the older build infrastructure (the one I'm largely removing in 196/199), but then it's an interleaved-step thing with docker uploads, whereas the new version is just let bors merge it, github actions push it, and restart the worker.

chrysn avatar Jun 14 '22 14:06 chrysn

Awesome, it's all green! ... but took 3h23m...

aabadie avatar Jun 16 '22 06:06 aabadie

Awesome, it's all green! ... but took 3h23m...

It's not the PR, it's the CI. Even #18208 took 3 hours and that was a docs-only change (which just due to some bug in detecting those was fully built).

chrysn avatar Jun 16 '22 07:06 chrysn

Even #18208 took 3 hours

Well it took exactly 2h47m, it's still 36 minutes less. So this PR adds ~20% total CPU time, it's a lot TBH.

aabadie avatar Jun 16 '22 07:06 aabadie

Well it took exactly 2h47m, it's still 36 minutes less. So this PR adds ~20% total CPU time, it's a lot TBH.

Times vary too vastly to tell. #17680 is a local change (should not affect most of the builds) and just took 5 hours.

If there is more reason to worry than the CI times, you could run a local test between the branches on any representative board; 1 minute of build time locally probably tells more than several hours on CI.

chrysn avatar Jun 16 '22 07:06 chrysn

Murdock results

:heavy_check_mark: PASSED

eac9cdfd4a7f44ac3a3b45e047c5ae28cc37b6b6 pkg/gecko_sdk: disable cast-align globally

Success Failures Total Runtime
6931 0 6931 09m:49s

Artifacts

riot-ci avatar Sep 29 '22 07:09 riot-ci

Before merging this I'd like to agree on the strategy and check the impact on build times.

bors merge-

aabadie avatar Dec 19 '22 13:12 aabadie

Canceled.

bors[bot] avatar Dec 19 '22 13:12 bors[bot]

Here are some Murdock stats:

So it seems this PR adds 1h in total, which is 30% increase on a subset of boards (2 ARM based out of 4).

aabadie avatar Dec 19 '22 13:12 aabadie

Urg that's bad. Is that not just an effect of the cache being cold on this one? I would have expected the package to be only checked out once per worker and from the on the build times should be the same.

benpicco avatar Dec 19 '22 13:12 benpicco

I would have expected the package to be only checked out once per worker and from the on the build times should be the same.

It's supposed to be fetched once by gitcache but it seems the copy from one local location to another (in RAM) adds a very little time for each build which in total becomes significant. But I'm not sure of this analysis. Maybe @kaspar030 can confirm ?

aabadie avatar Dec 19 '22 14:12 aabadie

To rule out caching effects we could just merge this and if build times don’t recover after a few runs, revert this again.

What do you think?

benpicco avatar Dec 29 '22 01:12 benpicco

bors try

benpicco avatar Dec 29 '22 01:12 benpicco

try

Build failed:

bors[bot] avatar Dec 29 '22 03:12 bors[bot]

I rebased this PR and fixed Murdock but I still don't really know the reason of the ~30% build time increase and if that's acceptable. I do believe that this PR is very useful (less vendor code in the code base, consistent CMSIS versions used, etc) but it might have a significant CI build cost.

aabadie avatar May 14 '23 19:05 aabadie

Hm, so now every Cortex-M build depends on that package instead of the in-tree version. That at least trashes all of ccache. Let's re-build this a couple of times.

kaspar030 avatar May 15 '23 06:05 kaspar030

so now every Cortex-M build depends on that package instead of the in-tree version. That at least trashes all of ccache. Let's re-build this a couple of times.

After a few runs the CI build times is reduced, so cache warmup indeed helps.

aabadie avatar May 15 '23 08:05 aabadie