core/clist_sort: rewrite
Contribution description
This rewrites clist_sort() with a different implementation of a merge sort. The goal is that the code is more readable and passes -fanalyzer without GCC barking, but it turns out to be a bit faster as well.
Testing procedure
This currently includes two other PRs for better testing, namely:
- [x] https://github.com/RIOT-OS/RIOT/pull/21403
- [ ] https://github.com/RIOT-OS/RIOT/pull/21402
(This PR needs to wait for those to be merged first and rebased then.)
Issues/PRs references
More complex but faster alternative to https://github.com/RIOT-OS/RIOT/pull/21398
Benchmarks (all SAME54-XPRO)
master
main(): This is RIOT! (Version: 2025.04-devel-528-g3e231-tests/bench/runtime_coreapis)
Runtime of Selected Core API functions
nop loop: 33335us --- 0.033us per call --- 29998500 calls per sec
mutex_init(): 2us --- 0.000us per call --- 1783793664 calls per sec
mutex lock/unlock: 541669us --- 0.541us per call --- 1846145 calls per sec
thread_flags_set(): 275002us --- 0.275us per call --- 3636337 calls per sec
thread_flags_clear(): 266669us --- 0.266us per call --- 3749967 calls per sec
thread flags set/wait any: 850002us --- 0.850us per call --- 1176467 calls per sec
thread flags set/wait all: 708335us --- 0.708us per call --- 1411761 calls per sec
thread flags set/wait one: 841669us --- 0.841us per call --- 1188115 calls per sec
msg_try_receive(): 416669us --- 0.416us per call --- 2399986 calls per sec
msg_avail(): 141669us --- 0.141us per call --- 7058707 calls per sec
clist_sort (#4, rev): 4494us --- 4.494us per call --- 222518 calls per sec
clist_sort (#4, prng): 4786us --- 4.786us per call --- 208942 calls per sec
clist_sort (#4, sort): 4485us --- 4.485us per call --- 222965 calls per sec
clist_sort (#4, ≈sort): 4752us --- 4.752us per call --- 210437 calls per sec
clist_sort (#8, rev): 10677us --- 10.677us per call --- 93659 calls per sec
clist_sort (#8, prng): 11777us --- 11.777us per call --- 84911 calls per sec
clist_sort (#8, sort): 10668us --- 10.668us per call --- 93738 calls per sec
clist_sort (#8, ≈sort): 11636us --- 11.636us per call --- 85940 calls per sec
clist_sort (#16, rev): 25294us --- 25.294us per call --- 39535 calls per sec
clist_sort (#16, prng): 27869us --- 27.869us per call --- 35882 calls per sec
clist_sort (#16, sort): 25286us --- 25.286us per call --- 39547 calls per sec
clist_sort (#16, ≈sort): 27836us --- 27.836us per call --- 35924 calls per sec
clist_sort (#32, rev): 59177us --- 59.177us per call --- 16898 calls per sec
clist_sort (#32, prng): 66877us --- 66.877us per call --- 14952 calls per sec
clist_sort (#32, sort): 59169us --- 59.169us per call --- 16900 calls per sec
clist_sort (#32, ≈sort): 65069us --- 65.069us per call --- 15368 calls per sec
clist_sort (#64, rev): 136394us --- 136.394us per call --- 7331 calls per sec
clist_sort (#64, prng): 157343us --- 157.343us per call --- 6355 calls per sec
clist_sort (#64, sort): 136386us --- 136.386us per call --- 7332 calls per sec
clist_sort (#64, ≈sort): 149169us --- 149.169us per call --- 6703 calls per sec
clist_sort (#128, rev): 309877us --- 309.877us per call --- 3227 calls per sec
clist_sort (#128, prng): 363552us --- 363.552us per call --- 2750 calls per sec
clist_sort (#128, sort): 309868us --- 309.868us per call --- 3227 calls per sec
clist_sort (#128, ≈sort): 336602us --- 336.602us per call --- 2970 calls per sec
clist_sort (#256, rev): 695093us --- 695.093us per call --- 1438 calls per sec
clist_sort (#256, prng): 821686us --- 821.686us per call --- 1217 calls per sec
clist_sort (#256, sort): 695086us --- 695.086us per call --- 1438 calls per sec
clist_sort (#256, ≈sort): 749902us --- 749.902us per call --- 1333 calls per sec
[SUCCESS]
Insertion Sort (https://github.com/RIOT-OS/RIOT/pull/21398)
main(): This is RIOT! (Version: 2025.04-devel-526-g66f4a-core/clist_sort_insertion_sort)
Runtime of Selected Core API functions
nop loop: 33335us --- 0.033us per call --- 29998500 calls per sec
mutex_init(): 2us --- 0.000us per call --- 1783793664 calls per sec
mutex lock/unlock: 541669us --- 0.541us per call --- 1846145 calls per sec
thread_flags_set(): 275002us --- 0.275us per call --- 3636337 calls per sec
thread_flags_clear(): 266669us --- 0.266us per call --- 3749967 calls per sec
thread flags set/wait any: 941669us --- 0.941us per call --- 1061944 calls per sec
thread flags set/wait all: 800002us --- 0.800us per call --- 1249996 calls per sec
thread flags set/wait one: 933335us --- 0.933us per call --- 1071426 calls per sec
msg_try_receive(): 416668us --- 0.416us per call --- 2399992 calls per sec
msg_avail(): 133336us --- 0.133us per call --- 7499850 calls per sec
clist_sort (#4, rev): 2411us --- 2.411us per call --- 414765 calls per sec
clist_sort (#4, prng): 2685us --- 2.685us per call --- 372439 calls per sec
clist_sort (#4, sort): 1636us --- 1.636us per call --- 611246 calls per sec
clist_sort (#4, ≈sort): 1710us --- 1.710us per call --- 584795 calls per sec
clist_sort (#8, rev): 5510us --- 5.510us per call --- 181488 calls per sec
clist_sort (#8, prng): 6961us --- 6.961us per call --- 143657 calls per sec
clist_sort (#8, sort): 2869us --- 2.869us per call --- 348553 calls per sec
clist_sort (#8, ≈sort): 4768us --- 4.768us per call --- 209731 calls per sec
clist_sort (#16, rev): 11710us --- 11.710us per call --- 85397 calls per sec
clist_sort (#16, prng): 17485us --- 17.485us per call --- 57191 calls per sec
clist_sort (#16, sort): 5336us --- 5.336us per call --- 187406 calls per sec
clist_sort (#16, ≈sort): 8302us --- 8.302us per call --- 120452 calls per sec
clist_sort (#32, rev): 24111us --- 24.111us per call --- 41474 calls per sec
clist_sort (#32, prng): 58852us --- 58.852us per call --- 16991 calls per sec
clist_sort (#32, sort): 10269us --- 10.269us per call --- 97380 calls per sec
clist_sort (#32, ≈sort): 15369us --- 15.369us per call --- 65066 calls per sec
clist_sort (#64, rev): 48910us --- 48.910us per call --- 20445 calls per sec
clist_sort (#64, prng): 200985us --- 200.985us per call --- 4975 calls per sec
clist_sort (#64, sort): 20135us --- 20.135us per call --- 49664 calls per sec
clist_sort (#64, ≈sort): 29502us --- 29.502us per call --- 33896 calls per sec
clist_sort (#128, rev): 98510us --- 98.510us per call --- 10151 calls per sec
clist_sort (#128, prng): 804885us --- 804.885us per call --- 1242 calls per sec
clist_sort (#128, sort): 39869us --- 39.869us per call --- 25082 calls per sec
clist_sort (#128, ≈sort): 57768us --- 57.768us per call --- 17310 calls per sec
clist_sort (#256, rev): 197710us --- 197.710us per call --- 5057 calls per sec
clist_sort (#256, prng): 3033951us --- 3033.951us per call --- 329 calls per sec
clist_sort (#256, sort): 79335us --- 79.335us per call --- 12604 calls per sec
clist_sort (#256, ≈sort): 114302us --- 114.302us per call --- 8748 calls per sec
[SUCCESS]
This PR
main(): This is RIOT! (Version: 2025.04-devel-533-g6c31a8-core/clist_sort_merge_sort_rewrite)
Runtime of Selected Core API functions
nop loop: 33335us --- 0.033us per call --- 29998500 calls per sec
mutex_init(): 2us --- 0.000us per call --- 1783793664 calls per sec
mutex lock/unlock: 541669us --- 0.541us per call --- 1846145 calls per sec
thread_flags_set(): 275002us --- 0.275us per call --- 3636337 calls per sec
thread_flags_clear(): 266669us --- 0.266us per call --- 3749967 calls per sec
thread flags set/wait any: 850002us --- 0.850us per call --- 1176467 calls per sec
thread flags set/wait all: 708335us --- 0.708us per call --- 1411761 calls per sec
thread flags set/wait one: 841668us --- 0.841us per call --- 1188116 calls per sec
msg_try_receive(): 416669us --- 0.416us per call --- 2399986 calls per sec
msg_avail(): 141668us --- 0.141us per call --- 7058757 calls per sec
clist_sort (#4, rev): 5128us --- 5.128us per call --- 195007 calls per sec
clist_sort (#4, prng): 5427us --- 5.427us per call --- 184263 calls per sec
clist_sort (#4, sort): 5110us --- 5.110us per call --- 195694 calls per sec
clist_sort (#4, ≈sort): 5361us --- 5.361us per call --- 186532 calls per sec
clist_sort (#8, rev): 8936us --- 8.936us per call --- 111906 calls per sec
clist_sort (#8, prng): 10185us --- 10.185us per call --- 98183 calls per sec
clist_sort (#8, sort): 8919us --- 8.919us per call --- 112120 calls per sec
clist_sort (#8, ≈sort): 10044us --- 10.044us per call --- 99561 calls per sec
clist_sort (#16, rev): 17144us --- 17.144us per call --- 58329 calls per sec
clist_sort (#16, prng): 20460us --- 20.460us per call --- 48875 calls per sec
clist_sort (#16, sort): 17128us --- 17.128us per call --- 58383 calls per sec
clist_sort (#16, ≈sort): 20210us --- 20.210us per call --- 49480 calls per sec
clist_sort (#32, rev): 36577us --- 36.577us per call --- 27339 calls per sec
clist_sort (#32, prng): 46569us --- 46.569us per call --- 21473 calls per sec
clist_sort (#32, sort): 36560us --- 36.560us per call --- 27352 calls per sec
clist_sort (#32, ≈sort): 43769us --- 43.769us per call --- 22847 calls per sec
clist_sort (#64, rev): 78635us --- 78.635us per call --- 12716 calls per sec
clist_sort (#64, prng): 105711us --- 105.711us per call --- 9459 calls per sec
clist_sort (#64, sort): 78619us --- 78.619us per call --- 12719 calls per sec
clist_sort (#64, ≈sort): 94285us --- 94.285us per call --- 10606 calls per sec
clist_sort (#128, rev): 169203us --- 169.203us per call --- 5910 calls per sec
clist_sort (#128, prng): 238669us --- 238.669us per call --- 4189 calls per sec
clist_sort (#128, sort): 169186us --- 169.186us per call --- 5910 calls per sec
clist_sort (#128, ≈sort): 201977us --- 201.977us per call --- 4951 calls per sec
clist_sort (#256, rev): 362402us --- 362.402us per call --- 2759 calls per sec
clist_sort (#256, prng): 527886us --- 527.886us per call --- 1894 calls per sec
clist_sort (#256, sort): 362386us --- 362.386us per call --- 2759 calls per sec
clist_sort (#256, ≈sort): 446302us --- 446.302us per call --- 2240 calls per sec
[SUCCESS]
Nice!
Somehow on the nrf52840dk, it seems this implementation gets slower a bit quicker with larger inputs. like, master sorts 16k nodes 150000us whereas this pr needs 500000us per call ("almost sorted"). Any idea?
That is expected. The merge sort is best tuned, when merging only equally sized lists. With lists this either requires a lot of traversal or some caching of where the lists heads are. This PR uses the caching, which results in better speed when the size of the fixed cache is good, but worse speed, when it is too small.
Tuning CONFIG_CLIST_SORT_BUF_SIZE would help. I guess when spending 16K * sizeof(clist_node_t) on RAM just for list organization, spending a bit more stack size by bumping CONFIG_CLIST_SORT_BUF_SIZE would not be out of the question.