use SIMD.jl for x86 and `naive_findmin` for `:aarch64`
Just a test of whether I understand how to replace the `fast_findmin` implementation, as in #83.
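A minimal sketch of the idea (illustrative only: `naive_findmin` follows the PR title, but the body and the architecture dispatch shown here are assumptions, not the PR's actual code):

```julia
# Illustrative sketch, not the PR's actual implementation.
# Plain scalar scan over the first `n` entries of `v`, returning (value, index).
# On aarch64 the compiler's own auto-vectorisation of this loop can be
# competitive with an explicit SIMD.jl kernel.
@inline function naive_findmin(v::AbstractVector{Float64}, n::Int)
    best = v[1]
    idx = 1
    @inbounds for i in 2:n
        if v[i] < best
            best = v[i]
            idx = i
        end
    end
    return best, idx
end

# Hypothetical compile-time selection; on x86 this would point at the
# existing SIMD.jl-based kernel (not shown here):
# const fast_findmin = Sys.ARCH === :aarch64 ? naive_findmin : simd_findmin
```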
after fixing that, on Apple M4, I get:
julia +1.12 --project instrumented-jetreco.jl --algorithm=AntiKt -R 0.4 ../test/data/events.pp13TeV.hepmc3.gz -m 32
# main
Processed 100 events 32 times
Average time per event 135.5111340625 ± 2.9835167009602035 μs
Lowest time per event 131.43084000000002 μs
# this PR
Processed 100 events 32 times
Average time per event 138.62045531249998 ± 3.5853189437051123 μs
Lowest time per event 134.35666 μs
# main with 1.12-beta3
Processed 100 events 32 times
Average time per event 147.96477937499995 ± 4.669684823430356 μs
Lowest time per event 139.74125 μs
# this PR with 1.12-beta3
Processed 100 events 32 times
Average time per event 152.88778656250003 ± 5.847938860407047 μs
Lowest time per event 142.36916000000002 μs
this doesn't look bad??!? @graeme-a-stewart Although, there's an unfortunate degradation on 1.12
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 80.46%. Comparing base (d3c97f3) to head (bfbae69).
:warning: Report is 3 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #151 +/- ##
==========================================
+ Coverage 80.00% 80.46% +0.46%
==========================================
Files 21 21
Lines 1315 1341 +26
==========================================
+ Hits 1052 1079 +27
+ Misses 263 262 -1
Nvidia Grace
julia +1.12 --project instrumented-jetreco.jl --algorithm=AntiKt -R 0.4 ../test/data/events.pp13TeV.hepmc3.gz -m 32
# main
Processed 100 events 32 times
Average time per event 299.6035559375 ± 7.914782725959611 μs
Lowest time per event 280.66586 μs
# this PR
Processed 100 events 32 times
Average time per event 303.1709484375001 ± 8.819781288405308 μs
Lowest time per event 286.57968 μs
# main with 1.12-beta3
Processed 100 events 32 times
Average time per event 304.46982093750006 ± 13.922266876931305 μs
Lowest time per event 281.39805 μs
# this PR with 1.12-beta3
Processed 100 events 32 times
Average time per event 308.6548684375 ± 11.580493168517764 μs
Lowest time per event 288.65558 μs
AMD Ryzen 7 5700G
# main
Processed 100 events 32 times
Average time per event 188.28086562500002 ± 5.529621156269813 μs
Lowest time per event 181.3509 μs
# this PR
Processed 100 events 32 times
Average time per event 197.2376540625 ± 6.685658460456483 μs
Lowest time per event 186.34546999999998 μs
# main with 1.12-beta3
Processed 100 events 32 times
Average time per event 193.79122999999998 ± 11.42792473599183 μs
Lowest time per event 176.08296 μs
# this PR with 1.12-beta3
Processed 100 events 32 times
Average time per event 205.20970093750003 ± 9.736581454767695 μs
Lowest time per event 187.33396000000002 μs
yeah ok, so we use SIMD.jl for x86 and this PR for ARM?
what happened to CI lol...
For completeness, with SIMD for x86_64
AMD Ryzen 7 5700G
# main
Processed 100 events 32 times
Average time per event 187.38033500000006 ± 5.74337091484025 μs
Lowest time per event 177.03909 μs
# this PR
Processed 100 events 32 times
Average time per event 191.52783343750005 ± 6.355636113548961 μs
Lowest time per event 179.90162 μs
# main with 1.12-beta3
Processed 100 events 32 times
Average time per event 195.5927890625 ± 12.11654841840912 μs
Lowest time per event 178.41858 μs
# this PR with 1.12-beta3
Processed 100 events 32 times
Average time per event 194.36534874999998 ± 10.480610090304218 μs
Lowest time per event 176.65239000000003 μs
https://github.com/JuliaLang/julia/pull/58418
I will test against this PR later
Would be good to run CI on aarch64 (Linux and/or macOS), no?
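For reference, a hedged sketch of what that could look like with GitHub Actions (the job name is made up; `macos-14` runners are Apple Silicon, and `ubuntu-24.04-arm` runners are the Linux aarch64 option available on public repositories):

```yaml
# Hypothetical workflow fragment -- names are illustrative.
jobs:
  test-aarch64:
    strategy:
      matrix:
        os: [macos-14, ubuntu-24.04-arm]   # both are aarch64 runners
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: julia-actions/setup-julia@v2
        with:
          version: '1.12'
      - uses: julia-actions/julia-runtest@v1
```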
julia #58418 on M4
Processed 100 events 32 times
Average time per event 151.89809874999995 ± 5.095528318872066 μs
Lowest time per event 143.37917 μs
so this is the same as nightly, and slower than Julia 1.11
I took a look at performance on my M2 Pro (where the original patch in #83 was sucky). For 1.11.4 I got:
~/.julia/dev/JetReconstruction/examples/ [main] ./benchmark.sh
pp 14TeV Tiled
Processed 100 events 16 times
Average time per event 171.11460874999997 ± 4.616430628080113 μs
Lowest time per event 165.90792000000002 μs
pp 14 TeV Plain
Processed 100 events 16 times
Average time per event 674.727474375 ± 18.35030922586858 μs
Lowest time per event 656.97167 μs
ee H Durham
Processed 100 events 16 times
Average time per event 28.442970625 ± 0.30894069468529917 μs
Lowest time per event 27.61167 μs
and with this patch:
~/.julia/dev/JetReconstruction/examples/ [naive_findmin] ./benchmark.sh 256
pp 14TeV Tiled
Processed 100 events 16 times
Average time per event 175.676691875 ± 12.098880482560652 μs
Lowest time per event 167.775 μs
pp 14 TeV Plain
Processed 100 events 16 times
Average time per event 671.201953125 ± 5.289937381395405 μs
Lowest time per event 662.80833 μs
ee H Durham
Processed 100 events 16 times
Average time per event 28.853904375 ± 0.5296152717583557 μs
Lowest time per event 27.69208 μs
Performance loss is marginal - we can live with it.
I also looked at 1.12.0-beta3:
~/.julia/dev/JetReconstruction/examples/ [main] ./benchmark.sh
pp 14TeV Tiled
Processed 100 events 16 times
Average time per event 186.92546749999997 ± 10.934245123593346 μs
Lowest time per event 175.50416 μs
pp 14 TeV Plain
Processed 100 events 16 times
Average time per event 684.5363275000001 ± 11.514369132124589 μs
Lowest time per event 674.45708 μs
ee H Durham
Processed 100 events 16 times
Average time per event 29.18406375 ± 1.3979762370840427 μs
Lowest time per event 28.45625 μs
~/.julia/dev/JetReconstruction/examples/ [naive_findmin] ./benchmark.sh
pp 14TeV Tiled
Processed 100 events 16 times
Average time per event 189.20445375 ± 8.512180852002729 μs
Lowest time per event 177.52208 μs
pp 14 TeV Plain
Processed 100 events 16 times
Average time per event 683.996173125 ± 9.641498982006677 μs
Lowest time per event 674.17833 μs
ee H Durham
Processed 100 events 16 times
Average time per event 30.5012775 ± 0.44837712099670635 μs
Lowest time per event 29.78125 μs
So the performance loss with 1.12 is more significant and we should report that.
I can't think of anything specific that would cause this. In fact, I think the micro benchmark for findmin() shows that 1.12 and nightly are better than 1.11.5, so I suspect there may be a regression somewhere else.
Further supporting evidence is that even with LV.jl, 1.12 is slower, and LV.jl should be independent of the Julia version (I don't believe LLVM regressed)
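For reference, the kind of Base-only micro benchmark being referred to might look like this (a sketch; the array size and iteration count are arbitrary assumptions):

```julia
# Sketch of a findmin micro benchmark using only Base (no BenchmarkTools),
# comparing Base's findmin against a plain scalar loop across Julia versions.
function scalar_findmin(v::Vector{Float64})
    best, idx = v[1], 1
    @inbounds for i in 2:length(v)
        if v[i] < best
            best, idx = v[i], i
        end
    end
    return best, idx
end

v = rand(512)
scalar_findmin(v); findmin(v)  # compile both before timing

t_base   = minimum(@elapsed(findmin(v)) for _ in 1:10_000)
t_scalar = minimum(@elapsed(scalar_findmin(v)) for _ in 1:10_000)
println("Base findmin: $(t_base * 1e9) ns, scalar loop: $(t_scalar * 1e9) ns")
```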
FYI I also want to merge the immutable structs patch, #146, before considering merging this one
Just to note, LoopVectorization seems to be broken with Julia nightly for 2 weeks or so
https://github.com/JuliaSIMD/LoopVectorization.jl/blob/945c2f7b92df094cafe91ec29d56ee2fc29c5726/README.md?plain=1#L17-L19
let's see if this pans out?
Maybe the main work happens in the dependencies, but since the announcement they have had 9 commits, of which only 1 wasn't documentation, CI, or compat bumps
VectorizationBase still has this https://github.com/JuliaSIMD/VectorizationBase.jl/blob/129a0e533202ee3bf1b60a925f74ad153e8bcdd7/README.md?plain=1#L12
Maybe we should get a collecting can and contribute to point 1... 💰 ?
I was experimenting with Reactant.jl and from my understanding it's currently not useful for us as it requires separate compilation for each array size
> as it requires separate compilation for each array size
this is correct
As LoopVectorisation lives on, let's close this PR but keep the branch, in case it's needed later.
Branches of closed PRs can always be restored at any point.
There was another problem in the context of static compilation: LoopVectorization prevents trimming and is also very slow with PackageCompiler
Noted! But right now LoopVectorisation is still the best solution for the Julia repository. I don't care much about PackageCompiler, but understanding / fixing the trimming issue would be nice.