use SIMD.jl for x86 and `naive_findmin` for `:aarch64`
Just a test of whether I understand how to replace the `fast_findmin` implementation, as in #83.
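A minimal sketch of the idea (illustrative only: `naive_findmin` follows the PR title, but the body and the architecture dispatch shown here are assumptions, not the PR's actual code):

```julia
# Illustrative sketch, not the PR's actual implementation.
# Plain scalar scan over the first `n` entries of `v`, returning (value, index).
# On aarch64 the compiler's own auto-vectorisation of this loop can be
# competitive with an explicit SIMD.jl kernel.
@inline function naive_findmin(v::AbstractVector{Float64}, n::Int)
    best = v[1]
    idx = 1
    @inbounds for i in 2:n
        if v[i] < best
            best = v[i]
            idx = i
        end
    end
    return best, idx
end

# Hypothetical compile-time selection; on x86 this would point at the
# existing SIMD.jl-based kernel (not shown here):
# const fast_findmin = Sys.ARCH === :aarch64 ? naive_findmin : simd_findmin
```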
after fixing that, on Apple M4, I get:
julia +1.12 --project instrumented-jetreco.jl --algorithm=AntiKt -R 0.4 ../test/data/events.pp13TeV.hepmc3.gz -m 32
# main
Processed 100 events 32 times
Average time per event 135.5111340625 ± 2.9835167009602035 μs
Lowest time per event 131.43084000000002 μs
# this PR
Processed 100 events 32 times
Average time per event 138.62045531249998 ± 3.5853189437051123 μs
Lowest time per event 134.35666 μs
# main with 1.12-beta3
Processed 100 events 32 times
Average time per event 147.96477937499995 ± 4.669684823430356 μs
Lowest time per event 139.74125 μs
# this PR with 1.12-beta3
Processed 100 events 32 times
Average time per event 152.88778656250003 ± 5.847938860407047 μs
Lowest time per event 142.36916000000002 μs
this doesn't look bad??!? @graeme-a-stewart Although, there's an unfortunate degradation on 1.12
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 80.46%. Comparing base (d3c97f3) to head (bfbae69).
:warning: Report is 3 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #151 +/- ##
==========================================
+ Coverage 80.00% 80.46% +0.46%
==========================================
Files 21 21
Lines 1315 1341 +26
==========================================
+ Hits 1052 1079 +27
+ Misses 263 262 -1
Nvidia Grace
julia +1.12 --project instrumented-jetreco.jl --algorithm=AntiKt -R 0.4 ../test/data/events.pp13TeV.hepmc3.gz -m 32
# main
Processed 100 events 32 times
Average time per event 299.6035559375 ± 7.914782725959611 μs
Lowest time per event 280.66586 μs
# this PR
Processed 100 events 32 times
Average time per event 303.1709484375001 ± 8.819781288405308 μs
Lowest time per event 286.57968 μs
# main with 1.12-beta3
Processed 100 events 32 times
Average time per event 304.46982093750006 ± 13.922266876931305 μs
Lowest time per event 281.39805 μs
# this PR with 1.12-beta3
Processed 100 events 32 times
Average time per event 308.6548684375 ± 11.580493168517764 μs
Lowest time per event 288.65558 μs
AMD Ryzen 7 5700G
# main
Processed 100 events 32 times
Average time per event 188.28086562500002 ± 5.529621156269813 μs
Lowest time per event 181.3509 μs
# this PR
Processed 100 events 32 times
Average time per event 197.2376540625 ± 6.685658460456483 μs
Lowest time per event 186.34546999999998 μs
# main with 1.12-beta3
Processed 100 events 32 times
Average time per event 193.79122999999998 ± 11.42792473599183 μs
Lowest time per event 176.08296 μs
# this PR with 1.12-beta3
Processed 100 events 32 times
Average time per event 205.20970093750003 ± 9.736581454767695 μs
Lowest time per event 187.33396000000002 μs
yeah ok, so we use SIMD.jl for x86 and this PR for ARM?
what happened to CI lol...
For completeness, with SIMD for x86_64
AMD Ryzen 7 5700G
# main
Processed 100 events 32 times
Average time per event 187.38033500000006 ± 5.74337091484025 μs
Lowest time per event 177.03909 μs
# this PR
Processed 100 events 32 times
Average time per event 191.52783343750005 ± 6.355636113548961 μs
Lowest time per event 179.90162 μs
# main with 1.12-beta3
Processed 100 events 32 times
Average time per event 195.5927890625 ± 12.11654841840912 μs
Lowest time per event 178.41858 μs
# this PR with 1.12-beta3
Processed 100 events 32 times
Average time per event 194.36534874999998 ± 10.480610090304218 μs
Lowest time per event 176.65239000000003 μs
https://github.com/JuliaLang/julia/pull/58418
I will test against this PR later
Would be good to run CI on aarch64 (Linux and/or macOS), no?
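For reference, a hedged sketch of what that could look like with GitHub Actions (the job name is made up; `macos-14` runners are Apple Silicon, and `ubuntu-24.04-arm` runners are the Linux aarch64 option available on public repositories):

```yaml
# Hypothetical workflow fragment -- names are illustrative.
jobs:
  test-aarch64:
    strategy:
      matrix:
        os: [macos-14, ubuntu-24.04-arm]   # both are aarch64 runners
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: julia-actions/setup-julia@v2
        with:
          version: '1.12'
      - uses: julia-actions/julia-runtest@v1
```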
julia #58418 on M4
Processed 100 events 32 times
Average time per event 151.89809874999995 ± 5.095528318872066 μs
Lowest time per event 143.37917 μs
so this is the same as nightly, and slower than Julia 1.11
I took a look at performance on my M2 Pro (where the original patch in #83 was sucky). For 1.11.4 I got:
~/.julia/dev/JetReconstruction/examples/ [main] ./benchmark.sh
pp 14TeV Tiled
Processed 100 events 16 times
Average time per event 171.11460874999997 ± 4.616430628080113 μs
Lowest time per event 165.90792000000002 μs
pp 14 TeV Plain
Processed 100 events 16 times
Average time per event 674.727474375 ± 18.35030922586858 μs
Lowest time per event 656.97167 μs
ee H Durham
Processed 100 events 16 times
Average time per event 28.442970625 ± 0.30894069468529917 μs
Lowest time per event 27.61167 μs
and with this patch:
~/.julia/dev/JetReconstruction/examples/ [naive_findmin] ./benchmark.sh 256
pp 14TeV Tiled
Processed 100 events 16 times
Average time per event 175.676691875 ± 12.098880482560652 μs
Lowest time per event 167.775 μs
pp 14 TeV Plain
Processed 100 events 16 times
Average time per event 671.201953125 ± 5.289937381395405 μs
Lowest time per event 662.80833 μs
ee H Durham
Processed 100 events 16 times
Average time per event 28.853904375 ± 0.5296152717583557 μs
Lowest time per event 27.69208 μs
Performance loss is marginal - we can live with it.
I also looked at 1.12.0-beta3:
~/.julia/dev/JetReconstruction/examples/ [main] ./benchmark.sh
pp 14TeV Tiled
Processed 100 events 16 times
Average time per event 186.92546749999997 ± 10.934245123593346 μs
Lowest time per event 175.50416 μs
pp 14 TeV Plain
Processed 100 events 16 times
Average time per event 684.5363275000001 ± 11.514369132124589 μs
Lowest time per event 674.45708 μs
ee H Durham
Processed 100 events 16 times
Average time per event 29.18406375 ± 1.3979762370840427 μs
Lowest time per event 28.45625 μs
~/.julia/dev/JetReconstruction/examples/ [naive_findmin] ./benchmark.sh
pp 14TeV Tiled
Processed 100 events 16 times
Average time per event 189.20445375 ± 8.512180852002729 μs
Lowest time per event 177.52208 μs
pp 14 TeV Plain
Processed 100 events 16 times
Average time per event 683.996173125 ± 9.641498982006677 μs
Lowest time per event 674.17833 μs
ee H Durham
Processed 100 events 16 times
Average time per event 30.5012775 ± 0.44837712099670635 μs
Lowest time per event 29.78125 μs
So the performance loss with 1.12 is more significant and we should report that.
I can't think of anything specific that would cause this. In fact, I think the micro benchmark for findmin() shows that 1.12 and nightly are better than 1.11.5, so I suspect there may be a regression somewhere else.
Further supporting evidence is that even with LV.jl, 1.12 is slower, and LV.jl should be independent of the Julia version (I don't believe LLVM regressed)
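For reference, the kind of Base-only micro benchmark being referred to might look like this (a sketch; the array size and iteration count are arbitrary assumptions):

```julia
# Sketch of a findmin micro benchmark using only Base (no BenchmarkTools),
# comparing Base's findmin against a plain scalar loop across Julia versions.
function scalar_findmin(v::Vector{Float64})
    best, idx = v[1], 1
    @inbounds for i in 2:length(v)
        if v[i] < best
            best, idx = v[i], i
        end
    end
    return best, idx
end

v = rand(512)
scalar_findmin(v); findmin(v)  # compile both before timing

t_base   = minimum(@elapsed(findmin(v)) for _ in 1:10_000)
t_scalar = minimum(@elapsed(scalar_findmin(v)) for _ in 1:10_000)
println("Base findmin: $(t_base * 1e9) ns, scalar loop: $(t_scalar * 1e9) ns")
```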
FYI I also want to merge the immutable structs patch, #146, before considering merging this one
Just to note, LoopVectorization seems to be broken with Julia nightly for 2 weeks or so
https://github.com/JuliaSIMD/LoopVectorization.jl/blob/945c2f7b92df094cafe91ec29d56ee2fc29c5726/README.md?plain=1#L17-L19
let's see if this pans out?
Maybe the main work happens in the dependencies, but since the announcement they have had 9 commits, of which only 1 wasn't documentation, CI, or compat bumps
VectorizationBase still has this https://github.com/JuliaSIMD/VectorizationBase.jl/blob/129a0e533202ee3bf1b60a925f74ad153e8bcdd7/README.md?plain=1#L12
Maybe we should get a collecting can and contribute to point 1... 💰 ?
I was experimenting with Reactant.jl and from my understanding it's currently not useful for us as it requires separate compilation for each array size
> as it requires separate compilation for each array size
this is correct
As LoopVectorisation lives on, let's close this PR but keep the branch, in case it's needed later.
Branches of closed PRs can always be restored at any point.
There was another problem in the context of static compilation: LoopVectorization prevents trimming and is also very slow with PackageCompiler
Noted! But right now LoopVectorisation is still the best solution for the Julia repository. I don't care much about PackageCompiler, but understanding / fixing the trimming issue would be nice.