
Use PrecompileTools to compile basic model on build.

Open evetion opened this issue 9 months ago • 5 comments

This should help with latency (#1942) in our executables. With this PR the basic testmodel runs in half a second, and models similar to basic see improvements too (though much smaller).

This sets up a precompile workload using PrecompileTools, which compiles statements of other packages specialized on our types, instead of just Ribasim's own code. It also sets the SciML specialization level to full, which is recommended for simulations.
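For context, such a workload looks roughly like the sketch below, placed at the end of the Ribasim package module. The model path and the Ribasim.run entry point are illustrative assumptions, not necessarily what this PR ships.

using PrecompileTools: @setup_workload, @compile_workload

@setup_workload begin
    # Hypothetical location of the basic test model within the repository.
    toml_path = normpath(@__DIR__, "..", "test", "data", "basic", "ribasim.toml")
    @compile_workload begin
        # Running a small but representative model here records and compiles not only
        # Ribasim's own methods, but also statements in dependencies (SciML solvers,
        # tables IO, ...) specialized on Ribasim's types.
        Ribasim.run(toml_path)
    end
end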

However, since both significantly increase compile time (with the model printing its progress during precompilation), I've disabled this by default for developers.
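One way to make such a workload opt-in is to guard it behind a Preferences.jl flag that the release build flips on. This is a sketch of the general mechanism, with a hypothetical preference name, not necessarily the switch used in this PR.

using Preferences: @load_preference

# Hypothetical preference key, read at precompile time inside the package module.
# Developers keep fast incremental precompilation by default; the build pipeline sets
# precompile_workload = true in LocalPreferences.toml before compiling the executable.
if @load_preference("precompile_workload", false)
    # ... the @setup_workload block from the sketch above ...
end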

I believe that if we fix the type explosion of our Parameters, we will have a fast executable for all models, although ideally we should have a workload that covers all our functionality (arrow, alloc, etc.).

evetion avatar Feb 28 '25 12:02 evetion

Wow, this crashes hard on TC:

Windows

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x0 -- unknown function (ip: 0000000000000000)
in expression starting at C:\Users\svc-teamcity-ansible\.julia\packages\SimpleNonlinearSolve\1h0BO\src\SimpleNonlinearSolve.jl:143
unknown function (ip: 0000000000000000)
Allocations: 38904371 (Pool: 38903382; Big: 989); GC: 30
ERROR: failed process: Process(`'C:\Users\svc-teamcity-ansible\.julia\juliaup\julia-1.11.3+0.x64.w64.mingw32\bin\julia.exe' --color=auto --startup-file=no --pkgimages=no '--cpu-target=generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1)' '--sysimage=D:\buildAgent\temp\buildTmp\jl_nXpMnc\tmp_sys.dll' '--project=D:\buildAgent\work\ecd2b8f9b25b1609\ribasim\core' '--output-o=D:\buildAgent\temp\buildTmp\jl_7EHglq5Wd4-o.a' 'D:\buildAgent\temp\buildTmp\jl_HOzqJyl6d4'`, ProcessExited(1)) [1]

Linux

[1083399] signal 11 (1): Segmentation fault
in expression starting at /u/svc-teamcity-ansible/.julia/packages/SimpleNonlinearSolve/1h0BO/src/SimpleNonlinearSolve.jl:143
unknown function (ip: (nil))
Allocations: 38164904 (Pool: 38163932; Big: 972); GC: 29
ERROR: failed process: Process(`/u/svc-teamcity-ansible/.julia/juliaup/julia-1.11.3+0.x64.linux.gnu/bin/julia --color=auto --startup-file=no --pkgimages=no '--cpu-target=generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1)' --sysimage=/opt/teamcityagent/temp/buildTmp/jl_J234qA/tmp_sys.so --project=/opt/teamcityagent/work/ecd2b8f9b25b1609/ribasim/core --output-o=/opt/teamcityagent/temp/buildTmp/jl_qFglC5m7en-o.a /opt/teamcityagent/temp/buildTmp/jl_QPqpwR412H`, ProcessSignaled(11)) [0]

I think that's the FullSpecialize? Also note that the second run of the testmodel is not instant; if anything, it's slower. @visr, could you build locally and see what happens?
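For reference, the specialization level is chosen when the SciML problem is constructed, so falling back from FullSpecialize only means dropping the explicit type parameter. The snippet below is a generic sketch with a toy right-hand side, not Ribasim's actual problem setup.

using OrdinaryDiffEq, SciMLBase

# Toy in-place RHS standing in for the water-balance equations.
rhs!(du, u, p, t) = (du .= -p .* u)

u0 = [1.0, 2.0]
p = [0.5, 0.3]
tspan = (0.0, 1.0)

# Full specialization: the solver stack is compiled specifically for this RHS type,
# which SciML recommends for long-running simulations, at the cost of compile time.
prob_full = ODEProblem{true, SciMLBase.FullSpecialize}(rhs!, u0, tspan, p)

# Default AutoSpecialize: wraps the RHS to cut latency, at a small runtime cost.
prob_auto = ODEProblem(rhs!, u0, tspan, p)

solve(prob_full, Rosenbrock23())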

evetion avatar Feb 28 '25 14:02 evetion

I'll first finish the state Vector work, which may help here. If that doesn't help, let's try disabling FullSpecialize.

Also discussed with @evetion: if all goes well, we may be able to use only PrecompileTools and drop the PackageCompiler precompilation on top.
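For reference, the PackageCompiler precompilation referred to here is the extra trace generated from a precompile execution file during the build; dropping it would look roughly like the sketch below. Paths and file names are illustrative, not the actual build script.

using PackageCompiler

create_app(
    "ribasim/core",     # package directory (illustrative)
    "build/ribasim";    # output directory for the compiled app (illustrative)
    # If the PrecompileTools workload already covers the hot paths, this extra
    # trace-and-precompile step may become redundant:
    # precompile_execution_file = "build/precompile.jl",
)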

And if PrecompileTools speeds up other models as well, we should always enable it.

visr avatar Feb 28 '25 14:02 visr

Still failing at PackageCompiler creating a sysimage:

15:07:01   Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
15:07:01   Exception: EXCEPTION_ACCESS_VIOLATION at 0x0 -- unknown function (ip: 0000000000000000)
15:07:01   in expression starting at C:\Users\svc-teamcity-ansible\.julia\packages\SimpleNonlinearSolve\1h0BO\src\SimpleNonlinearSolve.jl:143
15:07:01   unknown function (ip: 0000000000000000)
15:07:01   Allocations: 37311783 (Pool: 37310847; Big: 936); GC: 29

This is possibly because with PrecompileTools it compiles more code, leading to a sysimage that is too large?

The line it is failing on is also using PrecompileTools: https://github.com/SciML/NonlinearSolve.jl/blob/v4.4.0/lib/SimpleNonlinearSolve/src/SimpleNonlinearSolve.jl#L143

Sounds like we need to put this on hold for a little longer.

visr avatar Mar 03 '25 20:03 visr

On Linux this seems to work now (even though the build log doesn't look like it), as tested on our cluster:

(base) pronk_mn@v-slurmsub001 ~ $ time ./ribasim/ribasim basic/ribasim.toml
┌ Info: Starting a Ribasim simulation.
│   toml_path = "basic/ribasim.toml"
│   cli.ribasim_version = "2025.1.0"
│   starttime = 2020-01-01T00:00:00
└   endtime = 2021-01-01T00:00:00
┌ Warning: The following experimental features are enabled: concentration
└ @ Ribasim /opt/teamcityagent/work/ecd2b8f9b25b1609/ribasim/core/src/main.jl:42
Simulating 100%|████████████████████████████████████████████████████████████| Time: 0:00:00
┌ Info: Convergence bottlenecks in descending order of severity:
│   LinearResistance #12 = 8.149730746987561e-6
│   Basin #6 = 2.7041589660110494e-6
│   TabulatedRatingCurve #4 = 1.0239692169298505e-6
│   Basin #3 = 1.0238832862484456e-6
└   Basin #1 = 9.825234235602696e-7
[ Info: The model finished successfully.

real	0m1.987s
user	0m1.833s
sys	0m0.211s

@visr If you can confirm this for the Windows build, I'd say this is good to merge.

evetion avatar Mar 04 '25 09:03 evetion

Comparing the CLI from this branch against main, I don't see any speedup from this at all. The basic model takes 1s on both, and the trivial model 16s on both, even though trivial is feature-wise a subset of basic, which hints that perhaps #2127 won't help much at this point either.

That doesn't mean we shouldn't merge it, though; hopefully this will start making a difference with further latency work.

visr avatar Mar 04 '25 11:03 visr

@SouthEndMusic I've set the AD chunk_size to 1 here, which causes the subnetwork test models to become unstable. Could you take a look at why that is?
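For context, the ForwardDiff chunk size determines how many partial derivatives are propagated per sweep, and with ADTypes it is typically selected as below; this is a generic sketch, not the Ribasim code path that was changed.

using ADTypes: AutoForwardDiff

# chunksize = 1 propagates one partial per sweep: the least generated code (and the
# fewest method instances to precompile), but more RHS evaluations per Jacobian.
ad_chunk1 = AutoForwardDiff(; chunksize = 1)

# Leaving chunksize unset lets ForwardDiff pick a heuristic based on the state length.
ad_default = AutoForwardDiff()

In recent OrdinaryDiffEq versions such an object is passed to the solver via its autodiff keyword.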

evetion avatar Jun 18 '25 12:06 evetion

@visr How do you want to move on with this PR? If we want to tackle the CLI latency, we need to build two executables.

evetion avatar Jun 20 '25 08:06 evetion

Perhaps we should focus this PR only on latency for CI and developers and tackle the CLI latency separately? Before introducing two CLIs it would be nice to have more data on the tradeoff, to see if it is worth it at all.

visr avatar Jun 20 '25 08:06 visr

> Perhaps we should focus this PR only on latency for CI and developers and tackle the CLI latency separately? Before introducing two CLIs it would be nice to have more data on the tradeoff, to see if it is worth it at all.

Ok, this is now good to go. Please review :)

evetion avatar Jun 20 '25 13:06 evetion

The Julia testmodels run now takes 10-14 minutes, where it used to take 20-30 minutes.

evetion avatar Jun 20 '25 13:06 evetion