Test Enzyme and reexport `ADTypes.AutoEnzyme`
Note: This does not work yet
I opened this PR to make it easier to debug (and possibly fix) issues with Enzyme.
Currently, the following example does ~~not~~ work (note that the snippet does not require the PR which solely reexports AutoEnzyme at this point):
using Turing
using Enzyme
using ADTypes
Enzyme.API.runtimeActivity!(true);
Enzyme.API.typeWarning!(false);
@model function model()
m ~ Normal(0, 1)
s ~ InverseGamma()
x ~ Normal(m, s)
end
sample(model() | (; x=0.5), NUTS(; adtype = ADTypes.AutoEnzyme()), 10)
~~With Enzyme#main my Julia (1.8.1) segfaults. An incomplete (it filled my whole terminal) output: https://gist.github.com/devmotion/1352197f2354c6fecddd7b778ec4bcf7#file-log-txt~~
The example works (latest releases of Turing, Enzyme, and ADTypes on Julia 1.10.0) but the following warnings show up:
warning: didn't implement memmove, using memcpy as fallback which can result in errors
warning: didn't implement memmove, using memcpy as fallback which can result in errors
Pull Request Test Coverage Report for Build 10401483936
Details
- 0 of 0 changed or added relevant lines in 0 files are covered.
- 9 unchanged lines in 1 file lost coverage.
- Overall coverage decreased (-0.6%) to 85.422%
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| src/mcmc/hmc.jl | 9 | 76.43% |
| <!-- | Total: | 9 |
| Totals | |
|---|---|
| Change from base Build 10322735917: | -0.6% |
| Covered Lines: | 1377 |
| Relevant Lines: | 1612 |
💛 - Coveralls
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 85.15%. Comparing base (
07cc40b) to head (2115d52).
Additional details and impacted files
@@ Coverage Diff @@
## master #1887 +/- ##
==========================================
- Coverage 85.77% 85.15% -0.62%
==========================================
Files 24 24
Lines 1617 1617
==========================================
- Hits 1387 1377 -10
- Misses 230 240 +10
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Also if you want to disable the warnings you can set it like so (https://github.com/EnzymeAD/Enzyme.jl/blob/c29e6119c7963ddb22f1363726f762455748e193/src/api.jl#L414 )
Enzyme.API.typeWarning!(false)
You also may want to set the version to 0.11.2 since your CI currently is running at 0.11.0 (⌃ [7da242da] Enzyme v0.11.0)
@devmotion this PR (https://github.com/EnzymeAD/Enzyme.jl/pull/914) should fix the immediate issues you see on CI if you want to try.
I ran CI locally with Enzyme#main but after a few successful Enzyme tests (:tada:) Julia crashed:
[ Info: Testing enzyme
warning: didn't implement memmove, using memcpy as fallback which can result in errors
warning: didn't implement memmove, using memcpy as fallback which can result in errors
warning: didn't implement memmove, using memcpy as fallback which can result in errors
warning: didn't implement memmove, using memcpy as fallback which can result in errors
[ Info: (symbol = :s, exact = 2.0416666666666665, evaluated = 2.1311108881726257)
[ Info: (symbol = :m, exact = 1.1666666666666667, evaluated = 1.2093153528065796)
[ Info: (symbol = :s, exact = 2.0416666666666665, evaluated = 2.008343613152304)
[ Info: (symbol = :m, exact = 1.1666666666666667, evaluated = 1.1630537420596996)
[ Info: (symbol = :s, exact = 2.0416666666666665, evaluated = 1.982602290442361)
[ Info: (symbol = :m, exact = 1.1666666666666667, evaluated = 1.14695052822212)
[ Info: (symbol = :s, exact = 2.0416666666666665, evaluated = 2.046767670482908)
[ Info: (symbol = :m, exact = 1.1666666666666667, evaluated = 1.1588623837104777)
warning: didn't implement memmove, using memcpy as fallback which can result in errors
GC error (probable corruption) :
Allocations: 6315597903 (Pool: 6308179580; Big: 7418323); GC: 10555
!!! ERROR in jl_ -- ABORTING !!!
0x7f60b8885010: Queued root: 0x7f5eda57ad50 :: 0x7f604de8d0d0 (bits: 3)
of type Array{Turing.Inference.GibbsTransition{NamedTuple{(:mu1, :mu2, :z1, :z2, :z3, :z4), Tuple{Tuple{Array{Float64, 1}, Array{String, 1}}, Tuple{Array{Float64, 1}, Array{String, 1}}, Vararg{Tuple{Array{Int64, 1}, Array{String, 1}}, 4}}}, Float64}, 1}
0x7f60b8885028: Queued root: 0x7f5ee20676d0 :: 0x7f60ef038b40 (bits: 3)
of type Core.Box
0x7f60b8885040: Queued root: 0x7f5ee20696c0 :: 0x7f60ef038b40 (bits: 3)
of type Core.Box
0x7f60b8885058: Queued root: 0x7f5efda0f530 :: 0x7f60f051c730 (bits: 3)
of type Task
...
0x7f60b88946d0: Queued root: 0x7f60746341a0 :: 0x7f60f051c730 (bits: 3)
of type Task
0x7f60b88946e8: r-- Stack frame 0x7ffe13383010 -- 186 of 374 (direct)
0x7f60b8894710: `- Object (16bit) 0x7f5ed1b2c710 :: 0x7f5ed9b3ba11 -- [30, 51)
of type Tuple{UInt64, Float64, NamedTuple{(Symbol("1"), Symbol("2")), Tuple{NamedTuple{(Symbol("1"), Symbol("2"), Symbol("3")), Tuple{NTuple{6, NamedTuple{(Symbol("1"), Symbol("2"), Symbol("3"), Symbol("4"), Symbol("5"), Symbol("6"), Symbol("7"), Symbol("8")), NTuple{8, Any}}}, Any, Any}}, Any}}}
[8785] signal (6.-6): Aborted
in expression starting at /home/david/.julia/packages/Turing/PbWOa/test/inference/gibbs.jl:1
__pthread_kill_implementation at /lib64/libc.so.6 (unknown line)
gsignal at /lib64/libc.so.6 (unknown line)
abort at /lib64/libc.so.6 (unknown line)
gc_assert_datatype_fail at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gc.c:1912
gc_mark_loop at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gc.c:3020
_jl_gc_collect at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gc.c:3400
ijl_gc_collect at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gc.c:3707
maybe_collect at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gc.c:1078 [inlined]
jl_gc_pool_alloc_inner at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gc.c:1443 [inlined]
jl_gc_pool_alloc_noinline at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gc.c:1504 [inlined]
jl_gc_alloc_ at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/julia_internal.h:460 [inlined]
jl_gc_alloc at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gc.c:3754
unknown function (ip: 0x7f606c239ee1)
unknown function (ip: 0x7f606c24a793)
Allocations: 6315597903 (Pool: 6308179580; Big: 7418323); GC: 10555
@devmotion bump the version to 0.11.4 to fix one of the issues CI found
@devmotion okay now retry with https://github.com/EnzymeAD/Enzyme.jl/pull/938
I'll bump to 11.5 once it is released :slightly_smiling_face:
@devmotion yeah unfortuantely the fix remedies something in the error handler itself (meaning that the actual error would still be hidden). If you can let me know what that is I can try to get that out with the next patch bump too
@devmotion I've told the julia registrar bot to start the 0.11.5 Enzyme tag. Once that lands, want to give it another go?
2023-07-13T20:41:25.5436080Z No create nofree of empty function memmove
2023-07-13T20:41:25.5436226Z declare i64 @memmove(i64, i64, i64) local_unnamed_addr
Well that should be an easy fix. @devmotion want to make a PR to add that to this list: https://github.com/EnzymeAD/Enzyme.jl/blob/82b12a7158e158755abb348d7c914607f9ed8b57/src/compiler.jl#L124
@devmotion since I think doing a single fix per release is probably too slow to get this happy in a reasonable time, would you consider rerunning this on Enzyme#main to identify (and post an issue for) whatever issue comes next?
That way we can get this through more quickly than getting bottlenecked by releases.
cc @sethaxen who may be able to assist with running Turing and Enzyme.jl on main
The interaction of AD with particle MCMC might cause issues due to the complex old VarInfo design (hopefully, we will finish the migration to SimpleVarInfo for Gibbs in the next few weeks), varying dimensionality, etc. We could skip Gibbs-related tests before we iron out other bugs.
In addition, it might be a good idea to add Enzyme tests to DistributionsAD for more unit tests. It is much easier to debug DistributionsAD than Turing, and all bugs discovered there will help Turing and other Julia libraries.
cc @torfjelde
I added Enzyme tests to DistributionsAD a while ago (but maybe only in a local branch it seems?) but at that point there were too many test failures that I could not decipher. Likely a lot better these days.
I added Enzyme tests to DistributionsAD a while ago (but maybe only in a local branch it seems?) but at that point there were too many test failures that I could not decipher. Likely a lot better these days.
Maybe open a PR to track the CI results?
@devmotion can you hit it with another re-run on main ?
@devmotion it looks like the function its tripping over is lowered as julia__simplex_inv_bijector__71959 if that gives any insight as to where that comes from
Based on the name, I assume that's https://github.com/TuringLang/Bijectors.jl/blob/9cd59070871cc7a29df0e401a24a08502241b230/src/bijectors/simplex.jl#L84 or https://github.com/TuringLang/Bijectors.jl/blob/9cd59070871cc7a29df0e401a24a08502241b230/src/bijectors/simplex.jl#L102. Can you see any immediate issues for Enzyme (apart from that it's a bit "ugly" 😅)?
Can you try https://github.com/EnzymeAD/Enzyme.jl/pull/962 and see if that fixes?
@devmotion now with the more advanced version of addr13 landed on main, can you retry main?
@devmotion rerunning main will now reduce the errors, but at one point also we should consider merging a version with some tests marked as broken, and you opening distinct issues for those.
Marking tests as broken won't be sufficient as long as tests cause Julia segfaults. But of course we could just not test some models and samplers with Enzyme.
The unfortunate consequence is that this means we only have some experimental support for Enzyme. The models and samplers in the tests are neither very complicated nor challenging IIRC, so it seems likely that users also experience these issues and segfaults. If we would understand what's the issue with the failing tests, we could maybe at least teach users about which types of models, samplers, or code patterns are problematic (similar to how indexing with Zygote leads to poor performance or tape compilation with ReverseDiff can yield incorrect results if the models contain run-time branches). I think users would prefer such guidelines because they would give a better prior for whether a model is Enzyme-compatible or causes segfaults and they would make it easier to write Enzyme-compatible Turing models (in particular since I assume for users Enzyme stacktraces and segfaults are even more intimidating than (already challenging) stacktraces of other AD backends).
Yeah that's why I've been trying to eliminate all error types that would segfault (if you wouldnt mind re-running main to see if that succeeded).
But yeah since there's a lot of tests it would be helpful to isolate which cause issues and post them as individual issues to be debugged, hence the proposal above.
Out of curiosity, does anyone have an idea (or just a wild guess) about the magnitude of potential or expected performance improvements that this could bring?
EDIT: in case others are interested, this example reports a speedup from 29.35 seconds (Zygote) to 0.69 s (Enzyme)... I try to manage my expectations but damn that's promising 😍🚀
Yes, there's a performance comparison in the linked issue in the Enzyme repo.
I am also concerned about segfaults when considering merging this PR. I understand the need to isolate the issues to debug and improve Enzyme.
The DistributionsAD package contains a large set of unit tests for various probability density functions from the Distributions package. So I think this PR is a better place to measure Enzyme's current robustness—ideally, all tests there should pass before merging this PR.
Oh for sure, but at this point I've gone through and fixed what I saw as any obvious things that can be fixed through the logs. So at this point any fixes will require opening an issue with the specific (ideally minimal) code that trigegrs the error.
If @yebai or @devmotion are able to isolate whatever code snippets here are creating segfaults into issues on Enzyme.jl, I can fix them, and then cut a release with the fixes. Without issues opened on what triggers the failure, I'm not able to fix them =/.
@devmotion @yebai can you trigger a test re-run to see where this is at. I'd do it myself like I did for DistributionsAD.jl, but I don't seem to have permission here