Turing.jl icon indicating copy to clipboard operation
Turing.jl copied to clipboard

Test Enzyme and reexport `ADTypes.AutoEnzyme`

Open devmotion opened this issue 3 years ago • 107 comments

Note: This does not work yet


I opened this PR to make it easier to debug (and possibly fix) issues with Enzyme.

Currently, the following example does ~~not~~ work (note that the snippet does not require the PR which solely reexports AutoEnzyme at this point):

using Turing
using Enzyme
using ADTypes
Enzyme.API.runtimeActivity!(true);
Enzyme.API.typeWarning!(false);

@model function model()
    m ~ Normal(0, 1)
    s ~ InverseGamma()
    x ~ Normal(m, s)
end

sample(model() | (; x=0.5), NUTS(; adtype = ADTypes.AutoEnzyme()), 10)

~~With Enzyme#main my Julia (1.8.1) segfaults. An incomplete (it filled my whole terminal) output: https://gist.github.com/devmotion/1352197f2354c6fecddd7b778ec4bcf7#file-log-txt~~

The example works (latest releases of Turing, Enzyme, and ADTypes on Julia 1.10.0) but the following warnings show up:

warning: didn't implement memmove, using memcpy as fallback which can result in errors
warning: didn't implement memmove, using memcpy as fallback which can result in errors

devmotion avatar Sep 28 '22 10:09 devmotion

Pull Request Test Coverage Report for Build 10401483936

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 9 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.6%) to 85.422%

Files with Coverage Reduction New Missed Lines %
src/mcmc/hmc.jl 9 76.43%
<!-- Total: 9
Totals Coverage Status
Change from base Build 10322735917: -0.6%
Covered Lines: 1377
Relevant Lines: 1612

💛 - Coveralls

coveralls avatar Nov 13 '22 02:11 coveralls

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 85.15%. Comparing base (07cc40b) to head (2115d52).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1887      +/-   ##
==========================================
- Coverage   85.77%   85.15%   -0.62%     
==========================================
  Files          24       24              
  Lines        1617     1617              
==========================================
- Hits         1387     1377      -10     
- Misses        230      240      +10     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Nov 13 '22 02:11 codecov[bot]

Also if you want to disable the warnings you can set it like so (https://github.com/EnzymeAD/Enzyme.jl/blob/c29e6119c7963ddb22f1363726f762455748e193/src/api.jl#L414 )

Enzyme.API.typeWarning!(false)

wsmoses avatar Jun 26 '23 18:06 wsmoses

You also may want to set the version to 0.11.2 since your CI currently is running at 0.11.0 (⌃ [7da242da] Enzyme v0.11.0)

wsmoses avatar Jun 26 '23 18:06 wsmoses

@devmotion this PR (https://github.com/EnzymeAD/Enzyme.jl/pull/914) should fix the immediate issues you see on CI if you want to try.

wsmoses avatar Jun 27 '23 05:06 wsmoses

I ran CI locally with Enzyme#main but after a few successful Enzyme tests (:tada:) Julia crashed:

[ Info: Testing enzyme
warning: didn't implement memmove, using memcpy as fallback which can result in errors
warning: didn't implement memmove, using memcpy as fallback which can result in errors
warning: didn't implement memmove, using memcpy as fallback which can result in errors
warning: didn't implement memmove, using memcpy as fallback which can result in errors
[ Info: (symbol = :s, exact = 2.0416666666666665, evaluated = 2.1311108881726257)
[ Info: (symbol = :m, exact = 1.1666666666666667, evaluated = 1.2093153528065796)
[ Info: (symbol = :s, exact = 2.0416666666666665, evaluated = 2.008343613152304)
[ Info: (symbol = :m, exact = 1.1666666666666667, evaluated = 1.1630537420596996)
[ Info: (symbol = :s, exact = 2.0416666666666665, evaluated = 1.982602290442361)
[ Info: (symbol = :m, exact = 1.1666666666666667, evaluated = 1.14695052822212)
[ Info: (symbol = :s, exact = 2.0416666666666665, evaluated = 2.046767670482908)
[ Info: (symbol = :m, exact = 1.1666666666666667, evaluated = 1.1588623837104777)
warning: didn't implement memmove, using memcpy as fallback which can result in errors
GC error (probable corruption) :
Allocations: 6315597903 (Pool: 6308179580; Big: 7418323); GC: 10555

!!! ERROR in jl_ -- ABORTING !!!
0x7f60b8885010: Queued root: 0x7f5eda57ad50 :: 0x7f604de8d0d0 (bits: 3)
        of type Array{Turing.Inference.GibbsTransition{NamedTuple{(:mu1, :mu2, :z1, :z2, :z3, :z4), Tuple{Tuple{Array{Float64, 1}, Array{String, 1}}, Tuple{Array{Float64, 1}, Array{String, 1}}, Vararg{Tuple{Array{Int64, 1}, Array{String, 1}}, 4}}}, Float64}, 1}
0x7f60b8885028: Queued root: 0x7f5ee20676d0 :: 0x7f60ef038b40 (bits: 3)
        of type Core.Box
0x7f60b8885040: Queued root: 0x7f5ee20696c0 :: 0x7f60ef038b40 (bits: 3)
        of type Core.Box
0x7f60b8885058: Queued root: 0x7f5efda0f530 :: 0x7f60f051c730 (bits: 3)
        of type Task
...
0x7f60b88946d0: Queued root: 0x7f60746341a0 :: 0x7f60f051c730 (bits: 3)
        of type Task
0x7f60b88946e8:  r-- Stack frame 0x7ffe13383010 -- 186 of 374 (direct)
0x7f60b8894710:   `- Object (16bit) 0x7f5ed1b2c710 :: 0x7f5ed9b3ba11 -- [30, 51)
        of type Tuple{UInt64, Float64, NamedTuple{(Symbol("1"), Symbol("2")), Tuple{NamedTuple{(Symbol("1"), Symbol("2"), Symbol("3")), Tuple{NTuple{6, NamedTuple{(Symbol("1"), Symbol("2"), Symbol("3"), Symbol("4"), Symbol("5"), Symbol("6"), Symbol("7"), Symbol("8")), NTuple{8, Any}}}, Any, Any}}, Any}}}

[8785] signal (6.-6): Aborted
in expression starting at /home/david/.julia/packages/Turing/PbWOa/test/inference/gibbs.jl:1
__pthread_kill_implementation at /lib64/libc.so.6 (unknown line)
gsignal at /lib64/libc.so.6 (unknown line)
abort at /lib64/libc.so.6 (unknown line)
gc_assert_datatype_fail at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gc.c:1912
gc_mark_loop at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gc.c:3020
_jl_gc_collect at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gc.c:3400
ijl_gc_collect at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gc.c:3707
maybe_collect at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gc.c:1078 [inlined]
jl_gc_pool_alloc_inner at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gc.c:1443 [inlined]
jl_gc_pool_alloc_noinline at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gc.c:1504 [inlined]
jl_gc_alloc_ at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/julia_internal.h:460 [inlined]
jl_gc_alloc at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gc.c:3754
unknown function (ip: 0x7f606c239ee1)
unknown function (ip: 0x7f606c24a793)
Allocations: 6315597903 (Pool: 6308179580; Big: 7418323); GC: 10555

devmotion avatar Jun 27 '23 20:06 devmotion

@devmotion bump the version to 0.11.4 to fix one of the issues CI found

wsmoses avatar Jul 07 '23 20:07 wsmoses

@devmotion okay now retry with https://github.com/EnzymeAD/Enzyme.jl/pull/938

wsmoses avatar Jul 07 '23 21:07 wsmoses

I'll bump to 11.5 once it is released :slightly_smiling_face:

devmotion avatar Jul 07 '23 21:07 devmotion

@devmotion yeah unfortuantely the fix remedies something in the error handler itself (meaning that the actual error would still be hidden). If you can let me know what that is I can try to get that out with the next patch bump too

wsmoses avatar Jul 07 '23 22:07 wsmoses

@devmotion I've told the julia registrar bot to start the 0.11.5 Enzyme tag. Once that lands, want to give it another go?

wsmoses avatar Jul 13 '23 16:07 wsmoses

2023-07-13T20:41:25.5436080Z   No create nofree of empty function memmove
2023-07-13T20:41:25.5436226Z   declare i64 @memmove(i64, i64, i64) local_unnamed_addr

Well that should be an easy fix. @devmotion want to make a PR to add that to this list: https://github.com/EnzymeAD/Enzyme.jl/blob/82b12a7158e158755abb348d7c914607f9ed8b57/src/compiler.jl#L124

wsmoses avatar Jul 13 '23 21:07 wsmoses

@devmotion since I think doing a single fix per release is probably too slow to get this happy in a reasonable time, would you consider rerunning this on Enzyme#main to identify (and post an issue for) whatever issue comes next?

That way we can get this through more quickly than getting bottlenecked by releases.

wsmoses avatar Jul 14 '23 19:07 wsmoses

cc @sethaxen who may be able to assist with running Turing and Enzyme.jl on main

wsmoses avatar Jul 14 '23 19:07 wsmoses

The interaction of AD with particle MCMC might cause issues due to the complex old VarInfo design (hopefully, we will finish the migration to SimpleVarInfo for Gibbs in the next few weeks), varying dimensionality, etc. We could skip Gibbs-related tests before we iron out other bugs.

In addition, it might be a good idea to add Enzyme tests to DistributionsAD for more unit tests. It is much easier to debug DistributionsAD than Turing, and all bugs discovered there will help Turing and other Julia libraries.

cc @torfjelde

yebai avatar Jul 15 '23 08:07 yebai

I added Enzyme tests to DistributionsAD a while ago (but maybe only in a local branch it seems?) but at that point there were too many test failures that I could not decipher. Likely a lot better these days.

devmotion avatar Jul 15 '23 08:07 devmotion

I added Enzyme tests to DistributionsAD a while ago (but maybe only in a local branch it seems?) but at that point there were too many test failures that I could not decipher. Likely a lot better these days.

Maybe open a PR to track the CI results?

yebai avatar Jul 15 '23 08:07 yebai

@devmotion can you hit it with another re-run on main ?

wsmoses avatar Jul 23 '23 16:07 wsmoses

@devmotion it looks like the function its tripping over is lowered as julia__simplex_inv_bijector__71959 if that gives any insight as to where that comes from

wsmoses avatar Jul 24 '23 03:07 wsmoses

Based on the name, I assume that's https://github.com/TuringLang/Bijectors.jl/blob/9cd59070871cc7a29df0e401a24a08502241b230/src/bijectors/simplex.jl#L84 or https://github.com/TuringLang/Bijectors.jl/blob/9cd59070871cc7a29df0e401a24a08502241b230/src/bijectors/simplex.jl#L102. Can you see any immediate issues for Enzyme (apart from that it's a bit "ugly" 😅)?

devmotion avatar Jul 24 '23 07:07 devmotion

Can you try https://github.com/EnzymeAD/Enzyme.jl/pull/962 and see if that fixes?

wsmoses avatar Jul 24 '23 08:07 wsmoses

@devmotion now with the more advanced version of addr13 landed on main, can you retry main?

wsmoses avatar Jul 27 '23 18:07 wsmoses

@devmotion rerunning main will now reduce the errors, but at one point also we should consider merging a version with some tests marked as broken, and you opening distinct issues for those.

wsmoses avatar Jul 27 '23 22:07 wsmoses

Marking tests as broken won't be sufficient as long as tests cause Julia segfaults. But of course we could just not test some models and samplers with Enzyme.

The unfortunate consequence is that this means we only have some experimental support for Enzyme. The models and samplers in the tests are neither very complicated nor challenging IIRC, so it seems likely that users also experience these issues and segfaults. If we would understand what's the issue with the failing tests, we could maybe at least teach users about which types of models, samplers, or code patterns are problematic (similar to how indexing with Zygote leads to poor performance or tape compilation with ReverseDiff can yield incorrect results if the models contain run-time branches). I think users would prefer such guidelines because they would give a better prior for whether a model is Enzyme-compatible or causes segfaults and they would make it easier to write Enzyme-compatible Turing models (in particular since I assume for users Enzyme stacktraces and segfaults are even more intimidating than (already challenging) stacktraces of other AD backends).

devmotion avatar Jul 28 '23 18:07 devmotion

Yeah that's why I've been trying to eliminate all error types that would segfault (if you wouldnt mind re-running main to see if that succeeded).

But yeah since there's a lot of tests it would be helpful to isolate which cause issues and post them as individual issues to be debugged, hence the proposal above.

wsmoses avatar Jul 28 '23 20:07 wsmoses

Out of curiosity, does anyone have an idea (or just a wild guess) about the magnitude of potential or expected performance improvements that this could bring?


EDIT: in case others are interested, this example reports a speedup from 29.35 seconds (Zygote) to 0.69 s (Enzyme)... I try to manage my expectations but damn that's promising 😍🚀

DominiqueMakowski avatar Jul 29 '23 09:07 DominiqueMakowski

Yes, there's a performance comparison in the linked issue in the Enzyme repo.

devmotion avatar Jul 29 '23 09:07 devmotion

I am also concerned about segfaults when considering merging this PR. I understand the need to isolate the issues to debug and improve Enzyme.

The DistributionsAD package contains a large set of unit tests for various probability density functions from the Distributions package. So I think this PR is a better place to measure Enzyme's current robustness—ideally, all tests there should pass before merging this PR.

yebai avatar Aug 01 '23 08:08 yebai

Oh for sure, but at this point I've gone through and fixed what I saw as any obvious things that can be fixed through the logs. So at this point any fixes will require opening an issue with the specific (ideally minimal) code that trigegrs the error.

If @yebai or @devmotion are able to isolate whatever code snippets here are creating segfaults into issues on Enzyme.jl, I can fix them, and then cut a release with the fixes. Without issues opened on what triggers the failure, I'm not able to fix them =/.

wsmoses avatar Aug 01 '23 19:08 wsmoses

@devmotion @yebai can you trigger a test re-run to see where this is at. I'd do it myself like I did for DistributionsAD.jl, but I don't seem to have permission here

wsmoses avatar Aug 07 '23 20:08 wsmoses