
Julia crashes for multithreaded Stack for some non-Julia models

Open ablaom opened this issue 2 years ago • 3 comments

Context: #767 adds support for an option acceleration=CPUThreads() in composite model types defined by "exporting" learning networks, and implements this option for Stack. I have been carrying out MLJ ecosystem integration tests of the new Stack with a large number of models as base models in the stack. If the base model comes from one of the non-Julia packages ScikitLearn.jl, XGBoost.jl, or LIBSVM.jl, and I include CPUThreads() in the testing, then I experience Julia crashes. I have not been able to reliably reproduce the crashes with a "minimal example", but the following seems to do the job on my machine:

using Pkg
Pkg.activate(temp=true)
Pkg.add(
    url="https://github.com/JuliaAI/MLJBase.jl",
    rev="stack_cache_and_acceleration",
)
Pkg.add(
    url = "https://github.com/JuliaAI/MLJTestIntegration.jl",
    rev= "multi-threading",
)
Pkg.add("NearestNeighborModels")
Pkg.add("MLJLIBSVMInterface")
Pkg.add("XGBoost")
Pkg.instantiate()

julia> Pkg.status()
      Status `/private/var/folders/4n/gvbmlhdc8xj973001s6vdyw00000gq/T/jl_wRKoZO/Project.toml`          
  [a7f614a8] MLJBase v0.20.2 `https://github.com/JuliaAI/MLJBase.jl#stack_cache_and_acceleration`        
  [61c7150f] MLJLIBSVMInterface v0.2.0
  [697918b4] MLJTestIntegration v0.1.0 `https://github.com/JuliaAI/MLJTestIntegration.jl#multi-threading`
  [636a865e] NearestNeighborModels v0.2.0
  [009559a3] XGBoost v1.5.2

using MLJBase
using NearestNeighborModels
using MLJLIBSVMInterface
using MLJTestIntegration
using XGBoost

model = EpsilonSVR()

models = (knn1=KNNRegressor(K=4),
          knn2=KNNRegressor(K=6),
          model=model)

metalearner = KNNRegressor()
measure = LPLoss(2)

# mini Boston:
y, X = unpack(MLJBase.load_boston(), ==(:MedV), col->col in [:LStat, :Rm])
data = (X, y)

mystack = Stack(
    ; metalearner,
    resampling=CV(;nfolds=3),
    acceleration=CPUThreads(),
    models...)

julia> MLJTestIntegration.test_single_target_regressors(
    [(name="EpsilonSVR", package_name="LIBSVM"),],
    level=4,
    verbosity=2
)
┌ Info: 
└ Testing EpsilonSVR from LIBSVM
[ Info: [:model_type] Loading model type ✓
[ Info: [:model_instance] Instantiating default model ✓
[ Info: [:fitted_machine] Fitting machine ✓
[ Info: [:operations] Calling `predict`, `transform` and/or `inverse_transform` ✓
[ Info: [evaluation] Evaluating model performance using with 1 resources. ✓
Internal repeatability tests, 50 of 50 trials complete ✓ Repeatable.
[ Info: Testing with 5 threads. 
[ Info: [:accelerated_evaluation] Evaluating model performance using with 2 resources. ✓
[ Info: [:tuned_pipe_evaluation] Evaluating perfomance in a tuned pipeline ✓
[ Info: [:ensemble_prediction] Ensembling ✓
[ Info: [stack_evaluation] Evaluating a stack containing model with 1 resources. ✓

signal (11): Segmentation fault: 11
in expression starting at /Users/anthony/sandbox/crash.jl:43

signal (11): Segmentation fault: 11
in expression starting at /Users/anthony/sandbox/crash.jl:43

signal (11): Segmentation fault: 11
in expression starting at /Users/anthony/sandbox/crash.jl:43

signal (11): Segmentation fault: 11
in expression starting at /Users/anthony/sandbox/crash.jl:43
unknown function (ip: 0x10b82aca3)
Allocations: 279946573 (Pool: 279865905; Big: 80668); GC: 248

signal (11): Segmentation fault: 11
in expression starting at /Users/anthony/sandbox/crash.jl:43
unknown function (ip: 0x10b80f59c)
Allocations: 279946573 (Pool: 279865905; Big: 80668); GC: 248
...

Interestingly, if I remove MLJXGBoostInterface from the environment and delete the using XGBoost line, then there are no issues and the tests pass.

I do not seem to have problems with any pure Julia models.

In attempts to isolate, I have encountered various errors, such as:

OMP: Error #13: Assertion failure at kmp_csupport.cpp(540).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.

signal (6): Abort trap: 6
in expression starting at REPL[2]:1
__pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 105303846 (Pool: 105260636; Big: 43210); GC: 106

julia(70986,0x70000783d000) malloc: *** error for object 0x7ff0725333e0: pointer being freed was not allocated
julia(70986,0x70000783d000) malloc: *** set a breakpoint in malloc_error_break to debug

signal (6): Abort trap: 6
in expression starting at /Users/anthony/sandbox/crash.jl:46

signal (11): Segmentation fault: 11
in expression starting at /Users/anthony/sandbox/crash.jl:46
Allocations: 279191441 (Pool: 279111122; Big: 80319); GC: 222

julia(90542,0x7000079c6000) malloc: Incorrect checksum for freed object 0x7f8da2b121a8: probably modified after being freed.
Corrupt value: 0x7f8da2b1b4c0
julia(90542,0x7000079c6000) malloc: *** set a breakpoint in malloc_error_break to debug

signal (6): Abort trap: 6
in expression starting at /Users/anthony/MLJ/MLJTestIntegration/examples/bigtest/notebook.jl:35

signal (4): Illegal instruction: 4
in expression starting at /Users/anthony/MLJ/MLJTestIntegration/examples/bigtest/notebook.jl:35

I am running with 5 threads.

Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin21.4.0)
  CPU: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_LTS_PATH = /Applications/Julia-1.6.app/Contents/Resources/julia/bin/julia
  JULIA_PATH = /Applications/Julia-1.6.app/Contents/Resources/julia/bin/julia
  JULIA_EGLOT_PATH = /Applications/Julia-1.6.app/Contents/Resources/julia/bin/julia
  JULIA_NUM_THREADS = 5
  JULIA_NIGHTLY_PATH = /Applications/Julia-1.7.app/Contents/Resources/julia/bin/julia
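
For anyone trying to reproduce this, it may help to confirm that the session really is multithreaded before running the tests. A minimal sketch using only Base Julia (the variable names n and ids are illustrative and do not come from MLJBase or MLJTestIntegration):

```julia
using Base.Threads

# Number of threads Julia was started with
# (set via JULIA_NUM_THREADS or `julia -t N`).
n = nthreads()
println("Julia is running with ", n, " thread(s)")

# Record which thread handles each iteration of a threaded loop.
ids = zeros(Int, 2n)
@threads for i in eachindex(ids)
    ids[i] = threadid()
end
println("Thread ids used: ", sort(unique(ids)))
```

If this reports a single thread, CPUThreads() has nothing to parallelise over and the crash is unlikely to show up.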

ablaom avatar Jun 09 '22 03:06 ablaom

Interesting. I get these problems (intermittently) as well on an M1 Mac with non-Julia models (XGBoost, LightGBM, etc.), but in my case they occur when I do cross-validation (calling evaluate) with multi-threading enabled. It is similarly hard for me to generate a minimal example, but I get the same exceptions and segfaults that you do.

pazzo83 avatar Jun 09 '22 03:06 pazzo83

Same thing here. A simple loop with only an SVM in the Stack produces the error on my side, if that helps:


metalearner = EpsilonSVR()
models = (model=EpsilonSVR(),)
mystack = Stack(
    ; metalearner,
    resampling=CV(;nfolds=3),
    cache=false,
    acceleration=CPUThreads(),
    models...)

for i in 1:3
    fitresult, _, _ = fit(mystack, 0, X, y)
end

I noticed LIBSVM also has internal multithreading. Could that be related?

olivierlabayle avatar Jun 14 '22 08:06 olivierlabayle

It appears LIBSVM isn't thread-safe: https://github.com/JuliaML/LIBSVM.jl/issues/60
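
If a wrapped library is not thread-safe, one possible workaround is to serialise calls into it with a lock. The sketch below shows the general pattern in plain Julia. It is an illustration only: unsafe_native_call, SHARED_COUNTER, and NATIVE_LOCK are hypothetical stand-ins and are not part of LIBSVM.jl, MLJLIBSVMInterface, or MLJBase.

```julia
using Base.Threads

# Hypothetical stand-in for a call into a non-thread-safe native library.
# It mutates shared global state without any synchronisation.
const SHARED_COUNTER = Ref(0)
function unsafe_native_call(x)
    SHARED_COUNTER[] += 1   # unsynchronised read-modify-write
    return x^2
end

# One global lock guarding every entry into the native call.
const NATIVE_LOCK = ReentrantLock()

# Wrapper that lets only one task at a time enter the native call.
function safe_native_call(x)
    lock(NATIVE_LOCK) do
        unsafe_native_call(x)
    end
end

results = Vector{Int}(undef, 100)
@threads for i in 1:100
    results[i] = safe_native_call(i)
end

println("calls made: ", SHARED_COUNTER[])
```

The trade-off is that the guarded call no longer runs in parallel, so this only helps when the rest of the threaded work is substantial.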

OkonSamuel avatar Jul 01 '22 11:07 OkonSamuel