
Adding loggers into TunedModels

pebeto opened this issue 2 years ago • 11 comments

Details in alan-turing-institute/MLJ.jl#1029.

  • Adding parametric type L for loggers (detailed implementation in MLJBase.jl).
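
For context, a rough sketch of what the parametric logger type looks like on the tuned-model struct (illustrative only; the real definitions live in src/tuned_models.jl and MLJBase.jl, and the field list here is abbreviated). The parameter order mirrors the type that appears later in this thread, ProbabilisticTunedModel{RandomSearch, RandomForestClassifier, MLFlowLogger}:

# Illustrative sketch, not the actual source:
mutable struct ProbabilisticTunedModel{T,M,L} <: MLJBase.Probabilistic
    model::M       # the atomic model being tuned
    tuning::T      # tuning strategy, e.g. RandomSearch
    logger::L      # e.g. MLFlowLogger, or Nothing when logging is disabled
    # ... remaining fields (range, resampling, measure, acceleration, n, ...) elided
end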

pebeto avatar Sep 10 '23 19:09 pebeto

Codecov Report

Attention: Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 87.55%. Comparing base (bb59cae) to head (2b63fa8).

Files Patch % Lines
src/tuned_models.jl 90.00% 1 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #193      +/-   ##
==========================================
+ Coverage   87.53%   87.55%   +0.01%     
==========================================
  Files          13       13              
  Lines         666      667       +1     
==========================================
+ Hits          583      584       +1     
  Misses         83       83              

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Sep 10 '23 19:09 codecov[bot]

Looking good, thanks!

Does it all look good on the MLflow service when fitting a TunedModel(model, logger=MLFlowLogger(...), ...)?

ablaom avatar Sep 11 '23 03:09 ablaom

Looking good locally. I've just uploaded the TunedModel test case JuliaAI/MLJFlow.jl@2153b693ba2dcfb09399ea43614485bbef6d3146

pebeto avatar Sep 11 '23 07:09 pebeto

Played around with this some more. Very cool, thanks!

However, there is a problem when running in multithreaded mode. It seems only one thread is logging:

using MLJ
using .Threads
using MLFlowClient
nthreads()
# 5

logger = MLFlowLogger("http://127.0.0.1:5000", experiment_name="horse")
X, y = make_moons()
model = (@load RandomForestClassifier pkg=DecisionTree)()

r = range(model, :sampling_fraction, lower=0.4, upper=1.0)

tmodel = TunedModel(
    model;
    range=r,
    logger,
    acceleration=CPUThreads(),
    n=100,
)

mach = machine(tmodel, X, y) |> fit!;
nruns = length(report(mach).history)
# 100

service = MLJFlow.service(logger)
experiment = MLFlowClient.getexperiment(service, "horse")
id = experiment.experiment_id
runs = MLFlowClient.searchruns(service, id);
length(runs)
# 20

@assert length(runs) == nruns
# ERROR: AssertionError: length(runs) == nruns
# Stacktrace:
#  [1] top-level scope
#    @ REPL[166]:1

ablaom avatar Sep 11 '23 19:09 ablaom

The problem is that the logger is missing from the clone of the resampling machine created here:

https://github.com/pebeto/MLJTuning.jl/blob/6f295b7439a9884fa35c16841ded33db2d272227/src/tuned_models.jl#L590
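
To make the omission concrete, here is a rough sketch of the kind of clone being described (a hypothetical helper, not the code at the linked line; field names follow the Resampler signatures visible in the stack traces below). Each worker rebuilds the resampling machine from the original Resampler's fields, so if the logger is not copied across, only the original machine logs:

# Hedged sketch, not MLJTuning source:
function clone_resampler(resampler)
    Resampler(
        model        = resampler.model,
        resampling   = resampler.resampling,
        measure      = resampler.measure,
        acceleration = resampler.acceleration,
        logger       = resampler.logger,  # the field reported missing above
        # ... other fields elided
    )
end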

ablaom avatar Sep 11 '23 19:09 ablaom

I think CPUProcesses should be fine, but we should add a test for this at MLJFlow.jl (and for CPUThreads).
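
A rough sketch of the kind of MLJFlow.jl test being suggested (hypothetical; the experiment name and server address are placeholders, and the actual test added to MLJFlow.jl may differ), reusing the API already shown in this thread:

using Test, MLJ, MLJFlow, MLFlowClient

logger = MLFlowLogger("http://127.0.0.1:5000", experiment_name="threads-test")
X, y = make_moons()
model = (@load RandomForestClassifier pkg=DecisionTree)()
r = range(model, :sampling_fraction, lower=0.4, upper=1.0)

tmodel = TunedModel(model; range=r, logger, acceleration=CPUThreads(), n=20)
mach = machine(tmodel, X, y) |> fit!

service = MLJFlow.service(logger)
experiment = MLFlowClient.getexperiment(service, "threads-test")
runs = MLFlowClient.searchruns(service, experiment.experiment_id)

# one logged run per evaluation in the tuning history
@test length(runs) == length(report(mach).history)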

ablaom avatar Sep 11 '23 19:09 ablaom

Thanks for the addition. Sadly, this is still not working for me. I'm getting three experiments on the server with different ids and the same name, "horse" (I'm only expecting one). One contains 20 evaluations, the other two contain only 1 each, and this complaint is thrown several times:

    {"error_code": "RESOURCE_ALREADY_EXISTS", "message": "Experiment 'horse' already exists."}""")

Do you have any idea what is happening?

ERROR: TaskFailedException

nested task error: HTTP.Exceptions.StatusError(400, "POST", "/api/2.0/mlflow/experiments/create", HTTP.Messages.Response:
"""
HTTP/1.1 400 Bad Request
Server: gunicorn
Date: Sun, 24 Sep 2023 19:40:45 GMT
Connection: close
Content-Type: application/json
Content-Length: 90

{"error_code": "RESOURCE_ALREADY_EXISTS", "message": "Experiment 'horse' already exists."}""")
Stacktrace:
  [1] mlfpost(mlf::MLFlow, endpoint::String; kwargs::Base.Pairs{Symbol, Union{Missing, Nothing, String}, Tuple{Symbol, Symbol, Symbol}, NamedTuple{(:name, :artifact_location, :tags), Tuple{String, Nothing, Missing}}})
    @ MLFlowClient ~/.julia/packages/MLFlowClient/Szkbv/src/utils.jl:74
  [2] mlfpost
    @ ~/.julia/packages/MLFlowClient/Szkbv/src/utils.jl:66 [inlined]
  [3] createexperiment(mlf::MLFlow; name::String, artifact_location::Nothing, tags::Missing)                                                                                    
    @ MLFlowClient ~/.julia/packages/MLFlowClient/Szkbv/src/experiments.jl:21
  [4] createexperiment
    @ ~/.julia/packages/MLFlowClient/Szkbv/src/experiments.jl:16 [inlined]
  [5] #getorcreateexperiment#7
    @ ~/.julia/packages/MLFlowClient/Szkbv/src/experiments.jl:103 [inlined]
  [6] log_evaluation(logger::MLFlowLogger, performance_evaluation::PerformanceEvaluation{MLJDecisionTreeInterface.RandomForestClassifier, Vector{LogLoss{Float64}}, Vector{Float64}, Vector{typeof(predict)}, Vector{Vector{Float64}}, Vector{Vector{Vector{Float64}}}, Vector{NamedTuple{(:forest,), Tuple{DecisionTree.Ensemble{Float64, UInt32}}}}, Vector{NamedTuple{(:features,), Tuple{Vector{Symbol}}}}, Holdout})
    @ MLJFlow ~/.julia/packages/MLJFlow/TqEtw/src/base.jl:2
  [7] evaluate!(mach::Machine{MLJDecisionTreeInterface.RandomForestClassifier, true}, resampling::Vector{Tuple{Vector{Int64}, Vector{Int64}}}, weights::Nothing, class_weights::Nothing, rows::Nothing, verbosity::Int64, repeats::Int64, measures::Vector{LogLoss{Float64}}, operations::Vector{typeof(predict)}, acceleration::CPU1{Nothing}, force::Bool, logger::MLFlowLogger, user_resampling::Holdout)
    @ MLJBase ~/.julia/packages/MLJBase/ByFwA/src/resampling.jl:1314
  [8] evaluate!(::Machine{MLJDecisionTreeInterface.RandomForestClassifier, true}, ::Holdout, ::Nothing, ::Nothing, ::Nothing, ::Int64, ::Int64, ::Vector{LogLoss{Float64}}, ::Vector{typeof(predict)}, ::CPU1{Nothing}, ::Bool, ::MLFlowLogger, ::Holdout)
    @ MLJBase ~/.julia/packages/MLJBase/ByFwA/src/resampling.jl:1335
  [9] fit(::Resampler{Holdout, MLFlowLogger}, ::Int64, ::Tables.MatrixTable{Matrix{Float64}}, ::CategoricalArrays.CategoricalVector{Int64, UInt32, Int64, CategoricalArrays.CategoricalValue{Int64, UInt32}, Union{}})
    @ MLJBase ~/.julia/packages/MLJBase/ByFwA/src/resampling.jl:1494
 [10] fit_only!(mach::Machine{Resampler{Holdout, MLFlowLogger}, false}; rows::Nothing, verbosity::Int64, force::Bool, composite::Nothing)
    @ MLJBase ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:680
 [11] fit_only!
    @ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:606 [inlined]
 [12] #fit!#63
    @ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:778 [inlined]
 [13] fit!
    @ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:775 [inlined]
 [14] event!(metamodel::MLJDecisionTreeInterface.RandomForestClassifier, resampling_machine::Machine{Resampler{Holdout, MLFlowLogger}, false}, verbosity::Int64, tuning::RandomSearch, history::Nothing, state::Vector{Tuple{Symbol, MLJBase.NumericSampler{Float64, Distributions.Uniform{Float64}, Symbol}}})
    @ MLJTuning ~/MLJ/MLJTuning/src/tuned_models.jl:443
 [15] #46
    @ ~/MLJ/MLJTuning/src/tuned_models.jl:597 [inlined]
 [16] iterate
    @ ./generator.jl:47 [inlined]
 [17] _collect(c::Vector{MLJDecisionTreeInterface.RandomForestClassifier}, itr::Base.Generator{Vector{MLJDecisionTreeInterface.RandomForestClassifier}, MLJTuning.var"#46#50"{Int64, RandomSearch, Nothing, Vector{Tuple{Symbol, MLJBase.NumericSampler{Float64, Distributions.Uniform{Float64}, Symbol}}}, Channel{Bool}, Vector{Machine{Resampler{Holdout, MLFlowLogger}, false}}, Int64}}, #unused#::Base.EltypeUnknown, isz::Base.HasShape{1})
    @ Base ./array.jl:802
 [18] collect_similar
    @ ./array.jl:711 [inlined]
 [19] map
    @ ./abstractarray.jl:3261 [inlined]
 [20] macro expansion
    @ ~/MLJ/MLJTuning/src/tuned_models.jl:596 [inlined]
 [21] (::MLJTuning.var"#45#49"{Vector{MLJDecisionTreeInterface.RandomForestClassifier}, Int64, RandomSearch, Nothing, Vector{Tuple{Symbol, MLJBase.NumericSampler{Float64, Distributions.Uniform{Float64}, Symbol}}}, Channel{Bool}, Vector{Any}, Vector{Machine{Resampler{Holdout, MLFlowLogger}, false}}, UnitRange{Int64}, Int64})()
    @ MLJTuning ./threadingconstructs.jl:373

ablaom avatar Sep 24 '23 19:09 ablaom

Interestingly, I'm getting the same kind of error for acceleration=CPUProcesses() (distributed processing):

using Distributed
addprocs(2)

nprocs()
# 3

using MLJ
using MLFlowClient
logger = MLFlowLogger("http://127.0.0.1:5000", experiment_name="rock")

X, y = make_moons()
model = (@iload RandomForestClassifier pkg=DecisionTree)()

r = range(model, :sampling_fraction, lower=0.4, upper=1.0)

tmodel = TunedModel(
    model;
    range=r,
    logger,
    acceleration=CPUProcesses(),
    n=100,
)

mach = machine(tmodel, X, y) |> fit!;
[ Info: Training machine(ProbabilisticTunedModel(model = RandomForestClassifier(max_depth = -1, …), …), …).
[ Info: Attempting to evaluate 100 models.
      From worker 3:    ┌ Error: Problem fitting the machine machine(Resampler(model = RandomForestClassifier(max_depth = -1, …), …), …). 
      From worker 3:    └ @ MLJBase ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:682
      From worker 3:    [ Info: Running type checks... 
      From worker 3:    [ Info: Type checks okay. 
Evaluating over 100 metamodels:  50%[============>            ]  ETA: 0:00:15┌ Error: Problem fitting the machine machine(ProbabilisticTunedModel(model = RandomForestClassifier(max_depth = -1, …), …), …). 
└ @ MLJBase ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:682
[ Info: Running type checks... 
[ Info: Type checks okay. 
ERROR: TaskFailedException
Stacktrace:
  [1] wait
    @ ./task.jl:349 [inlined]
  [2] fetch
    @ ./task.jl:369 [inlined]
  [3] preduce(reducer::Function, f::Function, R::Vector{MLJDecisionTreeInterface.RandomForestClassifier})                                          
    @ Distributed /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/macros.jl:274
  [4] macro expansion
    @ ~/MLJ/MLJTuning/src/tuned_models.jl:521 [inlined]
  [5] macro expansion
    @ ./task.jl:476 [inlined]
  [6] assemble_events!(metamodels::Vector{MLJDecisionTreeInterface.RandomForestClassifier}, 
resampling_machine::Machine{Resampler{Holdout, MLFlowLogger}, false}, verbosity::Int64, tuning::RandomSearch, history::Nothing, state::Vector{Tuple{Symbol, MLJBase.NumericSampler{Float64, Distributions.Uniform{Float64}, Symbol}}}, acceleration::CPUProcesses{Nothing})
    @ MLJTuning ~/MLJ/MLJTuning/src/tuned_models.jl:502
  [7] build!(history::Nothing, n::Int64, tuning::RandomSearch, model::MLJDecisionTreeInterface.RandomForestClassifier, model_buffer::Channel{Any}, state::Vector{Tuple{Symbol, MLJBase.NumericSampler{Float64, Distributions.Uniform{Float64}, Symbol}}}, verbosity::Int64, acceleration::CPUProcesses{Nothing}, resampling_machine::Machine{Resampler{Holdout, MLFlowLogger}, false})                                                           
    @ MLJTuning ~/MLJ/MLJTuning/src/tuned_models.jl:675
  [8] fit(::MLJTuning.ProbabilisticTunedModel{RandomSearch, MLJDecisionTreeInterface.RandomForestClassifier, MLFlowLogger}, ::Int64, ::Tables.MatrixTable{Matrix{Float64}}, ::CategoricalArrays.CategoricalVector{Int64, UInt32, Int64, CategoricalArrays.CategoricalValue{Int64, UInt32}, Union{}})         
    @ MLJTuning ~/MLJ/MLJTuning/src/tuned_models.jl:756
  [9] fit_only!(mach::Machine{MLJTuning.ProbabilisticTunedModel{RandomSearch, MLJDecisionTreeInterface.RandomForestClassifier, MLFlowLogger}, false}; rows::Nothing, verbosity::Int64, force::Bool, composite::Nothing)                                                            
    @ MLJBase ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:680
 [10] fit_only!
    @ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:606 [inlined]
 [11] #fit!#63
    @ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:778 [inlined]
 [12] fit!
    @ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:775 [inlined]
 [13] |>(x::Machine{MLJTuning.ProbabilisticTunedModel{RandomSearch, MLJDecisionTreeInterface.RandomForestClassifier, MLFlowLogger}, false}, f::typeof(fit!))
    @ Base ./operators.jl:907
 [14] top-level scope
    @ REPL[16]:1

    nested task error: On worker 3:
    HTTP.Exceptions.StatusError(400, "POST", "/api/2.0/mlflow/experiments/create", HTTP.Messages.Response:
    """
    HTTP/1.1 400 Bad Request
    Server: gunicorn
    Date: Sun, 24 Sep 2023 20:07:23 GMT
    Connection: close
    Content-Type: application/json
    Content-Length: 89
    
    {"error_code": "RESOURCE_ALREADY_EXISTS", "message": "Experiment 'rock' already exists."}""")
    Stacktrace:
      [1] #mlfpost#3
        @ ~/.julia/packages/MLFlowClient/Szkbv/src/utils.jl:74
      [2] mlfpost
        @ ~/.julia/packages/MLFlowClient/Szkbv/src/utils.jl:66 [inlined]
      [3] #createexperiment#6
        @ ~/.julia/packages/MLFlowClient/Szkbv/src/experiments.jl:21
      [4] createexperiment
        @ ~/.julia/packages/MLFlowClient/Szkbv/src/experiments.jl:16 [inlined]
      [5] #getorcreateexperiment#7
        @ ~/.julia/packages/MLFlowClient/Szkbv/src/experiments.jl:103 [inlined]
      [6] log_evaluation
        @ ~/.julia/packages/MLJFlow/TqEtw/src/base.jl:2
      [7] evaluate!
        @ ~/.julia/packages/MLJBase/ByFwA/src/resampling.jl:1314
      [8] evaluate!
        @ ~/.julia/packages/MLJBase/ByFwA/src/resampling.jl:1335
      [9] fit
        @ ~/.julia/packages/MLJBase/ByFwA/src/resampling.jl:1494
     [10] #fit_only!#57
        @ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:680
     [11] fit_only!
        @ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:606 [inlined]
     [12] #fit!#63
        @ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:778 [inlined]
     [13] fit!
        @ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:775 [inlined]
     [14] event!
        @ ~/MLJ/MLJTuning/src/tuned_models.jl:443
     [15] macro expansion
        @ ~/MLJ/MLJTuning/src/tuned_models.jl:522 [inlined]
     [16] #39
        @ /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/macros.jl:288
     [17] #invokelatest#2
        @ ./essentials.jl:816
     [18] invokelatest
        @ ./essentials.jl:813
     [19] #110
        @ /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:285
     [20] run_work_thunk
        @ /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:70
     [21] macro expansion
        @ /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:285 [inlined]
     [22] #109
        @ ./task.jl:514
    Stacktrace:
     [1] remotecall_fetch(::Function, ::Distributed.Worker, ::Function, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})                      
       @ Distributed /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:465
     [2] remotecall_fetch(::Function, ::Distributed.Worker, ::Function, ::Vararg{Any})
       @ Distributed /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:454
     [3] #remotecall_fetch#162
       @ /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:492 [inlined]
     [4] remotecall_fetch
       @ /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:492 [inlined]
     [5] (::Distributed.var"#175#176"{typeof(vcat), MLJTuning.var"#39#42"{Machine{Resampler{Holdout, MLFlowLogger}, false}, Int64, RandomSearch, Nothing, Vector{Tuple{Symbol, MLJBase.NumericSampler{Float64, Distributions.Uniform{Float64}, Symbol}}}, RemoteChannel{Channel{Bool}}}, Vector{MLJDecisionTreeInterface.RandomForestClassifier}, Vector{UnitRange{Int64}}, Int64, Int64})()
       @ Distributed /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/macros.jl:270

ablaom avatar Sep 24 '23 20:09 ablaom

Okay, see here for a MWE: https://github.com/JuliaAI/MLFlowClient.jl/issues/40
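
In short, the race is in the get-or-create step: several tasks see the experiment as missing, and all but one then fail with RESOURCE_ALREADY_EXISTS when they try to create it. A hedged guess at the shape of that MWE (hypothetical; see the linked issue for the actual one), using the MLFlowClient API already shown above:

using MLFlowClient, Base.Threads

mlf = MLFlow("http://127.0.0.1:5000")

# get-or-create is not atomic on the server side, so concurrent callers race:
# each may observe "no such experiment" and then hit RESOURCE_ALREADY_EXISTS
# when a sibling task wins the creation.
Threads.@threads for i in 1:5
    getorcreateexperiment(mlf, "race-demo")
end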

ablaom avatar Sep 24 '23 20:09 ablaom

Revisiting this issue after a few months.

It looks like the multithreading issue is not likely to be addressed soon. Perhaps we can proceed with this PR after strictly ruling out logging for the parallel modes. For example, if logger is different from nothing, and either acceleration or acceleration_resampling is different from CPU1(), then clean! resets the accelerations to CPU1() and issues a message saying what it has done and why; a minimal sketch of this guard appears below. The clean! code is here.
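
A minimal sketch of that guard, assuming the MLJ convention that clean! mutates the model and returns a warning string, and assuming the struct exposes logger, acceleration and acceleration_resampling fields (the actual clean! in src/tuned_models.jl will differ):

# Illustrative only; CPU1 comes from ComputationalResources and is re-exported by MLJBase.
function warn_and_reset_acceleration!(model)
    message = ""
    model.logger === nothing && return message
    if !(model.acceleration isa CPU1)
        model.acceleration = CPU1()
        message *= "Logging is not yet supported for parallel acceleration; " *
                   "resetting acceleration=CPU1(). "
    end
    if !(model.acceleration_resampling isa CPU1)
        model.acceleration_resampling = CPU1()
        message *= "Resetting acceleration_resampling=CPU1() for the same reason. "
    end
    return message
end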

@pebeto What do you think?

ablaom avatar Jan 23 '24 19:01 ablaom

A fix for this issue is not on the mlflow roadmap (see https://github.com/mlflow/mlflow/issues/11122). However, https://github.com/JuliaAI/MLJFlow.jl/pull/36 presents a workaround to ensure our process is thread-safe.
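
For reference, one common pattern for making such logging thread-safe (a sketch only; the linked PR should be consulted for the actual MLJFlow.jl approach) is to serialize the MLflow client calls behind a lock:

# Hypothetical wrapper, not MLJFlow.jl source: guard the logging body with a
# lock so concurrent evaluations cannot race on experiment creation.
const MLFLOW_LOCK = ReentrantLock()

function log_evaluation_locked(logger, performance_evaluation)
    lock(MLFLOW_LOCK) do
        log_evaluation(logger, performance_evaluation)  # the existing logging entry point
    end
end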

pebeto avatar Mar 07 '24 18:03 pebeto