Adding loggers into TunedModels
Details in alan-turing-institute/MLJ.jl#1029.
- Adding a parametric type for loggers (detailed implementation in MLJBase.jl).
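To illustrate the idea, here is a rough sketch of a parametric logger field (the type names below are hypothetical stand-ins; the real definitions live in MLJTuning/MLJBase). Parametrizing the struct on the logger's type keeps it concretely typed whether `logger` is `nothing` or an actual logger object:

```julia
# Sketch only: hypothetical stand-ins for the real MLJTuning structs.
struct DummyLogger
    baseurl::String
end

struct TunedModelSketch{M,L}
    model::M
    logger::L   # L == Nothing when no logger is supplied
end

# Keyword constructor defaulting to no logger, as in `TunedModel(...)`.
TunedModelSketch(model; logger=nothing) = TunedModelSketch(model, logger)
```

With this pattern, `TunedModelSketch("tree")` has type `TunedModelSketch{String,Nothing}`, while supplying a logger specializes the second parameter instead of widening the field to `Any`.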
Codecov Report
Attention: Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.
Project coverage is 87.55%. Comparing base (bb59cae) to head (2b63fa8).
| Files | Patch % | Lines |
|---|---|---|
| src/tuned_models.jl | 90.00% | 1 Missing :warning: |
Additional details and impacted files
```diff
@@ Coverage Diff @@
##              dev     #193      +/-   ##
==========================================
+ Coverage   87.53%   87.55%   +0.01%
==========================================
  Files          13       13
  Lines         666      667       +1
==========================================
+ Hits          583      584       +1
  Misses         83       83
```
:umbrella: View full report in Codecov by Sentry.
Looking good, thanks!
Does it all look good on the MLflow service when fitting a TunedModel(model, logger=MLFlowLogger(...), ...)?
Looking good locally. I've just uploaded the `TunedModel` test case: JuliaAI/MLJFlow.jl@2153b693ba2dcfb09399ea43614485bbef6d3146
Played around with this some more. Very cool, thanks!
However, there is a problem when running in multithreaded mode. It seems only one thread is logging:
```julia
using MLJ
using .Threads
using MLFlowClient
using MLJFlow

nthreads()
# 5

logger = MLFlowLogger("http://127.0.0.1:5000", experiment_name="horse")

X, y = make_moons()
model = (@load RandomForestClassifier pkg=DecisionTree)()
r = range(model, :sampling_fraction, lower=0.4, upper=1.0)
tmodel = TunedModel(
    model;
    range=r,
    logger,
    acceleration=CPUThreads(),
    n=100,
)
mach = machine(tmodel, X, y) |> fit!;

nruns = length(report(mach).history)
# 100

service = MLJFlow.service(logger)
experiment = MLFlowClient.getexperiment(service, "horse")
id = experiment.experiment_id
runs = MLFlowClient.searchruns(service, id);
length(runs)
# 20

@assert length(runs) == nruns
# ERROR: AssertionError: length(runs) == nruns
# Stacktrace:
#  [1] top-level scope
#    @ REPL[166]:1
```
The problem is that we are missing the `logger` in the cloning of the resampling machine that happens here:
https://github.com/pebeto/MLJTuning.jl/blob/6f295b7439a9884fa35c16841ded33db2d272227/src/tuned_models.jl#L590
I think `CPUProcesses` should be fine, but we should add a test for this at MLJFlow.jl (and for `CPUThreads`).
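To make the bug class concrete with a toy example (hypothetical types, not the actual `Resampler` code): when a wrapper is cloned for each thread, any field left out of the copy is silently lost in the clones, so only the original machine ends up logging.

```julia
# Toy illustration of the missing-field-in-clone bug (hypothetical types).
struct ResamplerSketch{L}
    model::String
    logger::L
end

# Buggy clone: forgets to carry the logger over, so per-thread copies
# have `logger === nothing` and never log.
clone_buggy(r::ResamplerSketch) = ResamplerSketch(r.model, nothing)

# Fixed clone: propagates the logger into every per-thread copy.
clone_fixed(r::ResamplerSketch) = ResamplerSketch(r.model, r.logger)
```

This matches the observed symptom exactly: the original machine logged its 20 holdout evaluations, while the cloned machines on the other threads logged nothing.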
Thanks for the addition. Sadly, this is still not working for me. I'm getting three experiments on the server with different ids and the same name, "horse" (I'm only expecting one). One contains 20 evaluations, the other two contain only 1 each, and this complaint is thrown several times:
`{"error_code": "RESOURCE_ALREADY_EXISTS", "message": "Experiment 'horse' already exists."}`
Do you have any idea what is happening?
ERROR: TaskFailedException
nested task error: HTTP.Exceptions.StatusError(400, "POST", "/api/2.0/mlflow/experiments/create", HTTP.Messages.Response:
"""
HTTP/1.1 400 Bad Request
Server: gunicorn
Date: Sun, 24 Sep 2023 19:40:45 GMT
Connection: close
Content-Type: application/json
Content-Length: 90
{"error_code": "RESOURCE_ALREADY_EXISTS", "message": "Experiment 'horse' already exists."}""")
Stacktrace:
[1] mlfpost(mlf::MLFlow, endpoint::String; kwargs::Base.Pairs{Symbol, Union{Missing, Nothing, String}, Tuple{Symbol, Symbol, Symbol}, NamedTuple{(:name, :artifact_location, :tags), Tuple{String, Nothing, Missing}}})
@ MLFlowClient ~/.julia/packages/MLFlowClient/Szkbv/src/utils.jl:74
[2] mlfpost
@ ~/.julia/packages/MLFlowClient/Szkbv/src/utils.jl:66 [inlined]
[3] createexperiment(mlf::MLFlow; name::String, artifact_location::Nothing, tags::Missing)
@ MLFlowClient ~/.julia/packages/MLFlowClient/Szkbv/src/experiments.jl:21
[4] createexperiment
@ ~/.julia/packages/MLFlowClient/Szkbv/src/experiments.jl:16 [inlined]
[5] #getorcreateexperiment#7
@ ~/.julia/packages/MLFlowClient/Szkbv/src/experiments.jl:103 [inlined]
[6] log_evaluation(logger::MLFlowLogger, performance_evaluation::PerformanceEvaluation{MLJDecisionTreeInterface.RandomForestClassifier, Vector{LogLoss{Float64}}, Vector{Float64}, Vector{typeof(predict)}, Vector{Vector{Float64}}, Vector{Vector{Vector{Float64}}}, Vector{NamedTuple{(:forest,), Tuple{DecisionTree.Ensemble{Float64, UInt32}}}}, Vector{NamedTuple{(:features,), Tuple{Vector{Symbol}}}}, Holdout})
@ MLJFlow ~/.julia/packages/MLJFlow/TqEtw/src/base.jl:2
[7] evaluate!(mach::Machine{MLJDecisionTreeInterface.RandomForestClassifier, true}, resampling::Vector{Tuple{Vector{Int64}, Vector{Int64}}}, weights::Nothing, class_weights::Nothing, rows::Nothing, verbosity::Int64, repeats::Int64, measures::Vector{LogLoss{Float64}}, operations::Vector{typeof(predict)}, acceleration::CPU1{Nothing}, force::Bool, logger::MLFlowLogger, user_resampling::Holdout)
@ MLJBase ~/.julia/packages/MLJBase/ByFwA/src/resampling.jl:1314
[8] evaluate!(::Machine{MLJDecisionTreeInterface.RandomForestClassifier, true}, ::Holdout, ::Nothing, ::Nothing, ::Nothing, ::Int64, ::Int64, ::Vector{LogLoss{Float64}}, ::Vector{typeof(predict)}, ::CPU1{Nothing}, ::Bool, ::MLFlowLogger, ::Holdout)
@ MLJBase ~/.julia/packages/MLJBase/ByFwA/src/resampling.jl:1335
[9] fit(::Resampler{Holdout, MLFlowLogger}, ::Int64, ::Tables.MatrixTable{Matrix{Float64}}, ::CategoricalArrays.CategoricalVector{Int64, UInt32, Int64, CategoricalArrays.CategoricalValue{Int64, UInt32}, Union{}})
@ MLJBase ~/.julia/packages/MLJBase/ByFwA/src/resampling.jl:1494
[10] fit_only!(mach::Machine{Resampler{Holdout, MLFlowLogger}, false}; rows::Nothing, verbosity::Int64, force::Bool, composite::Nothing)
@ MLJBase ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:680
[11] fit_only!
@ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:606 [inlined]
[12] #fit!#63
@ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:778 [inlined]
[13] fit!
@ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:775 [inlined]
[14] event!(metamodel::MLJDecisionTreeInterface.RandomForestClassifier, resampling_machine::Machine{Resampler{Holdout, MLFlowLogger}, false}, verbosity::Int64, tuning::RandomSearch, history::Nothing, state::Vector{Tuple{Symbol, MLJBase.NumericSampler{Float64, Distributions.Uniform{Float64}, Symbol}}})
@ MLJTuning ~/MLJ/MLJTuning/src/tuned_models.jl:443
[15] #46
@ ~/MLJ/MLJTuning/src/tuned_models.jl:597 [inlined]
[16] iterate
@ ./generator.jl:47 [inlined]
[17] _collect(c::Vector{MLJDecisionTreeInterface.RandomForestClassifier}, itr::Base.Generator{Vector{MLJDecisionTreeInterface.RandomForestClassifier}, MLJTuning.var"#46#50"{Int64, RandomSearch, Nothing, Vector{Tuple{Symbol, MLJBase.NumericSampler{Float64, Distributions.Uniform{Float64}, Symbol}}}, Channel{Bool}, Vector{Machine{Resampler{Holdout, MLFlowLogger}, false}}, Int64}}, #unused#::Base.EltypeUnknown, isz::Base.HasShape{1})
@ Base ./array.jl:802
[18] collect_similar
@ ./array.jl:711 [inlined]
[19] map
@ ./abstractarray.jl:3261 [inlined]
[20] macro expansion
@ ~/MLJ/MLJTuning/src/tuned_models.jl:596 [inlined]
[21] (::MLJTuning.var"#45#49"{Vector{MLJDecisionTreeInterface.RandomForestClassifier}, Int64, RandomSearch, Nothing, Vector{Tuple{Symbol, MLJBase.NumericSampler{Float64, Distributions.Uniform{Float64}, Symbol}}}, Channel{Bool}, Vector{Any}, Vector{Machine{Resampler{Holdout, MLFlowLogger}, false}}, UnitRange{Int64}, Int64})()
@ MLJTuning ./threadingconstructs.jl:373
Interestingly, I'm getting the same kind of error for distributed mode (`acceleration=CPUProcesses()`):
```julia
using Distributed
addprocs(2)
nprocs()
# 3

using MLJ
using MLFlowClient

logger = MLFlowLogger("http://127.0.0.1:5000", experiment_name="rock")

X, y = make_moons()
model = (@iload RandomForestClassifier pkg=DecisionTree)()
r = range(model, :sampling_fraction, lower=0.4, upper=1.0)
tmodel = TunedModel(
    model;
    range=r,
    logger,
    acceleration=CPUProcesses(),
    n=100,
)
mach = machine(tmodel, X, y) |> fit!;
```
[ Info: Training machine(ProbabilisticTunedModel(model = RandomForestClassifier(max_depth = -1, …), …), …).
[ Info: Attempting to evaluate 100 models.
From worker 3: ┌ Error: Problem fitting the machine machine(Resampler(model = RandomForestClassifier(max_depth = -1, …), …), …).
From worker 3: └ @ MLJBase ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:682
From worker 3: [ Info: Running type checks...
From worker 3: [ Info: Type checks okay.
Evaluating over 100 metamodels: 50%[============>             ]  ETA: 0:00:15┌ Error: Problem fitting the machine machine(ProbabilisticTunedModel(model = RandomForestClassifier(max_depth = -1, …), …), …).
└ @ MLJBase ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:682
[ Info: Running type checks...
[ Info: Type checks okay.
ERROR: TaskFailedException
Stacktrace:
[1] wait
@ ./task.jl:349 [inlined]
[2] fetch
@ ./task.jl:369 [inlined]
[3] preduce(reducer::Function, f::Function, R::Vector{MLJDecisionTreeInterface.RandomForestClassifier})
@ Distributed /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/macros.jl:274
[4] macro expansion
@ ~/MLJ/MLJTuning/src/tuned_models.jl:521 [inlined]
[5] macro expansion
@ ./task.jl:476 [inlined]
[6] assemble_events!(metamodels::Vector{MLJDecisionTreeInterface.RandomForestClassifier}, resampling_machine::Machine{Resampler{Holdout, MLFlowLogger}, false}, verbosity::Int64, tuning::RandomSearch, history::Nothing, state::Vector{Tuple{Symbol, MLJBase.NumericSampler{Float64, Distributions.Uniform{Float64}, Symbol}}}, acceleration::CPUProcesses{Nothing})
@ MLJTuning ~/MLJ/MLJTuning/src/tuned_models.jl:502
[7] build!(history::Nothing, n::Int64, tuning::RandomSearch, model::MLJDecisionTreeInterface.RandomForestClassifier, model_buffer::Channel{Any}, state::Vector{Tuple{Symbol, MLJBase.NumericSampler{Float64, Distributions.Uniform{Float64}, Symbol}}}, verbosity::Int64, acceleration::CPUProcesses{Nothing}, resampling_machine::Machine{Resampler{Holdout, MLFlowLogger}, false})
@ MLJTuning ~/MLJ/MLJTuning/src/tuned_models.jl:675
[8] fit(::MLJTuning.ProbabilisticTunedModel{RandomSearch, MLJDecisionTreeInterface.RandomForestClassifier, MLFlowLogger}, ::Int64, ::Tables.MatrixTable{Matrix{Float64}}, ::CategoricalArrays.CategoricalVector{Int64, UInt32, Int64, CategoricalArrays.CategoricalValue{Int64, UInt32}, Union{}})
@ MLJTuning ~/MLJ/MLJTuning/src/tuned_models.jl:756
[9] fit_only!(mach::Machine{MLJTuning.ProbabilisticTunedModel{RandomSearch, MLJDecisionTreeInterface.RandomForestClassifier, MLFlowLogger}, false}; rows::Nothing, verbosity::Int64, force::Bool, composite::Nothing)
@ MLJBase ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:680
[10] fit_only!
@ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:606 [inlined]
[11] #fit!#63
@ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:778 [inlined]
[12] fit!
@ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:775 [inlined]
[13] |>(x::Machine{MLJTuning.ProbabilisticTunedModel{RandomSearch, MLJDecisionTreeInterface.RandomForestClassifier, MLFlowLogger}, false}, f::typeof(fit!))
@ Base ./operators.jl:907
[14] top-level scope
@ REPL[16]:1
nested task error: On worker 3:
HTTP.Exceptions.StatusError(400, "POST", "/api/2.0/mlflow/experiments/create", HTTP.Messages.Response:
"""
HTTP/1.1 400 Bad Request
Server: gunicorn
Date: Sun, 24 Sep 2023 20:07:23 GMT
Connection: close
Content-Type: application/json
Content-Length: 89
{"error_code": "RESOURCE_ALREADY_EXISTS", "message": "Experiment 'rock' already exists."}""")
Stacktrace:
[1] #mlfpost#3
@ ~/.julia/packages/MLFlowClient/Szkbv/src/utils.jl:74
[2] mlfpost
@ ~/.julia/packages/MLFlowClient/Szkbv/src/utils.jl:66 [inlined]
[3] #createexperiment#6
@ ~/.julia/packages/MLFlowClient/Szkbv/src/experiments.jl:21
[4] createexperiment
@ ~/.julia/packages/MLFlowClient/Szkbv/src/experiments.jl:16 [inlined]
[5] #getorcreateexperiment#7
@ ~/.julia/packages/MLFlowClient/Szkbv/src/experiments.jl:103 [inlined]
[6] log_evaluation
@ ~/.julia/packages/MLJFlow/TqEtw/src/base.jl:2
[7] evaluate!
@ ~/.julia/packages/MLJBase/ByFwA/src/resampling.jl:1314
[8] evaluate!
@ ~/.julia/packages/MLJBase/ByFwA/src/resampling.jl:1335
[9] fit
@ ~/.julia/packages/MLJBase/ByFwA/src/resampling.jl:1494
[10] #fit_only!#57
@ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:680
[11] fit_only!
@ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:606 [inlined]
[12] #fit!#63
@ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:778 [inlined]
[13] fit!
@ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:775 [inlined]
[14] event!
@ ~/MLJ/MLJTuning/src/tuned_models.jl:443
[15] macro expansion
@ ~/MLJ/MLJTuning/src/tuned_models.jl:522 [inlined]
[16] #39
@ /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/macros.jl:288
[17] #invokelatest#2
@ ./essentials.jl:816
[18] invokelatest
@ ./essentials.jl:813
[19] #110
@ /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:285
[20] run_work_thunk
@ /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:70
[21] macro expansion
@ /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:285 [inlined]
[22] #109
@ ./task.jl:514
Stacktrace:
[1] remotecall_fetch(::Function, ::Distributed.Worker, ::Function, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Distributed /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:465
[2] remotecall_fetch(::Function, ::Distributed.Worker, ::Function, ::Vararg{Any})
@ Distributed /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:454
[3] #remotecall_fetch#162
@ /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:492 [inlined]
[4] remotecall_fetch
@ /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:492 [inlined]
[5] (::Distributed.var"#175#176"{typeof(vcat), MLJTuning.var"#39#42"{Machine{Resampler{Holdout, MLFlowLogger}, false}, Int64, RandomSearch, Nothing, Vector{Tuple{Symbol, MLJBase.NumericSampler{Float64, Distributions.Uniform{Float64}, Symbol}}}, RemoteChannel{Channel{Bool}}}, Vector{MLJDecisionTreeInterface.RandomForestClassifier}, Vector{UnitRange{Int64}}, Int64, Int64})()
@ Distributed /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/macros.jl:270
Okay, see here for a MWE: https://github.com/JuliaAI/MLFlowClient.jl/issues/40
Revisiting this issue after a few months.
It looks like the multithreading issue is unlikely to be addressed soon. Perhaps we can proceed with this PR after strictly ruling out logging for the parallel modes. For example, if `logger` is different from `nothing`, and either `acceleration` or `acceleration_resampling` is different from `CPU1()`, then `clean!` resets the accelerations to `CPU1()` and issues a message saying what it has done and why. The `clean!` code is here.
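A minimal sketch of the proposed guard (all names below are hypothetical stand-ins; the real `clean!` lives in MLJTuning and operates on the actual model struct and MLJ acceleration types):

```julia
# Hypothetical sketch of the proposed `clean!` guard: when a logger is
# set, any parallel acceleration falls back to serial and a message
# explaining the reset is returned.
struct CPU1Sketch end          # stand-in for CPU1()
struct CPUThreadsSketch end    # stand-in for CPUThreads()

mutable struct TunedModelStub
    logger::Union{Nothing,String}
    acceleration::Any
    acceleration_resampling::Any
end

function clean_sketch!(model::TunedModelStub)
    message = ""
    model.logger === nothing && return message
    if !(model.acceleration isa CPU1Sketch) ||
       !(model.acceleration_resampling isa CPU1Sketch)
        model.acceleration = CPU1Sketch()
        model.acceleration_resampling = CPU1Sketch()
        message = "Logging is not supported for parallel acceleration; " *
                  "resetting `acceleration` and `acceleration_resampling` " *
                  "to CPU1()."
    end
    return message
end
```

Returning the message (rather than warning directly) mirrors the MLJ convention where `clean!` reports what it changed and the caller decides how to surface it.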
@pebeto What do you think?
A fix for this issue is not part of the mlflow roadmap (see https://github.com/mlflow/mlflow/issues/11122). However, a workaround is presented in https://github.com/JuliaAI/MLJFlow.jl/pull/36 to ensure our process is thread-safe.
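For context, one generic way to make concurrent logging calls safe (a sketch under assumptions; not necessarily the exact mechanism adopted in MLJFlow.jl#36) is to serialize the non-thread-safe call behind a lock:

```julia
using Base.Threads

# Guard a non-thread-safe operation (here, a vector push standing in
# for the MLflow HTTP call) with a lock so concurrent evaluations
# serialize their requests instead of racing on experiment creation.
const LOG_LOCK = ReentrantLock()
logged = String[]

@threads for i in 1:8
    lock(LOG_LOCK) do
        push!(logged, "run-$i")   # stands in for log_evaluation(...)
    end
end
```

This trades some parallel throughput on the logging step for correctness, which is usually acceptable since the HTTP round-trip is small relative to model evaluation.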