MLJLinearModels.jl
MLJLinearModels.jl copied to clipboard
Comparing`fit!` and `evaluate!` for `LogisticClassifier` with scikit-learn
This issue is closely related to https://github.com/alan-turing-institute/MLJLinearModels.jl/issues/14
I'm trying to compare performance of fit!
and evaluate!
in Julia and analogous methods in scikit-learn in Python. The code and data are available by the link: https://gist.github.com/irublev/6e8f928ef78993922aa4b5d735bf3efc
What concerns Julia code, I run it in Windows 10, Julia version 1.6.1 (2021-04-23), MLJ v0.16.4, MLJBase v0.18.6, MLJLinearModels v0.5.4 (and the Python code was run in Python 3.8.6, scikit-learn 0.23.2, numpy 1.19.2).
And the results are as follows:
Julia | Python | |
---|---|---|
fit! /fit |
[ Info: Training Machine{LogisticClassifier,…} @116. 0.211018 seconds (6.75 k allocations: 23.712 MiB) |
Training took 0.07000088691711426 seconds |
evaluate! /cross_val_score |
Evaluating over 10 folds: 100%[=========================] Time: 0:00:01 1.343403 seconds (94.58 k allocations: 232.942 MiB, 0.60% gc time) |
Scoring took 0.5599837303161621 seconds |
It should be noted that the corresponding Julia code marked with @time
(see https://gist.github.com/irublev/6e8f928ef78993922aa4b5d735bf3efc) was executed several times to exclude compilation time.
And the question I'd like to answer is why MLJ is 2.5-3 times slower than scikit-learn. I tried to make parameters of models in Julia and in Python as close as possible.
But I have not found the way to configure solvers. In Python we have the following list of parameters:
classifier.get_params()
Out[82]:
{'C': 0.1,
'class_weight': None,
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'l1_ratio': None,
'max_iter': 100,
'multi_class': 'auto',
'n_jobs': None,
'penalty': 'l2',
'random_state': None,
'solver': 'lbfgs',
'tol': 0.0001,
Thus, we have here maximal number of iterations, tolerance, etc. And I'm not sure the comparison above is correct, because I do not know what are the values of the respective parameters for LFBGS solver in Julia.
Could you please say how to make sure that all is configured in Julia exactly as it is in Python? And what may be the reason for such performance of Julia? May it be improved somehow? It does not looke as expect on par or better
as pointed in https://github.com/alan-turing-institute/MLJLinearModels.jl/issues/14
A first note is that you're comparing MLJ to ScikitLearn, not just MLJLinearModels to ScikitLearn, this will incur additional overheads to do with data handling, MLJ has a more sophisticated treatment of your data than SKL and that takes a bit of time, mostly negligible when you move to larger datasets.
If you want to compare "bare" LogisticClassifier, then you have to directly call MLJLinearModels on the raw floating-point data as a matrix, and do the same on the sklearn side. In doing so, you'll get the results mentioned in #14. Note that while in the experiments we ran we did observe performance gains, this is not very important. If one were to write perfect code in Python for LC, the core of the computation it for non-trivial datasets would boil down to calling NumPy primitives which are not written in python and so are comparable in speed to what you get on the Julia side.
In terms of parameters, the parameters from the classifiers can be adjusted in LogisticRegression
(a LC is basically a thresholded LR) (https://github.com/alan-turing-institute/MLJLinearModels.jl/blob/0b2ddf9206662dacb2b9f348f08bce1f85705a16/src/glr/constructors.jl#L117-L126). I'm showing you the code as the formula in the docstring is important if you want to try doing 1-1 comparisons. The main thing to observe is that the convention scikitlearn uses with their penalty parameter (C
) is not universal, and we don't use that one. It's the inverse of λ
. Again this matters if you want to do 1-1 comparisons, a user would typically get those via hyperparameter tuning. You can see here for a comparison for instance: https://github.com/alan-turing-institute/MLJLinearModels.jl/blob/0b2ddf9206662dacb2b9f348f08bce1f85705a16/test/fit/logistic-multinomial.jl#L42-L57
PS: for the python code you should be using LogisticCV but that's a side note.
@tlienart Thank you very much for your detailed answer. But I'm afraid, some points are missed. In general, I'm not trying to blame MLJ or MLJLinearModels, my point is just to compare performance. And I like MLJ very much and I'm eager to obtain much much better performance for MLJ models than SKL. Please do not think I'm critical of MLJ.
- What concerns your point that data handling should be fixed. Yes, I agree. Please look at https://gist.github.com/irublev/6e8f928ef78993922aa4b5d735bf3efc, I pass now not data frames but only matrices to
machine
, and in SLK all was passed as matrices initially. What concerns using of "bare" LogisticClassifer, this was done initially in the very first version. The results did not change significantly, all is 2.5-3 times slower in Julia rather than in Python. Thus, all this is still very far from the results mentioned in https://github.com/alan-turing-institute/MLJLinearModels.jl/issues/14 - What concerns 1-1 comparison with respect to models, if you look at https://gist.github.com/irublev/6e8f928ef78993922aa4b5d735bf3efc, you see that in Julia I pass lambda=10 while in Python I pass C=0.1, this was done again in the very first version of the code just to be able to make the comparison fair (I understand what is the standard way to choose the values of model parameters in ML.)
- Unfortunately, you have not answered the main question: how to configure solvers? (Not the parameters from the classifiers you have mentioned, but the parameters for the solvers to be used). In this case it is difficult to be sure the comparison is 1-1.
- I did not understand why should I use
LogisticCV
(in fact,LogisticRegressionCV
, right?) in the Python code? Are you sureLogisticClassifier
fromMLJLinearModels
corresponds toLogisticRegressionCV
in SKL, notLogisticRegression
? I'm not quite sure of that.
Could you please say what points are missed by me in https://gist.github.com/irublev/6e8f928ef78993922aa4b5d735bf3efc (except the parameters for the solver) to make my comparison 1-1? If nothing is missed, what in this case are your results exactly for the same code?
Once again, thank you very much for you kind help.
Thanks, no worries I didn't take your first message badly, just pointing out things that one would need to be careful of when doing comparisons.
For the solver configuration, you can specify it in fit
with the solver
argument (https://github.com/alan-turing-institute/MLJLinearModels.jl/blob/0b2ddf9206662dacb2b9f348f08bce1f85705a16/src/fit/default.jl#L36-L42). The default solver is LBFGS (https://github.com/alan-turing-institute/MLJLinearModels.jl/blob/0b2ddf9206662dacb2b9f348f08bce1f85705a16/src/fit/default.jl#L23) but you could pass another one or construct your own. The LBFGS solver calls Optim's one as per here: https://github.com/alan-turing-institute/MLJLinearModels.jl/blob/0b2ddf9206662dacb2b9f348f08bce1f85705a16/src/fit/newton.jl#L59-L67.
If you wanted to pass non-default parameters to the LBFGS solver, then you'd have to extend the definition of LBFGS above, and pass the parameters to the Optim.LBFGS(...)
call in the relevant _fit
. We could add this though in my experience this is not useful. Happy to help you do this if you'd like to explore. The parameters that could be exposed are shown here: https://julianlsolvers.github.io/Optim.jl/stable/#algo/lbfgs/#_top
My point about CV can probably be ignored in this conversation, CV just confuses the comparison and should be dropped. What I was saying is that there's a custom algorithm when you want to do CV with Ridge or LR (which is not implemented here but is implemented in sklearn) but it's a separate point not important here.
@tlienart I can confirm that I was able to reproduce the performance problem reported by @irublev .
Folks, I appreciate there may be performance issues and I'm willing to investigate or help people investigating but in order to do this the reports to be precise...
As a result, could I please ask the following:
- use StableRNG and a fixed seed
- build a few random matrices
n x p
with increasingn
andp
andp < n
- build a few random vectors of parameters
- generate target with
sigmoid(X*beta)
thresholded to -1, 1 - apply sklearn and barebone MLJLM (no MLJ in the mix, no CV)
- compare (a) wallclock time (b) loss function
and run this a few times.
This will help in isolating the issue. Thanks!
Hi Thibaut,
I'm currently using MLJLinearModels for a personnal project where I need to run millions of multinomial regressions. Unfortunately, I also encounter a performance bottleneck as noted above. I have tried to provide here a reproducible benchmark comparing with sklearn called from Julia. I think I have followed your guidelines except for the data generation process as I don't think it matters but happy to change if deemed necessary. I use MLJ but it is only to have access to the models, eg I don't use machines.
I have explored 3 dimensions: Number of samples, number of features and number of categories. It seems there is overall a non negligible difference in performance which unfortunately keeps increasing for large samples and large categories. As you say, I don't think we should try to make better than sklearn as it is C behind the hood but getting the same order of magnitute seems reasonable.
I hope this will help you figure out potential improvements leads if you have time to dedicate to this issue. Also, let me know if I can provide more relevant figures or if I missed something.
# Fit the underlying model
using MLJ
using Random
using StableRNGs
using BenchmarkTools
using DataFrames
using Plots
LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0
SKLogisticClassifier = @load LogisticClassifier pkg=ScikitLearn verbosity = 0
mljmodel = LogisticClassifier(lambda=0, fit_intercept=true)
skmodel = SKLogisticClassifier(fit_intercept=true, penalty="none")
results = DataFrame(Id=Int[], P=Int[], N=Int[], NCat=Int[], MLJLTime=Float64[], SKTime=Float64[])
# Experiment 1: Increasing sample size
X = nothing
y = nothing
rng = StableRNG(1234)
p = 4
ncat = 3
for n_samples in [100, 1000, 10_000, 100_000, 1_000_000, 10_000_000]
X = rand(rng, n_samples, p)
y = categorical(rand(rng, 1:ncat, n_samples))
time_mljl = @belapsed MLJ.fit(mljmodel, 1, X, y)
time_skl = @belapsed MLJ.fit(skmodel, 1, X, y)
push!(results, (Id=1, P=p, N=n_samples, NCat=ncat, MLJLTime=time_mljl, SKTime=time_skl))
end
# Experiment 2: Increasing feature dimension
ncat = 3
n_samples = 500_000
for p in [5, 10, 30, 50, 100]
X = rand(rng, n_samples, p)
y = categorical(rand(rng, 1:ncat, n_samples))
time_mljl = @belapsed MLJ.fit(mljmodel, 1, X, y)
time_skl = @belapsed MLJ.fit(skmodel, 1, X, y)
push!(results, (Id=2, P=p, N=n_samples, NCat=ncat, MLJLTime=time_mljl, SKTime=time_skl))
end
# Experiment 3: Increasing number of categories
p = 4
n_samples = 500_000
for ncat in [2, 3, 4, 5, 6, 7, 8, 9, 10]
X = rand(rng, n_samples, p)
y = categorical(rand(rng, 1:ncat, n_samples))
time_mljl = @belapsed MLJ.fit(mljmodel, 1, X, y)
time_skl = @belapsed MLJ.fit(skmodel, 1, X, y)
push!(results, (Id=3, P=p, N=n_samples, NCat=ncat, MLJLTime=time_mljl, SKTime=time_skl))
end
# Plotting
first_exp = filter(:Id => ==(1), results)
plot(first_exp.N, Matrix(first_exp[!, [:MLJLTime, :SKTime]]),
yaxis="Time(s)",
xaxis="N samples",
title="Logistic regression performance (p=4, ncat=3)",
label=["MLJLinearModels" "SKLearn"])
savefig("nsamples.png")
second_exp = filter(:Id => ==(2), results)
plot(second_exp.P, Matrix(second_exp[!, [:MLJLTime, :SKTime]]),
yaxis="Time(s)",
xaxis="N features",
title="Logistic regression performance (n=500 000, ncat=3)",
label=["MLJLinearModels" "SKLearn"])
savefig("nfeatures.png")
third_exp = filter(:Id => ==(3), results)
plot(third_exp.NCat, Matrix(third_exp[!, [:MLJLTime, :SKTime]]),
yaxis="Time(s)",
xaxis="N categorties",
title="Logistic regression performance (n=500 000, p=4)",
label=["MLJLinearModels" "SKLearn"])
savefig("ncategories.png")
.
Thanks for doing this analysis carefully, I've done something very similar to you except removing confounders as much as possible (removing MLJ, removing CategoricalArrays, ensuring they use the same loss functions etc.). Reproducible notebook attached is, ran with Julia 1.6.2.
One thing that should be noted is that the way the algorithms are stopped are different which can affect the timings (a bit); but there shouldn't be huge discrepancies. Since MLJLM calls Optim.jl in the background; we could try to expose more of the parameters of Optim.jl.
Note: I ran this on an Intel i7; some of the numbers you show in your graph seem to suggest you have a machine that is up to 3x faster than mine which I find pretty impressive but that's for another discussion.
Exp 1: N increases
p fixed at 4, n_cat at 3 x axis is N (log) y axis is "MLJLM / SKL" first one is time, second one is loss.
Comments:
- speedwise, I don't see much to worry about here, it goes towards 1 as N increases which would be to be expected; there's a bump as you can see which would be interesting to investigate around 1e5 - 1e6 but we'd have to run this benchmark multiple times with multiple seeds to verify that it's not just a random bump
- losswise, MLJLM is better, not that this matters a lot (but has to be factored in when considering speed)
Exp 2: p increases
n fixed at 100_000, n_cat at 3. x axis is p y axis is "MLJLM / SKL" first one is time, second one is loss.
Comments:
- here there does seem to be a persistent overhead but this should be mitigated with the fact that the loss is significantly better for MLJ than for SKL, potentially "too much" so (in that it may be that MLJLM just does more work than is necessary to get "good enough" accuracy)
Exp 3: c increase
p fixed at 4 n fixed at 1e5
x axis is c y axis is "MLJLM / SKL" first one is time, second one is loss.
Comments:
- here again, in terms of loss MLJLM might be doing too much work; still though this trend in terms of relative time when the number of categories increases would be worth investigating
What I'm taking from this is that it might be worth investigating along the following lines:
- interrupting the search faster / using the same criterion than sklearn
- if there are still discrepancies (particularly in terms of
c
increasing and the ratio being unfavourable to MLJLM), it might be worth investigating whether we can further improve the evaluation offg!
for the MultinomialLoss
Witnessing another pretty serious performance issue on a dataset I'm working with, using multinomial regression. I think both scikit-learn and MLJLinearModels use LBFGS:
- MLJScikitLearnInterface: around 5 secs (mean log_loss: 1.0144214569201389)
- MLJLinearModels: around 150 secs (mean log_loss: 1.014812901641976)
We are talking 30 folds faster while the final losses are pretty similar. This seems embarassingly huge so please let me know if there is anything wrong with the experiment below.
The data: data_multinomial_perf.csv.zip
I'd be happy to help investigate sensitivity to major hyperparameters and how they might affect loss but there's not much exposed in the API. Also when looking at scikit-learn / Optim I can see that the APIs differ quite a bit so I don't really know if both algorithms are on the same footing. Since this issue has been open for quite some time and is becoming critical for my project, can I ask whether anyone will have bandwidth/skills to look into it ?
Code to reproduce:
using CSV
using DataFrames
using MLJScikitLearnInterface
using MLJLinearModels
using MLJBase
using CategoricalArrays
data = CSV.read("/Users/olivierlabayle/data_multinomial_perf.csv", DataFrame)
X = data[!, ["X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"]]
y = categorical(data.y)
model_mlj = MLJLinearModels.LogisticClassifier(penalty=:none)
fr_mlj = MLJBase.fit(model_mlj, 1, X, y)
ŷ_mlj = MLJBase.predict(model_mlj, fr_mlj[1], X)
mean_ll_mlj = mean(log_loss(ŷ_mlj, y))
@time MLJBase.fit(model_mlj, 1, X, y);
model_skl = MLJScikitLearnInterface.LogisticClassifier(penalty="none", tol=1e-6)
fr_skl = MLJBase.fit(model_skl, 1, X, y)
ŷ_skl = MLJBase.predict(model_skl, fr_skl[1], X)
mean_ll_skl = mean(log_loss(ŷ_skl, y))
@time MLJBase.fit(model_skl, 1, X, y);
Thanks for providing the data; your example is nice and it would be great to open a PR against the docs to discuss how to pass optimizer options.
The code below discards some MLJ stuff to keep things simple but I show further down how to adjust your code. Take aways are:
- default optim parameters lead to a long run (as you saw)
- passing
f_tol
parameter to optim leads to similar time as Sklearn - objective function evaluated at parameters in my case (you might get different results with different seeds):
- default LBFGS (80s) objective =
458_663
- LBFGS with
f_tol=1e-6
(2s) objective =458_840
- SKL (2s) objective =
615 536
- default LBFGS (80s) objective =
As explained in the discussion above the point is not to show that MLJ gets to a "better" value (meaningless in this context) but rather that in the same time we get to a comparable objective.
using CSV
using DataFrames
using MLJScikitLearnInterface
using MLJLinearModels
using MLJBase
using CategoricalArrays
using Optim
data = CSV.read("data_multinomial_perf.csv", DataFrame)
X = data[!, ["X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"]]
y = categorical(data.y)
enc(e) = e == "TC" ? 1 : e == "CC" ? 2 : 3
y_enc = [enc(e) for e in y]
X_enc = Matrix(X)
mc = MLJLinearModels.MultinomialRegression(penalty=:none, nclasses=3)
@time theta_1 = MLJLinearModels.fit(mc, X_enc, y_enc)
@time theta_2 = MLJLinearModels.fit(mc, X_enc, y_enc; solver=MLJLinearModels.LBFGS(optim_options=Optim.Options(f_tol=1e-6)))
J = MLJLinearModels.objective(mc, X_enc, y_enc, c=3)
J(theta_1) # got 458 663
J(theta_2) # got 458 840
# model_mlj = MLJLinearModels.LogisticClassifier(penalty=:none)
# fr_mlj = MLJBase.fit(model_mlj, 1, X, y)
# ŷ_mlj = MLJBase.predict(model_mlj, fr_mlj[1], X)
# mean_ll_mlj = mean(log_loss(ŷ_mlj, y))
# @time MLJBase.fit(model_mlj, 1, X, y);
model_skl = MLJScikitLearnInterface.LogisticClassifier(penalty="none", tol=1e-6)
fr_skl = MLJBase.fit(model_skl, 1, X, y)
ŷ_skl = MLJBase.predict(model_skl, fr_skl[1], X)
mean_ll_skl = mean(log_loss(ŷ_skl, y))
@time MLJBase.fit(model_skl, 1, X, y);
(fm, _), _ = fr_skl
J(vec(vcat(fm.coef_', fm.intercept_'))) # got 615 536
To adjust your code:
import Optim
model_mlj = MLJLinearModels.MultinomialClassifier(
penalty=:none,
nclasses=3,
solver=MLJLinearModels.LBFGS(optim_options=Optim.Options(f_tol=1e-6))
)
It would be great if you extended the docs a bit to indicate how to do this.
Thanks for the explanation on how to use the Optim package with MLJLinearModels and happy to see that there's a way forward. I've got two further questions:
- On my end the above code does raises because the keyword argument definition of LBFGS does not seem to be found. Which version are you using? I am on v0.8.0.
using MLJLinearModels
using Optim
MLJLinearModels.LBFGS(optim_options=Optim.Options(f_tol=1e-6))
ERROR: MethodError: no method matching MLJLinearModels.LBFGS(; optim_options=Optim.Options(x_abstol = 0.0, x_reltol = 0.0, f_abstol = 0.0, f_reltol = 1.0e-6, g_abstol = 1.0e-8, g_reltol = 1.0e-8, outer_x_abstol = 0.0, outer_x_reltol = 0.0, outer_f_abstol = 0.0, outer_f_reltol = 0.0, outer_g_abstol = 1.0e-8, outer_g_reltol = 1.0e-8, f_calls_limit = 0, g_calls_limit = 0, h_calls_limit = 0, allow_f_increases = true, allow_outer_f_increases = true, successive_f_tol = 1, iterations = 1000, outer_iterations = 1000, store_trace = false, trace_simplex = false, show_trace = false, extended_trace = false, show_every = 1, time_limit = NaN, )
)
Closest candidates are:
MLJLinearModels.LBFGS() at ~/.julia/packages/MLJLinearModels/Goxow/src/fit/solvers.jl:62 got unsupported keyword argument "optim_options"
Stacktrace:
[1] top-level scope
- Do you think it would be worth changing the default Optim
f_tol
argument, or at least exposing it more explicitely since it seems to be quite important? I am wondering whether a new user will want to take the trouble to tweak the Optim settings. It could be that the default optimization settings are too stringent for what we care about in ML?
- There will be a patch release within 15 minutes, looks like we forgot to tag one after exposing the optim parameters.
- That would be a sensible PR; sklearn uses
1e-4
as default.
Would you mind giving me access to the repo? I will try to open a PR in the coming week.