MLJLinearModels.jl Comparing`fit!` and `evaluate!` for `LogisticClassifier` with scikit-learn

This issue is closely related to https://github.com/alan-turing-institute/MLJLinearModels.jl/issues/14

I'm trying to compare performance of fit! and evaluate! in Julia and analogous methods in scikit-learn in Python. The code and data are available by the link: https://gist.github.com/irublev/6e8f928ef78993922aa4b5d735bf3efc

What concerns Julia code, I run it in Windows 10, Julia version 1.6.1 (2021-04-23), MLJ v0.16.4, MLJBase v0.18.6, MLJLinearModels v0.5.4 (and the Python code was run in Python 3.8.6, scikit-learn 0.23.2, numpy 1.19.2).

And the results are as follows:

	Julia	Python
`fit!`/`fit`	`[ Info: Training Machine{LogisticClassifier,…} @116.` `0.211018 seconds (6.75 k allocations: 23.712 MiB)`	`Training took 0.07000088691711426 seconds`
`evaluate!`/`cross_val_score`	`Evaluating over 10 folds: 100%[=========================] Time: 0:00:01` `1.343403 seconds (94.58 k allocations: 232.942 MiB, 0.60% gc time)`	`Scoring took 0.5599837303161621 seconds`

It should be noted that the corresponding Julia code marked with @time (see https://gist.github.com/irublev/6e8f928ef78993922aa4b5d735bf3efc) was executed several times to exclude compilation time.

And the question I'd like to answer is why MLJ is 2.5-3 times slower than scikit-learn. I tried to make parameters of models in Julia and in Python as close as possible.

But I have not found the way to configure solvers. In Python we have the following list of parameters:

classifier.get_params()
Out[82]: 
{'C': 0.1,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,

Thus, we have here maximal number of iterations, tolerance, etc. And I'm not sure the comparison above is correct, because I do not know what are the values of the respective parameters for LFBGS solver in Julia.

Could you please say how to make sure that all is configured in Julia exactly as it is in Python? And what may be the reason for such performance of Julia? May it be improved somehow? It does not looke as expect on par or better as pointed in https://github.com/alan-turing-institute/MLJLinearModels.jl/issues/14

May 22 '21 17:05 irublev

A first note is that you're comparing MLJ to ScikitLearn, not just MLJLinearModels to ScikitLearn, this will incur additional overheads to do with data handling, MLJ has a more sophisticated treatment of your data than SKL and that takes a bit of time, mostly negligible when you move to larger datasets.

If you want to compare "bare" LogisticClassifier, then you have to directly call MLJLinearModels on the raw floating-point data as a matrix, and do the same on the sklearn side. In doing so, you'll get the results mentioned in #14. Note that while in the experiments we ran we did observe performance gains, this is not very important. If one were to write perfect code in Python for LC, the core of the computation it for non-trivial datasets would boil down to calling NumPy primitives which are not written in python and so are comparable in speed to what you get on the Julia side.

In terms of parameters, the parameters from the classifiers can be adjusted in LogisticRegression (a LC is basically a thresholded LR) (https://github.com/alan-turing-institute/MLJLinearModels.jl/blob/0b2ddf9206662dacb2b9f348f08bce1f85705a16/src/glr/constructors.jl#L117-L126). I'm showing you the code as the formula in the docstring is important if you want to try doing 1-1 comparisons. The main thing to observe is that the convention scikitlearn uses with their penalty parameter (C) is not universal, and we don't use that one. It's the inverse of λ. Again this matters if you want to do 1-1 comparisons, a user would typically get those via hyperparameter tuning. You can see here for a comparison for instance: https://github.com/alan-turing-institute/MLJLinearModels.jl/blob/0b2ddf9206662dacb2b9f348f08bce1f85705a16/test/fit/logistic-multinomial.jl#L42-L57

PS: for the python code you should be using LogisticCV but that's a side note.

May 24 '21 09:05 tlienart

@tlienart Thank you very much for your detailed answer. But I'm afraid, some points are missed. In general, I'm not trying to blame MLJ or MLJLinearModels, my point is just to compare performance. And I like MLJ very much and I'm eager to obtain much much better performance for MLJ models than SKL. Please do not think I'm critical of MLJ.

What concerns your point that data handling should be fixed. Yes, I agree. Please look at https://gist.github.com/irublev/6e8f928ef78993922aa4b5d735bf3efc, I pass now not data frames but only matrices to machine, and in SLK all was passed as matrices initially. What concerns using of "bare" LogisticClassifer, this was done initially in the very first version. The results did not change significantly, all is 2.5-3 times slower in Julia rather than in Python. Thus, all this is still very far from the results mentioned in https://github.com/alan-turing-institute/MLJLinearModels.jl/issues/14
What concerns 1-1 comparison with respect to models, if you look at https://gist.github.com/irublev/6e8f928ef78993922aa4b5d735bf3efc, you see that in Julia I pass lambda=10 while in Python I pass C=0.1, this was done again in the very first version of the code just to be able to make the comparison fair (I understand what is the standard way to choose the values of model parameters in ML.)
Unfortunately, you have not answered the main question: how to configure solvers? (Not the parameters from the classifiers you have mentioned, but the parameters for the solvers to be used). In this case it is difficult to be sure the comparison is 1-1.
I did not understand why should I use LogisticCV (in fact, LogisticRegressionCV, right?) in the Python code? Are you sure LogisticClassifier from MLJLinearModels corresponds to LogisticRegressionCV in SKL, not LogisticRegression? I'm not quite sure of that.

Could you please say what points are missed by me in https://gist.github.com/irublev/6e8f928ef78993922aa4b5d735bf3efc (except the parameters for the solver) to make my comparison 1-1? If nothing is missed, what in this case are your results exactly for the same code?

Once again, thank you very much for you kind help.

May 24 '21 10:05 irublev

Thanks, no worries I didn't take your first message badly, just pointing out things that one would need to be careful of when doing comparisons.

For the solver configuration, you can specify it in fit with the solver argument (https://github.com/alan-turing-institute/MLJLinearModels.jl/blob/0b2ddf9206662dacb2b9f348f08bce1f85705a16/src/fit/default.jl#L36-L42). The default solver is LBFGS (https://github.com/alan-turing-institute/MLJLinearModels.jl/blob/0b2ddf9206662dacb2b9f348f08bce1f85705a16/src/fit/default.jl#L23) but you could pass another one or construct your own. The LBFGS solver calls Optim's one as per here: https://github.com/alan-turing-institute/MLJLinearModels.jl/blob/0b2ddf9206662dacb2b9f348f08bce1f85705a16/src/fit/newton.jl#L59-L67.

If you wanted to pass non-default parameters to the LBFGS solver, then you'd have to extend the definition of LBFGS above, and pass the parameters to the Optim.LBFGS(...) call in the relevant _fit. We could add this though in my experience this is not useful. Happy to help you do this if you'd like to explore. The parameters that could be exposed are shown here: https://julianlsolvers.github.io/Optim.jl/stable/#algo/lbfgs/#_top

My point about CV can probably be ignored in this conversation, CV just confuses the comparison and should be dropped. What I was saying is that there's a custom algorithm when you want to do CV with Ridge or LR (which is not implemented here but is implemented in sklearn) but it's a separate point not important here.

May 25 '21 08:05 tlienart

@tlienart I can confirm that I was able to reproduce the performance problem reported by @irublev .

Jun 17 '21 11:06 pgagarinov

Folks, I appreciate there may be performance issues and I'm willing to investigate or help people investigating but in order to do this the reports to be precise...

As a result, could I please ask the following:

use StableRNG and a fixed seed
build a few random matrices n x p with increasing n and p and p < n
build a few random vectors of parameters
generate target with sigmoid(X*beta) thresholded to -1, 1
apply sklearn and barebone MLJLM (no MLJ in the mix, no CV)
compare (a) wallclock time (b) loss function

and run this a few times.

This will help in isolating the issue. Thanks!

Jun 17 '21 14:06 tlienart

Hi Thibaut,

I'm currently using MLJLinearModels for a personnal project where I need to run millions of multinomial regressions. Unfortunately, I also encounter a performance bottleneck as noted above. I have tried to provide here a reproducible benchmark comparing with sklearn called from Julia. I think I have followed your guidelines except for the data generation process as I don't think it matters but happy to change if deemed necessary. I use MLJ but it is only to have access to the models, eg I don't use machines.

I have explored 3 dimensions: Number of samples, number of features and number of categories. It seems there is overall a non negligible difference in performance which unfortunately keeps increasing for large samples and large categories. As you say, I don't think we should try to make better than sklearn as it is C behind the hood but getting the same order of magnitute seems reasonable.

I hope this will help you figure out potential improvements leads if you have time to dedicate to this issue. Also, let me know if I can provide more relevant figures or if I missed something.

# Fit the underlying model
using MLJ
using Random
using StableRNGs
using BenchmarkTools
using DataFrames
using Plots

LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0
SKLogisticClassifier = @load LogisticClassifier pkg=ScikitLearn verbosity = 0

mljmodel = LogisticClassifier(lambda=0, fit_intercept=true)
skmodel = SKLogisticClassifier(fit_intercept=true, penalty="none")

results = DataFrame(Id=Int[], P=Int[], N=Int[], NCat=Int[], MLJLTime=Float64[], SKTime=Float64[])
# Experiment 1: Increasing sample size
X = nothing
y = nothing
rng = StableRNG(1234)
p = 4
ncat = 3
for n_samples in [100, 1000, 10_000, 100_000, 1_000_000, 10_000_000]
    X = rand(rng, n_samples, p)
    y = categorical(rand(rng, 1:ncat, n_samples))
    time_mljl = @belapsed MLJ.fit(mljmodel, 1, X, y)
    time_skl = @belapsed MLJ.fit(skmodel, 1, X, y)
    push!(results, (Id=1, P=p, N=n_samples, NCat=ncat, MLJLTime=time_mljl, SKTime=time_skl))
end

# Experiment 2: Increasing feature dimension
ncat = 3
n_samples = 500_000
for p in [5, 10, 30, 50, 100]
    X = rand(rng, n_samples, p)
    y = categorical(rand(rng, 1:ncat, n_samples))
    time_mljl = @belapsed MLJ.fit(mljmodel, 1, X, y)
    time_skl = @belapsed MLJ.fit(skmodel, 1, X, y)
    push!(results, (Id=2, P=p, N=n_samples, NCat=ncat, MLJLTime=time_mljl, SKTime=time_skl))
end

# Experiment 3: Increasing number of categories
p = 4
n_samples = 500_000
for ncat in [2, 3, 4, 5, 6, 7, 8, 9, 10]
    X = rand(rng, n_samples, p)
    y = categorical(rand(rng, 1:ncat, n_samples))
    time_mljl = @belapsed MLJ.fit(mljmodel, 1, X, y)
    time_skl = @belapsed MLJ.fit(skmodel, 1, X, y)
    push!(results, (Id=3, P=p, N=n_samples, NCat=ncat, MLJLTime=time_mljl, SKTime=time_skl))
end

# Plotting

first_exp = filter(:Id => ==(1), results)
plot(first_exp.N, Matrix(first_exp[!, [:MLJLTime, :SKTime]]),
    yaxis="Time(s)",
    xaxis="N samples",
    title="Logistic regression performance (p=4, ncat=3)", 
    label=["MLJLinearModels" "SKLearn"])

savefig("nsamples.png")

second_exp = filter(:Id => ==(2), results)
plot(second_exp.P, Matrix(second_exp[!, [:MLJLTime, :SKTime]]),
    yaxis="Time(s)",
    xaxis="N features",
    title="Logistic regression performance (n=500 000, ncat=3)", 
    label=["MLJLinearModels" "SKLearn"])

    
savefig("nfeatures.png")

third_exp = filter(:Id => ==(3), results)
plot(third_exp.NCat, Matrix(third_exp[!, [:MLJLTime, :SKTime]]),
    yaxis="Time(s)",
    xaxis="N categorties",
    title="Logistic regression performance (n=500 000, p=4)", 
    label=["MLJLinearModels" "SKLearn"])

savefig("ncategories.png")

ncategories nfeatures nsamples .

Oct 23 '21 21:10 olivierlabayle

Thanks for doing this analysis carefully, I've done something very similar to you except removing confounders as much as possible (removing MLJ, removing CategoricalArrays, ensuring they use the same loss functions etc.). Reproducible notebook attached is, ran with Julia 1.6.2.

One thing that should be noted is that the way the algorithms are stopped are different which can affect the timings (a bit); but there shouldn't be huge discrepancies. Since MLJLM calls Optim.jl in the background; we could try to expose more of the parameters of Optim.jl.

Note: I ran this on an Intel i7; some of the numbers you show in your graph seem to suggest you have a machine that is up to 3x faster than mine which I find pretty impressive but that's for another discussion.

Exp 1: N increases

p fixed at 4, n_cat at 3 x axis is N (log) y axis is "MLJLM / SKL" first one is time, second one is loss.

Comments:

speedwise, I don't see much to worry about here, it goes towards 1 as N increases which would be to be expected; there's a bump as you can see which would be interesting to investigate around 1e5 - 1e6 but we'd have to run this benchmark multiple times with multiple seeds to verify that it's not just a random bump
losswise, MLJLM is better, not that this matters a lot (but has to be factored in when considering speed)

Exp 2: p increases

n fixed at 100_000, n_cat at 3. x axis is p y axis is "MLJLM / SKL" first one is time, second one is loss.

Comments:

here there does seem to be a persistent overhead but this should be mitigated with the fact that the loss is significantly better for MLJ than for SKL, potentially "too much" so (in that it may be that MLJLM just does more work than is necessary to get "good enough" accuracy)

Exp 3: c increase

p fixed at 4 n fixed at 1e5

x axis is c y axis is "MLJLM / SKL" first one is time, second one is loss.

Comments:

here again, in terms of loss MLJLM might be doing too much work; still though this trend in terms of relative time when the number of categories increases would be worth investigating

What I'm taking from this is that it might be worth investigating along the following lines:

interrupting the search faster / using the same criterion than sklearn
if there are still discrepancies (particularly in terms of c increasing and the ratio being unfavourable to MLJLM), it might be worth investigating whether we can further improve the evaluation of fg! for the MultinomialLoss

benchmark_mljlm.ipynb.zip

Oct 25 '21 12:10 tlienart

Witnessing another pretty serious performance issue on a dataset I'm working with, using multinomial regression. I think both scikit-learn and MLJLinearModels use LBFGS:

MLJScikitLearnInterface: around 5 secs (mean log_loss: 1.0144214569201389)
MLJLinearModels: around 150 secs (mean log_loss: 1.014812901641976)

We are talking 30 folds faster while the final losses are pretty similar. This seems embarassingly huge so please let me know if there is anything wrong with the experiment below.

The data: data_multinomial_perf.csv.zip

I'd be happy to help investigate sensitivity to major hyperparameters and how they might affect loss but there's not much exposed in the API. Also when looking at scikit-learn / Optim I can see that the APIs differ quite a bit so I don't really know if both algorithms are on the same footing. Since this issue has been open for quite some time and is becoming critical for my project, can I ask whether anyone will have bandwidth/skills to look into it ?

Code to reproduce:

using CSV
using DataFrames
using MLJScikitLearnInterface
using MLJLinearModels
using MLJBase
using CategoricalArrays

data = CSV.read("/Users/olivierlabayle/data_multinomial_perf.csv", DataFrame)

X = data[!, ["X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"]]
y = categorical(data.y)

model_mlj = MLJLinearModels.LogisticClassifier(penalty=:none)
fr_mlj = MLJBase.fit(model_mlj, 1, X, y)
ŷ_mlj = MLJBase.predict(model_mlj, fr_mlj[1], X)
mean_ll_mlj = mean(log_loss(ŷ_mlj, y))
@time MLJBase.fit(model_mlj, 1, X, y);

model_skl = MLJScikitLearnInterface.LogisticClassifier(penalty="none", tol=1e-6)
fr_skl = MLJBase.fit(model_skl, 1, X, y)
ŷ_skl = MLJBase.predict(model_skl, fr_skl[1], X)
mean_ll_skl = mean(log_loss(ŷ_skl, y))
@time MLJBase.fit(model_skl, 1, X, y);

Jan 24 '23 16:01 olivierlabayle

Thanks for providing the data; your example is nice and it would be great to open a PR against the docs to discuss how to pass optimizer options.

The code below discards some MLJ stuff to keep things simple but I show further down how to adjust your code. Take aways are:

default optim parameters lead to a long run (as you saw)
passing f_tol parameter to optim leads to similar time as Sklearn
objective function evaluated at parameters in my case (you might get different results with different seeds):
- default LBFGS (80s) objective = 458_663
- LBFGS with f_tol=1e-6 (2s) objective = 458_840
- SKL (2s) objective = 615 536

As explained in the discussion above the point is not to show that MLJ gets to a "better" value (meaningless in this context) but rather that in the same time we get to a comparable objective.

using CSV
using DataFrames
using MLJScikitLearnInterface
using MLJLinearModels
using MLJBase
using CategoricalArrays
using Optim

data = CSV.read("data_multinomial_perf.csv", DataFrame)

X = data[!, ["X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"]]
y = categorical(data.y)

enc(e) = e == "TC" ? 1 : e == "CC" ? 2 : 3

y_enc = [enc(e) for e in y]
X_enc = Matrix(X)

mc = MLJLinearModels.MultinomialRegression(penalty=:none, nclasses=3)
@time theta_1 = MLJLinearModels.fit(mc, X_enc, y_enc)

@time theta_2 = MLJLinearModels.fit(mc, X_enc, y_enc; solver=MLJLinearModels.LBFGS(optim_options=Optim.Options(f_tol=1e-6)))

J = MLJLinearModels.objective(mc, X_enc, y_enc, c=3)
J(theta_1)   # got 458 663
J(theta_2)  # got 458 840

# model_mlj = MLJLinearModels.LogisticClassifier(penalty=:none)
# fr_mlj = MLJBase.fit(model_mlj, 1, X, y)
# ŷ_mlj = MLJBase.predict(model_mlj, fr_mlj[1], X)
# mean_ll_mlj = mean(log_loss(ŷ_mlj, y))
# @time MLJBase.fit(model_mlj, 1, X, y);

model_skl = MLJScikitLearnInterface.LogisticClassifier(penalty="none", tol=1e-6)
fr_skl = MLJBase.fit(model_skl, 1, X, y)
ŷ_skl = MLJBase.predict(model_skl, fr_skl[1], X)
mean_ll_skl = mean(log_loss(ŷ_skl, y))
@time MLJBase.fit(model_skl, 1, X, y);

(fm, _), _ = fr_skl

J(vec(vcat(fm.coef_', fm.intercept_'))) # got 615 536

To adjust your code:

import Optim
model_mlj = MLJLinearModels.MultinomialClassifier(
    penalty=:none,
    nclasses=3,
    solver=MLJLinearModels.LBFGS(optim_options=Optim.Options(f_tol=1e-6))
)

It would be great if you extended the docs a bit to indicate how to do this.

Jan 24 '23 20:01 tlienart

Thanks for the explanation on how to use the Optim package with MLJLinearModels and happy to see that there's a way forward. I've got two further questions:

On my end the above code does raises because the keyword argument definition of LBFGS does not seem to be found. Which version are you using? I am on v0.8.0.

using MLJLinearModels
using Optim
MLJLinearModels.LBFGS(optim_options=Optim.Options(f_tol=1e-6))

ERROR: MethodError: no method matching MLJLinearModels.LBFGS(; optim_options=Optim.Options(x_abstol = 0.0, x_reltol = 0.0, f_abstol = 0.0, f_reltol = 1.0e-6, g_abstol = 1.0e-8, g_reltol = 1.0e-8, outer_x_abstol = 0.0, outer_x_reltol = 0.0, outer_f_abstol = 0.0, outer_f_reltol = 0.0, outer_g_abstol = 1.0e-8, outer_g_reltol = 1.0e-8, f_calls_limit = 0, g_calls_limit = 0, h_calls_limit = 0, allow_f_increases = true, allow_outer_f_increases = true, successive_f_tol = 1, iterations = 1000, outer_iterations = 1000, store_trace = false, trace_simplex = false, show_trace = false, extended_trace = false, show_every = 1, time_limit = NaN, )
)
Closest candidates are:
  MLJLinearModels.LBFGS() at ~/.julia/packages/MLJLinearModels/Goxow/src/fit/solvers.jl:62 got unsupported keyword argument "optim_options"
Stacktrace:
 [1] top-level scope

Do you think it would be worth changing the default Optim f_tol argument, or at least exposing it more explicitely since it seems to be quite important? I am wondering whether a new user will want to take the trouble to tweak the Optim settings. It could be that the default optimization settings are too stringent for what we care about in ML?

Jan 25 '23 10:01 olivierlabayle

There will be a patch release within 15 minutes, looks like we forgot to tag one after exposing the optim parameters.
That would be a sensible PR; sklearn uses 1e-4 as default.

Jan 25 '23 10:01 tlienart

Would you mind giving me access to the repo? I will try to open a PR in the coming week.

Jan 27 '23 10:01 olivierlabayle

MLJLinearModels.jl MLJLinearModels.jl copied to clipboard

Comparing`fit!` and `evaluate!` for `LogisticClassifier` with scikit-learn

Exp 1: N increases

Exp 2: p increases

Exp 3: c increase

MLJLinearModels.jl
MLJLinearModels.jl copied to clipboard