MLJBase.jl
Learning graph in mach
This is a proposal to extend the learning network API. It should be almost non-breaking, and it enables caching and other fit options (like `acceleration`) to be passed to submachines.
Related issues: https://github.com/JuliaAI/MLJBase.jl/issues/756
The proposed routine is to declare a `learning_network(mach::Machine{<:Composite}, X, y; kwargs...)` method instead of the current `fit` for a new composite model. This is exemplified with the `Stack`, while the `Pipeline`s, for instance, are kept on the current API. The only caveat is that learning networks defined that way can only be used inside machines.
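For orientation, a toy definition under the proposed API might look something like this (a sketch only, not code from this PR; `MyComposite`, the returned node, and the keyword plumbing are assumptions for illustration):

using MLJBase

mutable struct MyComposite <: DeterministicComposite
    regressor
end

function MLJBase.learning_network(mach::Machine{<:MyComposite}, X, y; kwargs...)
    Xs = source(X)
    ys = source(y)
    # fit options received by `mach` (cache, acceleration, ...) are assumed
    # to be forwarded to the submachines here:
    submach = machine(mach.model.regressor, Xs, ys; kwargs...)
    return predict(submach, Xs)  # terminal node of the network
end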
If the approach looks encouraging to you and you want to push that further, my remaining TODOs are at least:
- [x] Take care of the update mechanism
- [ ] Maybe add a warning/info to encourage users to use the new API? --> This will break some logging tests that I will need to manage.
- [ ] Update/improve some docstrings: `return!`, ... ?
- [ ] I still don't understand why there is a data field in the cache of a composite model (even when setting `cache=false`), and more generally the anonymization routine, since the data is still available anyway. I think @ablaom, you mentioned we could remove that gymnastics in a previous chat. I would be happy to get some guidance and add this to the PR if this is not breaking either.
Please let me know if there are other things I haven't considered.
Codecov Report

Merging #759 (38a576f) into dev (95efa58) will increase coverage by 0.55%. The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              dev     #759      +/-  ##
==========================================
+ Coverage   85.41%   85.97%   +0.55%
==========================================
  Files          36       36
  Lines        3435     3486      +51
==========================================
+ Hits         2934     2997      +63
+ Misses        501      489      -12

Impacted Files | Coverage Δ
---|---
src/composition/learning_networks/machines.jl | 93.07% <100.00%> (+1.11%) :arrow_up:
src/composition/models/methods.jl | 96.66% <100.00%> (-3.34%) :arrow_down:
src/composition/models/stacking.jl | 94.48% <100.00%> (-0.04%) :arrow_down:
src/machines.jl | 87.50% <100.00%> (+0.83%) :arrow_up:
src/measures/measures.jl | 70.29% <0.00%> (-0.58%) :arrow_down:
src/init.jl | 100.00% <0.00%> (ø)
src/MLJBase.jl | 60.00% <0.00%> (ø)
src/show.jl | 38.27% <0.00%> (+1.23%) :arrow_up:
... and 1 more
@ablaom I think this is ready for review if you have some time to dedicate to it. As an additional demonstration of the benefits of this PR, you'll see on this branch a multithreaded Stack test. I haven't added that to the PR since I thought it was big enough already, but I'm happy to take your comments.
@olivierlabayle I'd like to acknowledge this proposal which I haven't reviewed yet. It sounds interesting, but it may take me a couple weeks to look properly at it.
You have raised a couple of related issues recently (caching, parallelism). Can you help me prioritise my review? Should I look at this one before the others? What are the implications of the current proposal for those others, if any? For example, is it possible that implementing this current proposal could mean the others can be resolved in a non-breaking way?
@ablaom Thank you for getting back to me.
Indeed there are many connected topics that I am currently trying to address. The two most important ones are memory management and parallelism. This PR is an attempt to partially address both with a small design change, while breaking the API only minimally (one corner case, explained below). This is definitely the place to start.
The general idea of the PR is to force the use of machines for a user to interact with `Composite` models (meaning the `fit`/`predict` API). More specifically, if one is ready to give up the use of `fit(composite_model, verbosity, X, y)` etc. and always resort to `fit!(machine(composite_model, X, y))`, then all fitting keyword arguments can be forwarded to the submachines, and the top machine's internals become accessible. This PR exemplifies this with the `Stack` and the forwarding of caching to submachines. In this branch, which is based on top of this PR, you can see the further benefit for parallelization with a forwarding of the `acceleration` parameter to composite models. You basically get multithreading of composite models for free (provided the underlying models are thread-safe, of course).
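Concretely, the intended usage pattern is something like the following (a sketch; `composite_model` stands for any composite instance, and the exact set of forwarded keywords is an assumption):

mach = machine(composite_model, X, y, cache=false)  # `cache` forwarded to every submachine
fit!(mach, acceleration=CPUThreads())               # `acceleration` forwarded likewise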
If you are happy with the PR, the last thing I would like to add, for now, is the removal of the anonymization process, since you said it may be avoidable. I would be happy to get your guidance on that and on the important parts of the code, since I didn't understand the use case in the first place.
I hope this helps.
@olivierlabayle Thanks for looking into this and for your patience. I have now reviewed this PR in some detail.
I have to say I really like the idea in this PR, but there are some drawbacks:
1. Composite models don't implement the full MLJ model interface (no `fit(::Model, ...)` method). This has two important consequences. First, it's breaking: `EnsembleModel(model=...)` (MLJEnsembles.jl) and `BinaryThresholdPredictor(model=...)` (MLJModels.jl) call `fit` on the atomic model, and so won't work with, say, `Stack`, as re-implemented here. Perhaps there are other places. In principle, these could be addressed with some disruption. Secondly, this is a seismic shift conceptually. Currently, machines are a higher-level abstraction for user interaction and composite model building, but there is always the possibility of third-party packages buying into MLJ models at the lower level, which I think is nice. I believe GeoStats.jl actually does this. We often say "composite models in MLJ behave like any other model". Well, they wouldn't after this PR.

2. Rather than `acceleration` being a hyper-parameter of each composite model, which I would find the most natural, it is something apart from the model, and has a global nature: if I call `fit!(composite_machine, acceleration=...)`, then that particular mode of acceleration is applied to training of the learning network for that composite, but also to any nested composite model. This might be convenient behaviour most of the time, but it is less powerful. For example, at the top level you may want distributed parallelism, while at some lower level you want multithreading or no parallelism (see the sketch after this list). Also, I'm not sure nested Distributed parallelism "just works". Acceleration is already a hyper-parameter of some ordinary models (eg, MLJFlux models) and of the model wrappers `TunedModel` and `EnsembleModel`. It seems incongruent to have a different interface point for `Composite` models.
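For comparison, this is how per-level control already looks where `acceleration` is a model hyper-parameter (a sketch with illustrative model choices):

using MLJ

Tree = @load DecisionTreeRegressor pkg=DecisionTree verbosity=0
forest = EnsembleModel(model=Tree(), n=100,
                       acceleration=CPUThreads())      # multithreading at the inner level
tuned_forest = TunedModel(model=forest,
                          range=range(forest, :(model.max_depth), lower=1, upper=10),
                          acceleration=CPUProcesses()) # distributed parallelism at the outer level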
I've had a think about this, and have an alternative suggestion for dealing with `acceleration` which does not have these drawbacks. I realize this does not address #756, but could you take a look at #762 and let me know what you think?
@ablaom Thanks for taking the time to look into this!
- That's definitely true. My original idea was that composition with learning networks is about composing machines, not models. If someone wants to use the learning network API they will have to use the machine interface anyway and depend on MLJBase, so enforcing machines for composition may not be so bad. I appreciate, however, that this is a conceptual change, even though I tried to make it as minimal as possible. By the way, I have tried to run the `Stack` with an `EnsembleModel` and it seems to run just fine, unless I missed anything:
X, y = make_regression(500, 5)

models = (constant=DeterministicConstantRegressor(),
          decisiontree=DecisionTreeRegressor(),
          ridge_lambda=FooBarRegressor(;lambda=0.1),
          ensemble=EnsembleModel(FooBarRegressor(;lambda=1.)),
          ridge=FooBarRegressor(;lambda=0))

mystack = Stack(;metalearner=FooBarRegressor(),
                resampling=CV(;nfolds=3),
                models...)

mach = machine(mystack, X, y)
fit!(mach, verbosity=0)
The only caveat I see is if someone is actually using `fit(mystack, verbosity, X, y)` without a machine. That is definitely broken.
- I understand; I had not really thought about that as a problem, indeed. I think the question boils down to what needs to be inherited by submachines and what doesn't. Maybe `acceleration` is in the latter category, and the hyperparameter you suggest is indeed more modular (I need to take a proper look at your proposal, since I have seen you added a new field to the Machine, which was not what I originally thought was your intention). Another "in between" option, capitalizing on both ideas, would be to make `acceleration` a reserved name and give priority to the model hyperparameter if implemented.
Here is a vague illustration to show the idea:
function fit!(mach::Machine; acceleration=CPU1(), kwargs...)
    glb_node = glb(mach.args...)  # greatest lower bound node of the arguments
    # a model-level `acceleration` hyperparameter, if implemented, takes priority:
    acceleration = hasfield(typeof(mach.model), :acceleration) ?
        mach.model.acceleration : acceleration
    fit!(glb_node; acceleration=acceleration, kwargs...)
    fit_only!(mach; kwargs...)
end
One advantage of this PR's proposed implementation (and of the "in between" one) is that a user implementing a learning network does not need to worry about parallelism and can still benefit from it, which I think is the most common scenario.
> By the way, I have tried to run the `Stack` with an `EnsembleModel` and it seems to run just fine, unless I missed anything.
No, I mean the other way around: `EnsembleModel(model=mystack, ...)`.
> If someone wants to use the learning network API they will have to use the machine interface anyway and depend on MLJBase.
I think you mean that if a user wants to implement a composite model using the learning network API, then they need to be familiar with machines. That is true. But presently I can use a `Stack` model without knowing anything about machines. That is, a user of composite models can view machines as an invisible implementation detail, if they want to.
@olivierlabayle I'm sorry, but after sitting on this a few days, I'm still reluctant to provide, in MLJBase.jl, a facility that encourages composite model definitions that do not meet the MLJ model API. I think this could come back to bite us. I'm also not enthusiastic about fixing the breaking behaviour I've pointed out. Another reason for my reluctance is that the MLJ model composition API is now the most complex part of the MLJ code base. It is a worthy advancement on other multi-paradigm ML frameworks, but I feel it is reaching the limits of what can be sustainably maintained without a substantial re-design. Putting it another way, I'm afraid we (and I mean "me" predominantly) may be sliding into an "incremental hack".
I think we agree (but correct me if I am wrong) that for `acceleration` specifically, my suggestion at #762 is acceptable, with the modification you suggested there and implemented in the latest commit here. However, as the present PR also includes the new "learning graph" code, I wonder if you would consider a separate PR for that?
The other use-case you mention for introducing the learning graph code is propagating a fixed `C` (cache) value throughout the machines of a learning network. Now `cache` is a little different from `acceleration`, as it has never been considered a model-specific parameter, but only a `Machine` one. There "is no" data caching at the level of a model. I wonder if a combination of these two suggestions already made at #756 would suffice, for now:
@olivierlabayle suggested:

> Of course I guess there is the possibility of adding a hyperparameter to the composite model that can be transferred afterwards to the machine definitions. This would probably solve my personal issue (and would be a short-term solution); however, I don't think this is ideal in the long term, because any user-defined composite model will have to add this hyperparameter.

(This requires no action, apart from implementing, at our leisure, this for the MLJ composite models `Pipeline`, `Stack`, and so forth.)
@ablaom suggested:

> Another idea is to introduce a user-modifiable global constant to control the default value of `C` in the machine constructor. In fact, there is already such a method for `acceleration`, and we are introducing such an option for `check_scitype_level` at https://github.com/JuliaAI/MLJBase.jl/pull/753.
@ablaom I understand your concern. I did not appreciate there was such an emphasis on the `fit(model, verbosity, X, y)` API and rather thought about it as an implementation detail used by the public machine API described in the documentation. I think indeed we are reaching the limitations of the current internal design: the composition API draws us towards the use of machines, and the MLJModelInterface holds us back from it. In any case, since we haven't reached v1.0 yet, my humble advice would be to make any redesign choices before we are stuck with an unmaintainable codebase.
What follows is mostly what I would consider to be a hack, susceptible to much confusion from a user perspective.
That being said, I think indeed both `acceleration` and `cache` can be added as extra hyperparameters to composite models to tweak their `fit` implementation. I can try such a change in a separate PR. Can you elaborate on your alternative global variable suggestion? I am not sure I understood its link with #753.
> Can you elaborate on your alternative global variable suggestion? I am not sure I understood its link with https://github.com/JuliaAI/MLJBase.jl/pull/753?
- In `MLJBase.__init__` add a new global constant - something like `global DEFAULT_CACHE = Ref{Bool}(true)`.
- In src/machines.jl add a function `default_cache`, which you will export, with two methods: one, `default_cache()`, to return the value of `DEFAULT_CACHE[]`, and a second, `default_cache(some_bool)`, to change its state. You can mimic this.
- In the machine constructor, change the default value of the `cache` keyword to `default_cache() && caches_by_default(model)` (a sketch of the whole mechanism follows below).
Note here that the existing model trait `caches_by_default` is badly named. Better would be `never_caches_data`, and we use the negation. Its purpose is to ensure that machines associated with model wrappers (`TunedModel`, `IteratedModel`, etc.) and composite models don't cache data, as there is no need. (That said, I've just noticed this line, which I'm guessing is a bug; do you agree?) By the way, the wrappers just mentioned do have a `cache` hyper-parameter (as I'm suggesting `Stack` include), which is passed on to internally created machines.
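A minimal sketch of the suggested mechanism (the names come from the list above; the exact constructor integration is an assumption):

const DEFAULT_CACHE = Ref{Bool}(true)  # global default, to be set in MLJBase.__init__

default_cache() = DEFAULT_CACHE[]                       # getter
default_cache(value::Bool) = (DEFAULT_CACHE[] = value)  # setter

# In the machine constructor, the `cache` keyword would then default to:
#     default_cache() && caches_by_default(model)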
Follow-up PR: https://github.com/JuliaAI/MLJBase.jl/pull/767