MLJBase.jl
Learning graph in mach
This is a proposal to extend the learning network API. It should be almost non-breaking, and it enables caching and other fit options (like `acceleration`) to be passed to submachines.
Related issues: https://github.com/JuliaAI/MLJBase.jl/issues/756
The proposed routine is to declare a `learning_network(mach::Machine{<:Composite}, X, y; kwargs...)` method instead of the current `fit` for a new composite model. This is exemplified with the `Stack`, while the `Pipeline`s, for instance, are kept on the current API. The only caveat is that learning networks defined that way can only be used inside machines.
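For orientation, a toy definition under the proposed API might look something like this (a sketch only, not code from this PR; `MyComposite`, the returned node, and the keyword plumbing are assumptions for illustration):

using MLJBase

mutable struct MyComposite <: DeterministicComposite
    regressor
end

function MLJBase.learning_network(mach::Machine{<:MyComposite}, X, y; kwargs...)
    Xs = source(X)
    ys = source(y)
    # fit options received by `mach` (cache, acceleration, ...) are assumed
    # to be forwarded to the submachines here:
    submach = machine(mach.model.regressor, Xs, ys; kwargs...)
    return predict(submach, Xs)  # terminal node of the network
end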
If the approach looks encouraging to you and you want to push that further, my remaining TODOs are at least:
- [x] Take care of the update mechanism
- [ ] Maybe add a warning/info to encourage users to use the new API? --> This will break some logging tests that I will need to manage.
- [ ] Update/improve some docstrings: `return!`, ... ?
- [ ] I still don't understand why there is a data field in the cache of a composite model (even when setting `cache=false`), and more generally the anonymization routine, since the data is still available anyway. I think @ablaom, you mentioned we could remove that gymnastics in a previous chat. I would be happy to get some guidance and add this to the PR if this is not breaking either.
Please let me know if there are other things I haven't considered.
Codecov Report

Merging #759 (38a576f) into dev (95efa58) will increase coverage by 0.55%. The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              dev     #759      +/-  ##
==========================================
+ Coverage   85.41%   85.97%   +0.55%
==========================================
  Files          36       36
  Lines        3435     3486      +51
==========================================
+ Hits         2934     2997      +63
+ Misses        501      489      -12

Impacted Files | Coverage Δ
---|---
src/composition/learning_networks/machines.jl | 93.07% <100.00%> (+1.11%) :arrow_up:
src/composition/models/methods.jl | 96.66% <100.00%> (-3.34%) :arrow_down:
src/composition/models/stacking.jl | 94.48% <100.00%> (-0.04%) :arrow_down:
src/machines.jl | 87.50% <100.00%> (+0.83%) :arrow_up:
src/measures/measures.jl | 70.29% <0.00%> (-0.58%) :arrow_down:
src/init.jl | 100.00% <0.00%> (ø)
src/MLJBase.jl | 60.00% <0.00%> (ø)
src/show.jl | 38.27% <0.00%> (+1.23%) :arrow_up:
... and 1 more
@ablaom I think this is ready for review if you have some time to dedicate to it. As an additional demonstration of the benefits of this PR, you'll see on this branch a multithreaded Stack test. I haven't added that to the PR since I thought it was big enough already, but I'm happy to take your comments.
@olivierlabayle I'd like to acknowledge this proposal which I haven't reviewed yet. It sounds interesting, but it may take me a couple weeks to look properly at it.
You have raised a couple of related issues recently (caching, parallelism). Can you help me prioritise my review? Should I look at this one before the others? What are the implications of the current proposal for those others, if any? For example, is it possible that implementing this current proposal could mean the others can be resolved in a non-breaking way?
@ablaom Thank you for getting back to me.
Indeed there are many connected topics that I am currently trying to address. The two most important ones are memory management and parallelism. This PR is an attempt to partially address both with a small design change, while breaking the API only minimally (one corner case, explained below). This is definitely the place to start.
The general idea of the PR is to force the use of machines for a user to interact with `Composite` models (meaning the `fit`/`predict` API). More specifically, if one is ready to give up the use of `fit(composite_model, verbosity, X, y)` etc. and always resort to `fit!(machine(composite_model, X, y))`, then all fitting keyword arguments can be forwarded to the submachines, and the top machine's internals become accessible. This PR exemplifies this with the `Stack` and the forwarding of caching to submachines. In this branch, which is based on top of this PR, you can see the further benefit for parallelization with a forwarding of the `acceleration` parameter to composite models. You basically get multithreading of composite models for free (provided the underlying models are thread-safe, of course).
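Concretely, the intended usage pattern is something like the following (a sketch; `composite_model` stands for any composite instance, and the exact set of forwarded keywords is an assumption):

mach = machine(composite_model, X, y, cache=false)  # `cache` forwarded to every submachine
fit!(mach, acceleration=CPUThreads())               # `acceleration` forwarded likewise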
If you are happy with the PR, the last thing I would like to add, for now, is the removal of the anonymization process, since you said it may be avoidable. I would be happy to get your guidance on that and on the important parts of the code, since I didn't understand the use case in the first place.
I hope this helps.
@olivierlabayle Thanks for looking into this and for your patience. I have now reviewed this PR in some detail.
I have to say I really like the idea in this PR, but there are some drawbacks:
1. Composite models don't implement the full MLJ model interface (no `fit(::Model, ...)` method). This has two important consequences. First, it's breaking: `EnsembleModel(model=...)` (MLJEnsembles.jl) and `BinaryThresholdPredictor(model=...)` (MLJModels.jl) call `fit` on the atomic model, and so won't work with, say, `Stack`, as re-implemented here. Perhaps there are other places. In principle, these could be addressed with some disruption. Secondly, this is a seismic shift conceptually. Currently, machines are a higher-level abstraction for user interaction and composite model building, but there is always the possibility of third-party packages buying into MLJ models at the lower level, which I think is nice. I believe GeoStats.jl actually does this. We often say "composite models in MLJ behave like any other model". Well, they wouldn't after this PR.

2. Rather than `acceleration` being a hyper-parameter of each composite model, which I would find the most natural, it is something apart from the model, and has a global nature: if I call `fit!(composite_machine, acceleration=...)`, then that particular mode of acceleration is applied to training of the learning network for that composite, but also to any nested composite model. This might be convenient behaviour most of the time, but it is less powerful. For example, at the top level you may want distributed parallelism, while at some lower level you want multithreading or no parallelism (see the sketch after this list). Also, I'm not sure nested Distributed parallelism "just works". Acceleration is already a hyper-parameter of some ordinary models (eg, MLJFlux models) and of the model wrappers `TunedModel` and `EnsembleModel`. It seems incongruent to have a different interface point for `Composite` models.
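For comparison, this is how per-level control already looks where `acceleration` is a model hyper-parameter (a sketch with illustrative model choices):

using MLJ

Tree = @load DecisionTreeRegressor pkg=DecisionTree verbosity=0
forest = EnsembleModel(model=Tree(), n=100,
                       acceleration=CPUThreads())      # multithreading at the inner level
tuned_forest = TunedModel(model=forest,
                          range=range(forest, :(model.max_depth), lower=1, upper=10),
                          acceleration=CPUProcesses()) # distributed parallelism at the outer level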
I've had a think about this, and have an alternative suggestion for dealing with `acceleration` which does not have these drawbacks. I realize this does not address #756, but could you take a look at #762 and let me know what you think?
@ablaom Thanks for taking the time to look into this!
- That's definitely true. My original idea was that composition with learning networks is about composing machines, not models. If someone wants to use the learning network API they will have to use the machine interface anyway and depend on MLJBase, so enforcing machines for composition may not be so bad. I appreciate, however, that this is a conceptual change, even though I tried to make it as minimal as possible. By the way, I have tried to run the `Stack` with an `EnsembleModel` and it seems to run just fine, unless I missed anything:
X, y = make_regression(500, 5)

models = (constant=DeterministicConstantRegressor(),
          decisiontree=DecisionTreeRegressor(),
          ridge_lambda=FooBarRegressor(;lambda=0.1),
          ensemble=EnsembleModel(FooBarRegressor(;lambda=1.)),
          ridge=FooBarRegressor(;lambda=0))

mystack = Stack(;metalearner=FooBarRegressor(),
                resampling=CV(;nfolds=3),
                models...)

mach = machine(mystack, X, y)
fit!(mach, verbosity=0)
The only caveat I see is if someone is actually using `fit(mystack, verbosity, X, y)` without a machine. That is definitely broken.
- I understand; I had not really thought about that as a problem, indeed. I think the question boils down to what needs to be inherited by submachines and what doesn't. Maybe `acceleration` is in the latter category, and the hyperparameter you suggest is indeed more modular (I need to take a proper look at your proposal, since I have seen you added a new field to the Machine, which was not what I originally thought was your intention). Another "in between" option, capitalizing on both ideas, would be to make `acceleration` a reserved name and give priority to the model hyperparameter if implemented.
Here is a vague illustration to show the idea:
function fit!(mach::Machine; acceleration=CPU1(), kwargs...)
    glb_node = glb(mach.args...)  # greatest lower bound node of the arguments
    # a model-level `acceleration` hyperparameter, if implemented, takes priority:
    acceleration = hasfield(typeof(mach.model), :acceleration) ?
        mach.model.acceleration : acceleration
    fit!(glb_node; acceleration=acceleration, kwargs...)
    fit_only!(mach; kwargs...)
end
One advantage of this PR's proposed implementation (and of the "in between" one) is that a user implementing a learning network does not need to worry about parallelism and can still benefit from it, which I think is the most common scenario.
> By the way, I have tried to run the `Stack` with an `EnsembleModel` and it seems to run just fine, unless I missed anything.
No, I mean the other way around: `EnsembleModel(model=mystack, ...)`.
> If someone wants to use the learning network API they will have to use the machine interface anyway and depend on MLJBase.
I think you mean that if a user wants to implement a composite model using the learning network API, then they need to be familiar with machines. That is true. But presently I can use a `Stack` model without knowing anything about machines. That is, a user of composite models can view machines as an invisible implementation detail, if they want to.
@olivierlabayle I'm sorry, but after sitting on this a few days, I'm still reluctant to provide, in MLJBase.jl, a facility that encourages composite model definitions that do not meet the MLJ model API. I think this could come back to bite us. I'm also not enthusiastic about fixing the breaking behaviour I've pointed out. Another reason for my reluctance is that the MLJ model composition API is now the most complex part of the MLJ code base. It is a worthy advancement on other multi-paradigm ML frameworks, but I feel it is reaching the limits of what can be sustainably maintained without a substantial re-design. Putting it another way, I'm afraid we (and I mean "me" predominantly) may be sliding into an "incremental hack".
I think we agree (but correct me if I am wrong) that for `acceleration` specifically, my suggestion at #762 is acceptable, with the modification you suggested there and implemented in the latest commit here. However, as the present PR also includes the new "learning graph" code, I wonder if you would consider a separate PR for that?
The other use-case you mention for introducing the learning graph code is propagating a fixed `C` (cache) value throughout the machines of a learning network. Now `cache` is a little different from `acceleration`, as it has never been considered a model-specific parameter, but only a `Machine` one. There "is no" data caching at the level of a model. I wonder if a combination of these two suggestions already made at #756 would suffice, for now:
@olivierlabayle suggested:

> Of course I guess there is the possibility of adding a hyperparameter to the composite model that can be transferred afterwards to the machine definitions. This would probably solve my personal issue (and would be a short-term solution); however, I don't think this is ideal in the long term, because any user-defined composite model will have to add this hyperparameter.

(This requires no action, apart from implementing, at our leisure, this for the MLJ composite models `Pipeline`, `Stack`, and so forth.)
@ablaom suggested:

> Another idea is to introduce a user-modifiable global constant to control the default value of `C` in the machine constructor. In fact, there is already such a method for `acceleration`, and we are introducing such an option for `check_scitype_level` at https://github.com/JuliaAI/MLJBase.jl/pull/753.
@ablaom I understand your concern. I did not appreciate there was such an emphasis on the `fit(model, verbosity, X, y)` API and rather thought about it as an implementation detail used by the public machine API described in the documentation. I think indeed we are reaching the limitations of the current internal design: the composition API draws us towards the use of machines, and the MLJModelInterface holds us back from it. In any case, since we haven't reached v1.0 yet, my humble advice would be to make any redesign choices before we are stuck with an unmaintainable codebase.
What follows is mostly what I would consider to be a hack, susceptible to much confusion from a user perspective.
That being said, I think indeed both `acceleration` and `cache` can be added as extra hyperparameters to composite models to tweak their `fit` implementation. I can try such a change in a separate PR. Can you elaborate on your alternative global variable suggestion? I am not sure I understood its link with #753.
> Can you elaborate on your alternative global variable suggestion? I am not sure I understood its link with https://github.com/JuliaAI/MLJBase.jl/pull/753?
- In `MLJBase.__init__` add a new global constant - something like `global DEFAULT_CACHE = Ref{Bool}(true)`.
- In src/machines.jl add a function `default_cache`, which you will export, with two methods: one, `default_cache()`, to return the value of `DEFAULT_CACHE[]`, and a second, `default_cache(some_bool)`, to change its state. You can mimic this.
- In the machine constructor, change the default value of the `cache` keyword to `default_cache() && caches_by_default(model)` (a sketch of the whole mechanism follows below).
Note here that the existing model trait `caches_by_default` is badly named. Better would be `never_caches_data`, and we use the negation. Its purpose is to ensure that machines associated with model wrappers (`TunedModel`, `IteratedModel`, etc.) and composite models don't cache data, as there is no need. (That said, I've just noticed this line, which I'm guessing is a bug; do you agree?) By the way, the wrappers just mentioned do have a `cache` hyper-parameter (as I'm suggesting `Stack` include), which is passed on to internally created machines.
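A minimal sketch of the suggested mechanism (the names come from the list above; the exact constructor integration is an assumption):

const DEFAULT_CACHE = Ref{Bool}(true)  # global default, to be set in MLJBase.__init__

default_cache() = DEFAULT_CACHE[]                       # getter
default_cache(value::Bool) = (DEFAULT_CACHE[] = value)  # setter

# In the machine constructor, the `cache` keyword would then default to:
#     default_cache() && caches_by_default(model)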
Follow-up PR: https://github.com/JuliaAI/MLJBase.jl/pull/767