sktime
[ENH/BUG] Fixes errors when using the recursive strategy in `make_reduction` when forecasting panel data
Fixes https://github.com/alan-turing-institute/sktime/issues/3316. Introduces a new parameter that lets users specify whether models should be fit at the level of individual instances ("local") or at a global level (a single model across instances).
Open question: Should default be local or global?
Status: fully functional PoC
Not yet done: Docs and plausibility checks
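To illustrate the local/global distinction schematically (a plain-Python sketch, not the sktime API; the toy `fit_mean` "model" and the series names are made up for illustration):

```python
# Two panel instances (series), as plain lists for illustration.
panel = {
    "store_A": [10, 12, 11, 13],
    "store_B": [100, 98, 102, 101],
}

def fit_mean(values):
    """Toy 'model': just remember the mean of the training values."""
    return sum(values) / len(values)

# pooling="local": fit one model per instance
local_models = {name: fit_mean(series) for name, series in panel.items()}

# pooling="global": fit a single model across all instances
pooled = [v for series in panel.values() for v in series]
global_model = fit_mean(pooled)

print(local_models)   # one fitted value per series
print(global_model)   # one fitted value overall
```

In the real reducer the "model" is an sklearn regressor fit on lagged windows, but the pooling question is the same: one estimator per series, or one estimator over the concatenated training rows.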
I guess the default should be the better-performing option, which might be the global one?
~~@aiwalter changed it so pooling is global by default~~ edit: reworked, because defaulting to global (which is new functionality) instead of local was a breaking change
yes, I agree with @danbartl - right now we need to keep the default consistent with the status quo. We could run a deprecation cycle to make the default global, though (perhaps not in this PR)
Can we have this in the release?
@fkiraly @aiwalter
Included documentation and fixed all tests; from my side this is ready to merge.
@aiwalter what is the general policy: should the same parameters be documented in all functions and classes where they appear, or only in the uppermost / lowest class and function?
Ideally in every class and function, especially in the public classes and functions :) You can just copy & paste.
Also, which type is `transformers` expecting? Or is it just one transformer (which can of course be a TransformerPipeline)?
@aiwalter I will add the documentation.
Regarding your question: currently only WindowSummarizer is supported, because it is specifically tailored to the efficient en-bloc approach of having a transformer within make_reduction. Extending this to other transformers would require adjustments that I want to avoid until we have a clearly defined vision of the future rework of reducers being prepared by @fkiraly.
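For readers unfamiliar with WindowSummarizer: it derives lagged features from the target series. A minimal plain-Python sketch of the idea (not the actual implementation; the `lags` argument loosely mirrors the `{"lag_feature": {"lag": [...]}}` spec style visible in the `_reduce.py` traceback below):

```python
def lag_features(series, lags):
    """Return, for each time point, the values at the given lags.

    Positions where a lag would reach before the start of the series
    are filled with None (a real implementation would use NaN).
    """
    return [
        [series[t - lag] if t - lag >= 0 else None for lag in lags]
        for t in range(len(series))
    ]

y = [5, 7, 9, 11]
print(lag_features(y, lags=[1, 2]))
```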
OK, but then should we not call the param `transformer`, as singular? And it has to be of type `BaseTransformer`.
Hi, we can pass a list of transformers and that works (a list of WindowSummarizers will work; it just does not make sense currently, but it will also not produce a wrong result). So I think we can keep the list argument for future functionality (i.e. when more transformers are supported).
I tried this locally with an example and got an error:
```python
from sktime.datatypes import get_examples
from sktime.forecasting.model_selection import ExpandingWindowSplitter
from sktime.forecasting.compose import make_reduction
from sklearn.ensemble import RandomForestRegressor

y = get_examples(mtype="pd_multiindex_hier")[1]
forecaster = make_reduction(
    RandomForestRegressor(),
    pooling="global",
    window_length=None,
)
forecaster.fit(y)
```
The error I get is:
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_32628\2734763291.py in <module>
     18     window_length=None,
     19 )
---> 20 forecaster.fit(y)
     21 # fh=[1,2,3]
     22 # cv = ExpandingWindowSplitter(

c:\Users\mf-wa\Desktop\sktime\sktime\sktime\forecasting\base\_base.py in fit(self, y, X, fh)
    301         # we call the ordinary _fit if no looping/vectorization needed
    302         if not vectorization_needed:
--> 303             self._fit(y=y_inner, X=X_inner, fh=fh)
    304         else:
    305             # otherwise we call the vectorized version of fit

c:\Users\mf-wa\Desktop\sktime\sktime\sktime\forecasting\compose\_reduce.py in _fit(self, y, X, fh)
    534         kwargs = {
    535             "lag_feature": {
    536                 "lag": list(range(1, self.window_length + 1)),
    537             }
    538         }

TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
```
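The traceback boils down to `window_length=None` reaching an arithmetic expression. A minimal reproduction of the failing expression, plus one possible guard (the fallback value of 10 here is hypothetical, for illustration only, and not what sktime does):

```python
window_length = None  # as passed to make_reduction above

try:
    # the failing expression from _reduce.py: None + 1 raises TypeError
    lags = list(range(1, window_length + 1))
except TypeError as err:
    print(err)  # unsupported operand type(s) for +: 'NoneType' and 'int'
    # one possible guard: fall back to a default window length
    window_length = 10  # hypothetical default, for illustration only
    lags = list(range(1, window_length + 1))

print(lags)
```

The actual fix would need to decide what `window_length=None` should mean in the global-pooling branch rather than silently substituting a value.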
@danbartl can you please check if I am doing something wrong?
Imho, we should take into account the lesson from my prototype that reimplementing lagging functionality is not a good idea. I've used `Lag` inside; separating concerns and complexities is the strategy.
Hi, I have implemented this to address some shortcomings that existed in the previous classes. Until your rework is completely finished, wouldn't it make sense to deploy these fixes? Currently the entire functionality is disabled because of your fix to bug 3316.
Yes, makes sense, @danbartl - let's fix things!
May I ask:

1. Could we separate out the changes to the notebook from the changes to the estimator, i.e., into a separate PR?
2. Can you explain why we are changing a datatype check here? That does not seem right (the `_check` file).
3. I don't think we should change the valid index types and the checks in `validation.series`; this seems more like a hack and is impacting tests, contracts, etc. Could you explain what you are doing there and why?
4. Re pooling, would it not make sense to not do this in `_sliding_window_transform`, and always do global pooling? Then control global/local in the class? It seems redundant, because both places do the same.
5. Why do we need the change in `evaluate`? Should this also be a separate PR, and do we need to test it?
6. Could we add a new test instead of changing the existing test?
@fkiraly
Regarding 1: the changes depend on each other, so I would like to keep that here. Also, the changes are really small, since only pooling="global" is added.
Regarding 2, 3, and 5: moved to the new PR https://github.com/sktime/sktime/pull/3509/, will explain there (currently not described).
Regarding 6: no longer required (it was left over from when pooling="global" was the default, which as explained was breaking).
Regarding 4: I will comment later.
> re pooling, would it not make sense to not do this in the `_sliding_window_transform`, and always do global pooling? Then control global/local in the class? Seems redundant, because both places do the same.
@fkiraly
I do not understand this point. Currently the approach is the following:
- Is it a multi-index and pooling="global"? => Use pandas notation to be able to use the instance information. Also apply transformers, when given.
- Is it a multi-index and pooling="local"? => Vectorize and use numpy arrays. Transformers are not supported.

So I do not understand how we could avoid specifying pooling at class and function level, since due to the multi-index there is a fundamentally different approach for global models.
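Schematically, the two branches build their training tables differently; a plain-Python sketch of the idea (this is not the actual `_sliding_window_transform` code; the window length and data are made up):

```python
def sliding_windows(series, window_length):
    """Build (lagged window, target) training pairs for one series."""
    return [
        (series[i : i + window_length], series[i + window_length])
        for i in range(len(series) - window_length)
    ]

panel = {"A": [1, 2, 3, 4, 5], "B": [10, 20, 30, 40, 50]}

# pooling="local": keep one training table per instance,
# and fit/vectorize one regressor over each of them
local_tables = {k: sliding_windows(v, 2) for k, v in panel.items()}

# pooling="global": pool all rows into a single training table,
# so one regressor sees windows from every instance;
# windows never cross instance boundaries
global_table = [row for v in panel.values() for row in sliding_windows(v, 2)]

print(len(local_tables["A"]), len(global_table))
```

The key constraint is visible here: pooling changes the shape of the training data itself, which is why it surfaces both in the transform and in the class.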
The point is, sktime transformers already automatically vectorize under certain circumstances, so I would control vectorization at the tag level (as in my prototype) rather than inside the transformer; this duplicates the vectorization functionality, and also prevents the use of transformers.
@fkiraly
- Could you kindly explain why there are changes in the X/y pickle files and what these are? => Frequency got lost in pandas 1.5.0.
- The reducer changes look mostly OK to me, although I would prefer that they are tested. E.g., please add at least one pooling="global" case to `get_test_params`. => As discussed, it is tested in `test_reduce_global` (and now also in `test_reduce`).
- It looks like the (untested) "global" case violates the sklearn interface assumption and would fail in the tests: estimator parameters should never be changed after init, but `transformers` is changed in the None case, for instance. Can you explain the change in `_get_shifted_window` from +1 to +2? Why does this change at all? It does not seem to relate to the content change. => That was a bug that was discovered when I compared model outputs to hardcoded results (as is done in test_reduce.py). It is fixed now, and also tested for.
I'll leave this open for now in case someone else still wants to review it (following the discussion last week); otherwise I'll merge it in when it is time for the next release.