
[ENH/BUG] Fixes errors when using the recursive strategy in make_reduction when forecasting panel data

Open danbartl opened this issue 1 year ago • 12 comments

Fixes https://github.com/alan-turing-institute/sktime/issues/3316. Introduces a new parameter that lets users specify whether models should be fit at the level of individual instances ("local") or at a global level (a single model across instances).

Open question: Should the default be local or global?

Status: fully functional PoC

Not yet done: Docs and plausibility checks

danbartl avatar Sep 20 '22 16:09 danbartl
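A minimal usage sketch of the proposed parameter, assuming it ends up named pooling with values "local" and "global" as discussed below; the exact signature may differ from the merged version:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sktime.forecasting.compose import make_reduction

# toy panel data: 2 instances x 10 monthly time points
idx = pd.MultiIndex.from_product(
    [["A", "B"], pd.period_range("2020-01", periods=10, freq="M")],
    names=["instance", "time"],
)
y = pd.DataFrame({"y": np.arange(20, dtype=float)}, index=idx)

# pooling="global" fits a single model across all instances;
# pooling="local" fits one model per instance
forecaster = make_reduction(
    RandomForestRegressor(),
    strategy="recursive",
    window_length=3,
    pooling="global",
)
forecaster.fit(y, fh=[1, 2])
y_pred = forecaster.predict()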


I guess the default should be the better-performing option, which might be the global one?

aiwalter avatar Sep 20 '22 17:09 aiwalter

~~@aiwalter changed it so pooling is global by default~~ edit: reworked, because defaulting to global (which is new functionality) instead of local would have been a breaking change

danbartl avatar Sep 20 '22 18:09 danbartl

Yes, I agree with @danbartl - right now we need to keep the default consistent with the status quo. We could go through a deprecation cycle to make the default global, though (perhaps not in this PR).

fkiraly avatar Sep 21 '22 21:09 fkiraly

Can we have this in the release?

aiwalter avatar Sep 21 '22 22:09 aiwalter

@fkiraly @aiwalter

Included documentation and fixed all tests; from my side this is ready to merge.

danbartl avatar Sep 22 '22 09:09 danbartl

@aiwalter what is the general logic: should the same parameters be explained in all functions and classes where they appear, or only in the uppermost / lowest class and function?

danbartl avatar Sep 22 '22 11:09 danbartl

Ideally in every class and function, especially in the public classes and functions :) You can just copy & paste.

aiwalter avatar Sep 22 '22 11:09 aiwalter

Also, which type is transformers expecting, or is it just one transformer (which can of course be a TransformerPipeline)?

aiwalter avatar Sep 22 '22 11:09 aiwalter

@aiwalter I will add the documentation.

Regarding your question: currently only WindowSummarizer is supported, because it is specifically tailored to the efficient en-bloc approach of having a transformer within make_reduction. Extending this to other transformers would require adjustments that I want to avoid before we have a clearly defined vision of the future rework of reducers being prepared by @fkiraly.

danbartl avatar Sep 22 '22 13:09 danbartl
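A sketch of what passing a WindowSummarizer could look like, assuming the transformers argument accepts a list as described above; parameter names and defaults may still change before merge:

from sklearn.ensemble import RandomForestRegressor
from sktime.forecasting.compose import make_reduction
from sktime.transformations.series.summarize import WindowSummarizer

# lags 1-3 plus a rolling mean over a window of length 3 starting at lag 1
summarizer = WindowSummarizer(lag_feature={"lag": [1, 2, 3], "mean": [[1, 3]]})

forecaster = make_reduction(
    RandomForestRegressor(),
    strategy="recursive",
    transformers=[summarizer],
    pooling="global",
    window_length=None,  # lag features come from the WindowSummarizer instead
)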

OK, but then should we not call the param transformer, in the singular? And it has to be of type BaseTransformer.

aiwalter avatar Sep 22 '22 13:09 aiwalter

Hi, we can pass a list of transformers and that works (a list of WindowSummarizers will work; it just does not make much sense currently, but it will also not produce a wrong result). So I think we can keep the list argument for future functionality (i.e., when more transformers are supported).

danbartl avatar Sep 22 '22 13:09 danbartl

I tried this locally with an example and got an error:

from sktime.datatypes import get_examples
from sktime.forecasting.model_selection import ExpandingWindowSplitter
from sktime.forecasting.compose import make_reduction
from sklearn.ensemble import RandomForestRegressor

y = get_examples(mtype="pd_multiindex_hier")[1]

forecaster = make_reduction(
    RandomForestRegressor(),
    pooling="global",
    window_length=None,
)
forecaster.fit(y)

Error I get is:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_32628\2734763291.py in <module>
     18     window_length=None,
     19 )
---> 20 forecaster.fit(y)
     21 # fh=[1,2,3]
     22 # cv = ExpandingWindowSplitter(

c:\Users\mf-wa\Desktop\sktime\sktime\sktime\forecasting\base\_base.py in fit(self, y, X, fh)
    301         # we call the ordinary _fit if no looping/vectorization needed
    302         if not vectorization_needed:
--> 303             self._fit(y=y_inner, X=X_inner, fh=fh)
    304         else:
    305             # otherwise we call the vectorized version of fit

c:\Users\mf-wa\Desktop\sktime\sktime\sktime\forecasting\compose\_reduce.py in _fit(self, y, X, fh)
    534             kwargs = {
    535                 "lag_feature": {
--> 536                     "lag": list(range(1, self.window_length + 1)),
    537                 }
    538             }

TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

@danbartl can you please check whether I am doing something wrong?

aiwalter avatar Sep 28 '22 09:09 aiwalter
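The traceback suggests the failure comes from building the default lag range with window_length=None; a possible workaround (a guess from the traceback, not a fix confirmed in this thread) is to pass an explicit integer window_length, continuing the snippet above:

forecaster = make_reduction(
    RandomForestRegressor(),
    pooling="global",
    window_length=3,  # an explicit integer avoids range(1, None + 1)
)
forecaster.fit(y)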

Imho, we should take into account the learning from my prototype: reimplementing lagging functionality is not a good idea. I've used Lag inside; separating concerns and complexities is the strategy.

fkiraly avatar Sep 29 '22 20:09 fkiraly
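A minimal sketch of the separation of concerns described above, using sktime's existing Lag transformer instead of reimplementing lagging inside the reducer (illustrative only, not code from this PR):

import numpy as np
import pandas as pd
from sktime.transformations.series.lag import Lag

y = pd.DataFrame(
    {"y": np.arange(12, dtype=float)},
    index=pd.period_range("2020-01", periods=12, freq="M"),
)

# one lagged column per requested lag; Lag handles the index bookkeeping
y_lagged = Lag(lags=[1, 2, 3]).fit_transform(y)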

Hi, I have implemented this to address some shortcomings that existed in the previous classes.

Until your rework is completely finished, wouldn't it make sense to deploy these fixes? Currently the entire functionality is disabled because of your fix to bug 3316.

danbartl avatar Sep 29 '22 20:09 danbartl

Yes, makes sense, @danbartl - let's fix things!

May I ask:

  1. could we separate out the changes to the notebook from the changes to the estimator? I.e., in a separate PR?
  2. can you explain why we are changing a datatype check here? That does not seem right (the _check file)
  3. I don't think we should change the valid index types and the checks in validation.series, this seems more like a hack and is impacting tests and contracts etc. Could you explain what you are doing there and why?
  4. re pooling, would it not make sense to not do this in the _sliding_window_transform, and always do global pooling? Then control global/local in the class? Seems redundant, because both places do the same.
  5. why do we need the change in evaluate? Should this also be a separate PR, and do we need to test it?
  6. could we add a new test instead of changing the existing test?

@fkiraly

Regarding 1: the changes depend on each other, so I would like to keep them here. Also, the changes are really small, since only pooling="global" is added.
Regarding 2, 3, and 5: moved to new PR https://github.com/sktime/sktime/pull/3509/, will explain there (currently not described).
Regarding 6: no longer required (this was left over from when pooling="global" was the default, which as explained was breaking).

Regarding 4: I will comment later.

danbartl avatar Sep 30 '22 07:09 danbartl

> re pooling, would it not make sense to not do this in the _sliding_window_transform, and always do global pooling? Then control global/local in the class? Seems redundant, because both places do the same.

@fkiraly

I do not understand this point. Currently the approach is the following:

  1. Is it a multi-index and pooling="global"? => Use pandas notation to be able to use the instance information. Also apply transformers, when given.
  2. Is it a multi-index and pooling="local"? => Vectorize and use a numpy array. Transformers are not supported.

So I do not understand how we could avoid specifying pooling at class and function level, since due to the multi-index there is a fundamentally different approach for global models.

danbartl avatar Oct 05 '22 06:10 danbartl
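For context, a toy illustration of the two approaches described above; the groupby/loop below is illustrative only and not the reducer's internal code:

import numpy as np
import pandas as pd

# toy panel with 2 instances
idx = pd.MultiIndex.from_product(
    [["A", "B"], range(5)], names=["instance", "time"]
)
y = pd.DataFrame({"y": np.arange(10, dtype=float)}, index=idx)

# "global" style: lag features built with pandas operations that respect the
# instance level, then a single model is fit on the pooled rows
lag1_global = y.groupby(level="instance")["y"].shift(1)

# "local" style: split by instance and fit one model per instance
per_instance = {
    name: group.droplevel("instance") for name, group in y.groupby(level="instance")
}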

> re pooling, would it not make sense to not do this in the _sliding_window_transform, and always do global pooling? Then control global/local in the class? Seems redundant, because both places do the same.
>
> I do not understand this point. Currently the approach is the following:
>
>   1. Is it a multi-index and pooling="global"? => Use pandas notation to be able to use the instance information. Also apply transformers, when given.
>   2. Is it a multi-index and pooling="local"? => Vectorize and use a numpy array. Transformers are not supported.
>
> So I do not understand how we could avoid specifying pooling at class and function level, since due to the multi-index there is a fundamentally different approach for global models.

The point is, sktime transformers already vectorize automatically under certain circumstances, so I would control vectorization at the tag level (like in my prototype) rather than inside the transformer; the current approach duplicates the vectorization functionality and also prevents the use of transformers.

fkiraly avatar Oct 06 '22 17:10 fkiraly
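A small sketch of the automatic vectorization mentioned above: a transformer defined for a single series, applied to panel data, is broadcast over instances by the sktime base class (illustrative; Detrender is just one example of a series-level transformer):

import numpy as np
import pandas as pd
from sktime.transformations.series.detrend import Detrender

# toy panel: 2 instances x 6 monthly time points
idx = pd.MultiIndex.from_product(
    [["A", "B"], pd.period_range("2020-01", periods=6, freq="M")],
    names=["instance", "time"],
)
y = pd.DataFrame({"y": np.arange(12, dtype=float)}, index=idx)

# Detrender is defined per series; on multi-index input the base class
# loops it over instances automatically ("vectorization" in sktime terms)
y_detrended = Detrender().fit_transform(y)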

@fkiraly

  • could you kindly explain why there are changes in the X/y pickle files and what these are? => Frequency got lost in pandas 1.5.0

  • reducer changes look mostly ok for me, although I would prefer that it is tested. E.g., please add at least one pooling="global" case to get_test_params. => As discussed, it is tested in test_reduce_global (and now also in test_reduce)

  • it looks like the (untested) "global" case violates the sklearn interface assumption and would fail in the tests: estimator parameters should never be changed after init, but transformers is changed in the None case, for instance. Can you explain the change in _get_shifted_window from +1 to +2? Why does this change at all, it does not seem to relate to the content change. => That was a bug that was discovered when I compared model outputs to hardcoded results (as is done in test_reduce.py). It is fixed now, and also tested for.

danbartl avatar Oct 11 '22 16:10 danbartl

I'll leave this open for now in case someone else still wants to review this (following the discussion last week), otherwise I'll move it in when it is time for the next release.

fkiraly avatar Oct 24 '22 16:10 fkiraly