
Feature request - Refit model using X most important variables

Open vidarsumo opened this issue 4 years ago • 4 comments

I just saw this video from the CatBoost team about using variable importance to eliminate noisy variables (those that are not important) and thereby reduce the error.

That got me thinking about whether this could be implemented in the modeltime package: when you refit a model, you could pass the function a number, say 20, to refit the model using only the top 20 variables according to the variable importance calculations.
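To make the request concrete, here is a minimal, language-agnostic sketch of the idea in plain Python. Nothing in it is modeltime, CatBoost, or XGBoost API: `refit_top_k`, `top_k_features`, `toy_fit`, and `toy_importance` are hypothetical names, the data are made up, and absolute correlation with the target stands in for a real variable importance measure.

```python
from math import sqrt
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, float]


def top_k_features(importance: Dict[str, float], k: int) -> List[str]:
    """Return the k feature names with the highest importance, best first."""
    ranked = sorted(importance, key=lambda name: importance[name], reverse=True)
    return ranked[:k]


def keep_columns(rows: Sequence[Row], keep: Sequence[str]) -> List[Row]:
    """Drop every column that is not listed in `keep`."""
    return [{name: row[name] for name in keep} for row in rows]


def refit_top_k(
    fit: Callable[[Sequence[Row], Sequence[float]], object],
    importance_of: Callable[[object], Dict[str, float]],
    rows: Sequence[Row],
    target: Sequence[float],
    k: int,
) -> Tuple[object, List[str]]:
    """Fit on all features, rank them, then refit on the top k only."""
    full_model = fit(rows, target)
    keep = top_k_features(importance_of(full_model), k)
    reduced_model = fit(keep_columns(rows, keep), target)
    return reduced_model, keep


# --- Toy stand-ins (assumptions for illustration only) ---------------------

def abs_correlation(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Absolute Pearson correlation; 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / sqrt(var_x * var_y)


def toy_fit(rows: Sequence[Row], target: Sequence[float]) -> Dict[str, float]:
    """A fake 'model': just the per-feature score against the target."""
    return {name: abs_correlation([r[name] for r in rows], target) for name in rows[0]}


def toy_importance(model: Dict[str, float]) -> Dict[str, float]:
    """For the fake model, the stored scores are the importances."""
    return model


# Made-up data: two informative columns and one noisy column.
columns = {
    "trend": [1, 2, 3, 4, 5],
    "noise": [3, 1, 4, 1, 5],
    "lagged_sales": [1, 1, 2, 3, 3],
}
target = [2, 4, 6, 8, 10]
rows = [dict(zip(columns, values)) for values in zip(*columns.values())]

reduced_model, kept = refit_top_k(toy_fit, toy_importance, rows, target, k=2)
print(kept)
```

In a real implementation, `fit` and `importance_of` would be whatever the underlying engine provides; the point of the sketch is only the fit, rank, subset, refit sequence.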

Does this make sense?

vidarsumo avatar Dec 01 '21 18:12 vidarsumo

We'd need to develop a feature importance capability first, which I do see value in for model explainability alone, especially since XGBoost can provide this and is already a dependency of Modeltime.

So let me think about this.

mdancho84 avatar Dec 03 '21 12:12 mdancho84

+1 Would be nice for the smooth extension too.

Steviey avatar Jan 21 '22 22:01 Steviey

++1

spsanderson avatar Jan 21 '22 23:01 spsanderson

Hi @vidarsumo ,

You can also try https://github.com/stevenpawley/colino for feature selection. I have been using it in my recipes with quite good results. In fact, the functions with the step_select_ prefix let you compute different scores for your variables and, for example, pick the best score within a group. Something I usually do is apply several lags/differences to each variable and keep the one with the best score for each "original" variable. (For example, I apply lags to the unemployment rate and see which one has the best score, so that only one of those variables is kept.) Then I keep, say, the two or three best ones and try different combinations according to the different scores in a workflowset.
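The "best lag per original variable" selection described above can be sketched roughly as follows. This is plain Python rather than colino or recipes code, the helper names `best_per_group` and `top_winners` are hypothetical, the scores are made up, and the sketch assumes each feature is named `<original>_<transformation>`.

```python
from typing import Dict, List


def best_per_group(scores: Dict[str, float], group_of: Dict[str, str]) -> Dict[str, str]:
    """For each original variable, keep the derived feature with the top score."""
    best: Dict[str, str] = {}
    for feature, score in scores.items():
        group = group_of[feature]
        if group not in best or score > scores[best[group]]:
            best[group] = feature
    return best


def top_winners(scores: Dict[str, float], best: Dict[str, str], n: int) -> List[str]:
    """Rank the per-group winners by score and keep the n best."""
    winners = sorted(best.values(), key=lambda name: scores[name], reverse=True)
    return winners[:n]


# Made-up scores for lagged/differenced versions of three original variables.
scores = {
    "unemployment_lag1": 0.12,
    "unemployment_lag3": 0.31,
    "unemployment_lag6": 0.07,
    "cpi_lag1": 0.22,
    "cpi_lag2": 0.18,
    "rate_diff1": 0.05,
}

# Assumed naming convention: everything before the last "_" is the original variable.
group_of = {name: name.rsplit("_", 1)[0] for name in scores}

best = best_per_group(scores, group_of)
selected = top_winners(scores, best, n=2)
print(best)
print(selected)
```

Different scoring methods would give different `scores` dictionaries, and each resulting `selected` set could then become one candidate preprocessing variant to compare.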

I think this type of thing is better placed in separate packages, because each can have a well-defined scope and they can then be combined in the pre-processing steps, for example through recipes.

It seems more important to me to develop a capability like the one described in issue https://github.com/business-science/modeltime/issues/108

Regards,

AlbertoAlmuinha avatar Aug 07 '22 19:08 AlbertoAlmuinha