Predict the magnitude of #TidyTuesday tornadoes with effect encoding and xgboost | Julia Silge

Open utterances-bot opened this issue 1 year ago • 7 comments

A data science blog

https://juliasilge.com/blog/tornadoes/

utterances-bot avatar Jun 15 '23 13:06 utterances-bot

Hello Julia,

Thanks for another informative post. Your method of handling high-cardinality categorical variables through likelihood encoding was interesting.

I noticed that the 'st' variable is a top contributor to the model. However, the encoding adds a degree of abstraction. I am trying to interpret the effects of specific states on tornado magnitude. Can we somehow map the encoded 'st' values back to the original states for a more intuitive interpretation? Would looking at the encoded 'st' values themselves be a straightforward way to understand their effects?

I am also wondering whether a partial dependence plot (PDP) could be used to further explore the effect of each state.

Thanks again for your insightful post. Looking forward to more of it.

msahil515 avatar Jun 15 '23 13:06 msahil515

@msahil515 Yes, you can get the encoded value associated with each level of st by tidying the prepped recipe. Check out how I do that in this similar post -- look for tidy().
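As a minimal sketch of that tidy() approach (not the code from the post): the data frame `tornado_toy` and its values below are made up for illustration, and the recipe assumes an effect encoding step like `step_lencode_glm()` from the embed package with `mag` treated as a numeric outcome.

```r
library(recipes)
library(dplyr)
library(embed)

# Hypothetical toy data standing in for the tornado dataset
set.seed(123)
tornado_toy <- data.frame(
  st  = factor(sample(c("TX", "OK", "KS", "FL"), 200, replace = TRUE)),
  len = runif(200, 0, 20)
)
# Simulated outcome: two states get a higher average magnitude
tornado_toy$mag <- as.numeric(tornado_toy$st %in% c("OK", "KS")) +
  rnorm(200, sd = 0.5)

# Effect (likelihood) encoding of st, estimated from the outcome
rec <- recipe(mag ~ st + len, data = tornado_toy) %>%
  step_lencode_glm(st, outcome = vars(mag))

# The encodings are only estimated once the recipe is prepped
prepped <- prep(rec)

# tidy() on the first step returns one row per level of st with its encoded value
st_encodings <- tidy(prepped, number = 1)

# Sort so the states with the largest encoded values come first
arrange(st_encodings, desc(value))
```

The `level` column holds the original state abbreviations and `value` holds the number the model actually sees, so this table is the mapping back from encoded values to states.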

You could also use a partial dependence profile to examine the results more. I like using model_profile() from DALEX, as shown here.

juliasilge avatar Jun 15 '23 15:06 juliasilge

Hello Julia

I would like to use a different encoding method for categorical variables, similar to the internal PCA ordering method used by ranger (adapted from Coppersmith). It is target-based, so it needs to be done within each fold rather than before splitting the data. How could I incorporate this into a recipe step, please?

Many thanks!

smithhelen avatar Jun 27 '23 22:06 smithhelen

@smithhelen Take a look at this article on how to create your own recipe step.

juliasilge avatar Jul 02 '23 21:07 juliasilge

Hello Julia, congrats on your impressive work.

I have a question about the grid in tune_race_anova(). Is the grid the total number of combinations of the levels of trees, min_n, and mtry? Or will each of these hyperparameters get 15 levels, so that the total grid has 15^3 combinations?

Thank you.

robsonpro avatar Jul 08 '23 19:07 robsonpro

@robsonpro Ah no, if you set grid = 15, the way it works is to choose a grid_max_entropy() with 15 elements total. You can read more about this kind of behavior in this chapter, and especially this section. Notice where it says:

The default design used by the tune package is the maximum entropy design.

You can provide your own grid in that argument, using any of the kinds of grid specifications outlined in that chapter. If you use the default or do something like grid = 10, it will do a maximum entropy grid with 10 elements.
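To make the difference concrete, here is a small sketch using dials directly rather than going through tune. The `mtry()` range is an assumption chosen for illustration (in a real workflow it would be finalized from the data), and newer releases of dials point to `grid_space_filling()` in place of `grid_max_entropy()`, so you may see a deprecation warning.

```r
library(dials)

set.seed(2023)

# A maximum entropy design: `size` is the total number of candidates,
# spread across the three-dimensional parameter space
entropy_grid <- grid_max_entropy(
  trees(),
  min_n(),
  mtry(range = c(1L, 30L)),  # assumed range, for illustration only
  size = 15
)
nrow(entropy_grid)  # at most 15 rows (duplicate combinations are dropped)

# For contrast, a regular grid with 15 levels per parameter
# crosses all levels, giving up to 15^3 = 3375 combinations
full_grid <- grid_regular(
  trees(),
  min_n(),
  mtry(range = c(1L, 30L)),
  levels = 15
)
nrow(full_grid)
```

Either data frame can be passed to the grid argument of a tuning function if you want explicit control over the candidates.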

juliasilge avatar Jul 08 '23 22:07 juliasilge

Thank you so much for your attention and explanation, @juliasilge. I get it now.

robsonpro avatar Jul 11 '23 10:07 robsonpro