boruta_py

Question: High Collinearity, how does Boruta handle?

Open GinoWoz1 opened this issue 6 years ago • 2 comments

Hello,

Thanks for the package, I found it quite interesting.

When there are variables that are highly correlated, could that affect the Z-scores?

I ask because in the past I have seen groups of highly correlated variables where the variables within a group varied widely in their importance.

Would it make sense to handle the collinearity problem before running Boruta?
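To make the idea of handling collinearity beforehand concrete, here is a minimal sketch of one possible pre-filter: greedily keep a feature only if its absolute Pearson correlation with every already-kept feature stays at or below a threshold. The helper names `pearson` and `drop_collinear`, the 0.9 threshold, and the synthetic data are all assumptions for illustration; none of this is part of boruta_py.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / math.sqrt(var_a * var_b)


def drop_collinear(columns, threshold=0.9):
    """Keep a feature only if |r| with every kept feature is <= threshold.

    `columns` maps feature name -> list of values. Order matters: the
    first member of a correlated group is the one that survives.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical data: x2 is a near-duplicate of x1, x3 is independent.
rng = random.Random(0)
x1 = [rng.gauss(0, 1) for _ in range(500)]
x2 = [v + rng.gauss(0, 0.1) for v in x1]
x3 = [rng.gauss(0, 1) for _ in range(500)]

selected = drop_collinear({"x1": x1, "x2": x2, "x3": x3})
print(selected)  # ['x1', 'x3']
```

Note that a filter like this decides which member of a correlated group survives purely by column order, so it may discard the variable that is easier to interpret; whether that trade-off is acceptable depends on the use case.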

Sincerely, G

GinoWoz1, Oct 22 '18 16:10

https://stats.stackexchange.com/questions/94130/does-boruta-feature-selection-in-r-take-into-account-the-correlation-between-v

danielhomola, Oct 22 '18 18:10

Thanks @danielhomola, I appreciate you taking the time to reply. That answer helps to explain that Boruta will always bring up the most important predictors. My question was about the robustness of the Z-score estimate of the null distribution versus the regular variables; sorry I wasn't clear.

In my case, I have seen random forests with a certain set of features where, for example, 2 features are highly correlated and one of them is #2 in feature importance while the other is dead last. If I removed one other variable that wasn't one of the highly correlated variables, the importance scores would shift around, more so for the highly correlated variables.

My interest is in whether the Z-score of the noise distribution would pick up the phenomenon above and not count out the groups of highly correlated variables because of the sometimes noisy feature importance scores. I had seen this issue with feature importance scores before, and how they can be unreliable at times, and found the article at the bottom, which helped to explain it a little.
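For intuition about the comparison against the noise distribution, here is a rough sketch of the shadow-feature mechanics: each real column is shuffled to create a shadow copy with no relationship to the target, and a real feature scores a "hit" in an iteration when its importance beats the best shadow importance. This is only a sketch under loud assumptions: it uses the absolute Pearson correlation with the target as a stand-in importance measure instead of random forest importances, it omits the statistical test Boruta applies to the hit counts, and the names `abs_corr` and `shadow_hits` and the synthetic data are hypothetical.

```python
import math
import random


def abs_corr(a, b):
    """Absolute Pearson correlation, used here as a stand-in importance."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return abs(cov / math.sqrt(var_a * var_b))


def shadow_hits(columns, y, n_iter=20, seed=0):
    """Count how often each real feature beats the best shadow feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(n_iter):
        # Shadow features: shuffled copies of the real columns, so any
        # association with y that remains is due to chance alone.
        shadow_scores = []
        for values in columns.values():
            shuffled = values[:]
            rng.shuffle(shuffled)
            shadow_scores.append(abs_corr(shuffled, y))
        threshold = max(shadow_scores)
        for name, values in columns.items():
            if abs_corr(values, y) > threshold:
                hits[name] += 1
    return hits


# Hypothetical data: x1 drives y, x2 is highly correlated with x1,
# x3 is pure noise.
rng = random.Random(1)
x1 = [rng.gauss(0, 1) for _ in range(200)]
x2 = [v + rng.gauss(0, 0.1) for v in x1]
x3 = [rng.gauss(0, 1) for _ in range(200)]
y = [v + rng.gauss(0, 0.5) for v in x1]

hits = shadow_hits({"x1": x1, "x2": x2, "x3": x3}, y)
print(hits)
```

Because the stand-in importance is computed one feature at a time, both correlated features clear the shadow threshold here. A random forest can split or shift importance between correlated features, which is exactly the effect asked about, so this sketch shows the comparison mechanics rather than settling the question.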

I am still a little ignorant of the Boruta method, although I have read the paper. I'm just trying to get a better intuition for how it works and for the interaction effects of the random forest feature importance scores (I realize you can also use other estimators). Thanks for your patience.

Article describing how feature importance scores can be finicky at times:

http://explained.ai/rf-importance/index.html

GinoWoz1, Oct 22 '18 19:10