
Recommend which transform will produce the most Gaussian distribution

Open • kwinkunks opened this issue 2 years ago • 1 comment

Could we look at features and targets and recommend suitable nonlinear transformations to make them more amenable to learning?

I think scipy.stats.boxcox() should work:

from scipy import stats

xt, lmbda = stats.boxcox(x)

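xt is the transformed data and lmbda is the fitted lambda parameter, that is, the value that maximizes the log-likelihood function. The closer lmbda is to 1, the less transformation the data needs: at exactly 1 the transform is only a shift, so the data are already about as close to normal as a power transform can make them. If it's close to 2, you should square the data; if it's close to 0.5, take the square root; if it's close to 0, take the log; etcetera.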
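To make the lambda interpretation concrete, here is a minimal sketch on synthetic data. The lognormal sample and the use of skewness as a rough "how Gaussian is it" check are assumptions for illustration, not anything Redflag does today.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive, right-skewed data (an assumption for this demo).
rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=0.8, size=2000)

# With no lmbda given, boxcox() fits lambda by maximum likelihood.
xt, lmbda = stats.boxcox(x)

# Skewness near 0 is one crude indicator of a more symmetric distribution.
skew_before = stats.skew(x)
skew_after = stats.skew(xt)

print(f"lambda = {lmbda:.2f}")
print(f"skew before = {skew_before:.2f}, after = {skew_after:.2f}")
```

For lognormal data the fitted lambda should land near 0, which corresponds to recommending a log transform.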

UPDATE

Box-Cox only works on strictly positive data. Turns out there's Yeo-Johnson, which is similar but works on zero and negative values too.
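A quick sketch of the difference, assuming a synthetic shifted-exponential sample that includes negative values. scipy.stats.boxcox() raises a ValueError on such data, while scipy.stats.yeojohnson() fits a lambda in the same way.

```python
import numpy as np
from scipy import stats

# Right-skewed synthetic data shifted so that some values are negative.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000) - 1.0

# Box-Cox rejects data that is not strictly positive.
try:
    stats.boxcox(x)
    boxcox_ok = True
except ValueError:
    boxcox_ok = False

# Yeo-Johnson accepts any real values and also returns the fitted lambda.
xt, lmbda = stats.yeojohnson(x)

print(f"Box-Cox usable: {boxcox_ok}")
print(f"Yeo-Johnson lambda = {lmbda:.2f}")
print(f"skew before = {stats.skew(x):.2f}, after = {stats.skew(xt):.2f}")
```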

Question: this should probably be done before standardizing the data? Not sure.

Turns out both Box-Cox and Yeo-Johnson are in sklearn too:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html
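A minimal sketch of the sklearn route, on a made-up two-column feature matrix. One thing that may bear on the standardizing question: PowerTransformer has a standardize option (True by default) that scales the output to zero mean and unit variance after the power transform, which suggests transform first, then standardize.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical feature matrix: one skewed positive column, one roughly
# Gaussian column that includes negative values.
rng = np.random.default_rng(1)
X = np.column_stack([
    rng.lognormal(size=500),
    rng.normal(size=500),
])

# Yeo-Johnson handles both columns; standardize=True rescales the
# transformed output to zero mean and unit variance.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
Xt = pt.fit_transform(X)

# One fitted lambda per feature.
print(pt.lambdas_)
print(Xt.mean(axis=0), Xt.std(axis=0))
```

The per-feature values in lambdas_ could be the basis for a recommendation: a lambda near 1 means the feature probably needs no transformation.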

kwinkunks, Aug 16 '23 14:08

:bulb: Will need to consider that Redflag's stdev-based outlier detection won't work on features that need transformation... should apply transformation before deciding on outliers. Probably needs an issue.
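A rough illustration of the problem, on synthetic lognormal data. The helper count_stdev_outliers is hypothetical and is not Redflag's actual outlier function; it just flags points more than three standard deviations from the mean.

```python
import numpy as np
from scipy import stats

def count_stdev_outliers(a, n_sigma=3.0):
    """Hypothetical helper: count points more than n_sigma stdevs from the mean."""
    z = (a - np.mean(a)) / np.std(a)
    return int(np.sum(np.abs(z) > n_sigma))

# Skewed synthetic feature with no true outliers, only a long right tail.
rng = np.random.default_rng(7)
x = rng.lognormal(sigma=1.0, size=5000)

# Transform first, then apply the same stdev rule.
xt, _ = stats.boxcox(x)

raw_count = count_stdev_outliers(x)
transformed_count = count_stdev_outliers(xt)

print(f"flagged on raw data: {raw_count}")
print(f"flagged after Box-Cox: {transformed_count}")
```

On the raw data the long tail gets flagged wholesale, so the count after transformation should be much lower.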

kwinkunks, Aug 16 '23 14:08