MLDataUtils.jl

Naming Convention for FeatureNormalizer

Open asbisen opened this issue 7 years ago • 1 comment

FeatureNormalizer transforms the matrix X using (X - μ)/σ, which corresponds to StandardScaler in Scikit-Learn, whereas the normalize method in Scikit-Learn scales the data to unit norm. I was wondering if we should rename FeatureNormalizer to FeatureStandardizer, or something to that effect.
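
To make the distinction concrete, here is a minimal Julia sketch of the two operations using only Statistics and LinearAlgebra; it is not the package's actual implementation, and the orientation (features in rows, observations in columns) follows the MLDataUtils convention discussed below:

```julia
using Statistics, LinearAlgebra

X = rand(3, 5)   # toy data: 3 features (rows) × 5 observations (columns)

# Standardization, (X .- μ) ./ σ — what FeatureNormalizer / StandardScaler compute.
μ = mean(X, dims=2)   # per-feature mean
σ = std(X, dims=2)    # per-feature standard deviation
X_std = (X .- μ) ./ σ

# Unit-norm normalization — what Scikit-Learn's normalize does:
# each observation (column) is divided by its own norm so it has length 1.
X_unit = mapslices(col -> col ./ norm(col), X, dims=1)
```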

Also, is there a reason for FeatureNormalizer to expect the matrix with the features represented in rows rather than columns?

For the last issue, I don't know which way is correct, Scikit-Learn's or MLDataUtils', but there is a slight inconsistency in how StandardScaler in Scikit-Learn calculates the standard deviation versus MLDataUtils: Scikit-Learn scales the sum by n, while we scale it by n - 1.
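
For concreteness, the n vs. n - 1 difference corresponds to the corrected keyword of Julia's std; this is a plain Statistics snippet, not MLDataUtils code:

```julia
using Statistics

x = [1.0, 2.0, 3.0, 4.0]

std(x)                   # default corrected=true: divides by n - 1 (current MLDataUtils behaviour)
std(x, corrected=false)  # divides by n, matching Scikit-Learn's StandardScaler
```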

Reference: Scikit-Learn Standardize
Reference: Scikit-Learn Normalize

asbisen avatar Mar 10 '17 06:03 asbisen

Hi! All good feedback. The FeatureNormalizer is quite old and a little outdated; I will rewrite it at some point. I think it would be a good idea to give results consistent with either Scikit-Learn or Caret (the R package), neither of which I checked when I wrote this.

The row-vs-column question has to do with Julia's column-major array memory order, but after a rewrite it will be possible to choose the observation dimension, similar to how LossFunctions allows it.
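
To illustrate the memory-layout point (plain Julia, not MLDataUtils API): Julia stores arrays column-major, so with observations as columns each observation occupies a contiguous block of memory and can be taken out without copying.

```julia
X = rand(10, 1_000)        # 10 features × 1000 observations

obs = view(X, :, 1)        # one observation: contiguous in memory, no copy

Xt = permutedims(X)        # same data with observations as rows instead
obs_row = view(Xt, 1, :)   # same observation, but strided access, less cache-friendly
```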

Evizero avatar Mar 10 '17 13:03 Evizero