MLDataUtils.jl
Naming Convention for FeatureNormalizer
`FeatureNormalizer` transforms the matrix `X` using `(X - μ) / σ`, which corresponds to `StandardScaler` in Scikit-Learn, whereas the `normalize` method in Scikit-Learn scales the data to a unit norm. I was wondering if we should rename `FeatureNormalizer` to `FeatureStandardizer` or something to that effect.
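To make the naming concern concrete, here is a minimal sketch of the two operations in plain Julia (not the package API; the layout with features as rows and observations as columns is only an assumption for the example):

```julia
using Statistics, LinearAlgebra

X = rand(3, 5)  # hypothetical data: 3 features (rows) × 5 observations (columns)

# Standardization, i.e. what FeatureNormalizer / StandardScaler compute:
# subtract the per-feature mean and divide by the per-feature standard deviation
μ = mean(X, dims=2)
σ = std(X, dims=2)
X_standardized = (X .- μ) ./ σ

# Normalization in the Scikit-Learn sense: rescale each observation
# (here a column) to unit Euclidean norm
X_unitnorm = mapslices(x -> x / norm(x), X, dims=1)
```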
Also, is there a reason for having `FeatureNormalizer` expect the matrix such that the features are represented in rows and not columns?
And for the last issue, I don't know whether Scikit or MLDataUtils is the correct one, but there is a slight inconsistency between how `StandardScaler` in Scikit calculates the standard deviation and how MLDataUtils does: Scikit scales the sum by `n`, while we scale it by `n - 1` when calculating the standard deviation.
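For reference, the two conventions map onto the `corrected` keyword of Julia's `std` (just a quick illustration, not MLDataUtils code):

```julia
using Statistics

x = [1.0, 2.0, 3.0, 4.0]

# Julia's default is the sample standard deviation (sum scaled by n-1),
# which is what MLDataUtils currently uses
std(x)                     # same as std(x, corrected=true)

# Scikit-Learn's StandardScaler uses the population standard deviation
# (sum scaled by n); the equivalent in Julia:
std(x, corrected=false)
```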
References: Scikit-Learn Standardize, Scikit-Learn Normalize
Hi! All good feedback. The `FeatureNormalizer` is quite old and a little outdated. I will rewrite it at some point. I think it would be a good idea to give consistent results with either Scikit-learn or Caret (R package), neither of which I checked when I wrote this.
The row vs. column thing has to do with Julia's column-major array memory order, but after a rewrite it will be possible to choose the observation dimension, similar to how `LossFunctions` allows it.
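To illustrate the memory-order point (a plain-Julia sketch, assuming observations are stored as columns):

```julia
using Statistics

# Julia arrays are column-major: the elements of each column are contiguous
# in memory. Storing one observation per column therefore makes a single
# observation a fast, copy-free, contiguous slice:
X = rand(3, 1000)      # 3 features × 1000 observations
obs = view(X, :, 1)    # first observation, contiguous view, no copy

# Per-feature statistics then reduce across the observation dimension:
μ = mean(X, dims=2)
σ = std(X, dims=2)
```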