MLPreprocessing.jl
DEPRECATED
This package is deprecated. Please use TableTransforms.jl instead.
MLPreprocessing
Overview
Utility package that provides user-friendly methods for feature scaling and polynomial
basis expansion. Feature scaling works on Matrix, Vector and DataFrame inputs. Observations
may be stored as either the columns or the rows of a matrix. To distinguish between these cases,
provide the parameter obsdim, where obsdim=1 corresponds to "observations as rows" and
obsdim=2 to "observations as columns". Transformations can be restricted to a subset
of columns/rows by passing a vector operate_on.
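To make the obsdim convention concrete, here is a minimal sketch that uses only Julia's Statistics standard library, not this package's API. The helper name feature_means is hypothetical and exists only for illustration. It shows that the same data gives the same per-feature statistics in either layout, as long as obsdim matches the layout.

```julia
using Statistics

# Hypothetical helper (not part of MLPreprocessing): per-feature means
# under the obsdim convention described above.
# obsdim=1: each row is an observation, so features are columns.
# obsdim=2: each column is an observation, so features are rows.
feature_means(X::AbstractMatrix; obsdim::Int=2) = vec(mean(X; dims=obsdim))

X = [1.0 2.0;
     3.0 4.0;
     5.0 6.0]   # 3 observations stored as rows, 2 features

m_rows = feature_means(X; obsdim=1)               # average over the 3 rows
m_cols = feature_means(permutedims(X); obsdim=2)  # same data, observations as columns
```

Both calls reduce over the observation axis, which is why the reduction dimension passed to mean equals obsdim.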
StandardScaler
Standardizing a data set results in variables with a mean of 0 and a variance of 1.
A common use case is to fit a StandardScaler to the training data and later
apply the same transformation to the test data. StandardScaler is used with the
functions fit(), transform() and fit_transform() as shown below.
fit(StandardScaler, X[, μ, σ; obsdim, operate_on])
fit_transform(StandardScaler, X[, μ, σ; obsdim, operate_on])
X          : Data of type Matrix or DataFrame.
μ          : Vector or scalar describing the translation.
             Defaults to mean(X; dims=obsdim).
σ          : Vector or scalar describing the scale.
             Defaults to std(X; dims=obsdim).
obsdim     : Specify which axis corresponds to observations.
             Defaults to obsdim=2 (observations are columns of the matrix).
             For DataFrames obsdim is ignored and rescaling occurs
             column-wise.
operate_on : Specify the indices of the columns or rows to be scaled.
             Defaults to all columns/rows.
             For DataFrames this must be a vector of symbols, not indices.
             E.g. operate_on=[1,3] will scale only the columns
             with index 1 and 3 (if obsdim=1, else rows 1 and 3).
Note on DataFrames:
Columns containing missing values are skipped.
Columns containing non-numeric elements are skipped.
Examples:
Xtrain = rand(100, 4)
Xtest = rand(10, 4)
x = rand(4)
Dtrain = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10])
Dtest = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10])
scaler = fit(StandardScaler, Xtrain)
scaler = fit(StandardScaler, Xtrain, obsdim=1)
scaler = fit(StandardScaler, Xtrain, obsdim=1, operate_on=[1,3])
transform(Xtest, scaler)
transform!(Xtest, scaler)
transform(x, scaler)
transform!(x, scaler)
scaler = fit(StandardScaler, Dtrain)
scaler = fit(StandardScaler, Dtrain, operate_on=[:A,:B])
transform(Dtest, scaler)
transform!(Dtest, scaler)
Xscaled, scaler = fit_transform(StandardScaler, Xtrain, obsdim=1, operate_on=[1,2,4])
scaler = fit_transform!(StandardScaler, Xtrain, obsdim=1, operate_on=[1,2,4])
Note that transform! requires the data matrix X to have an element type <: AbstractFloat,
because the scaling occurs in place (e.g. X cannot be of type Matrix{Int64}). transform
has no such restriction.
For DataFrames, transform! can also be used on columns of type <: Integer.
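The fit-then-transform pattern described above can be sketched in plain Julia with the Statistics standard library. This is an illustration of the idea, not the package's implementation: fit_standard and apply_standard are hypothetical names, observations are assumed to be rows, and Statistics.std uses the n-1 corrected estimator by default, which may or may not match what the package computes.

```julia
using Statistics

# Hypothetical helpers, not package functions (observations as rows).
# "Fitting" records the per-column mean and standard deviation.
fit_standard(X::AbstractMatrix) = (μ = mean(X; dims=1), σ = std(X; dims=1))

# "Transforming" applies stored statistics to any matrix with the same columns.
apply_standard(X::AbstractMatrix, p) = (X .- p.μ) ./ p.σ

Xtrain = [1.0 10.0; 2.0 20.0; 3.0 30.0]
Xtest  = [2.0 40.0]

p = fit_standard(Xtrain)            # statistics come from the training data only
Ztrain = apply_standard(Xtrain, p)  # columns now have mean 0 and std 1
Ztest  = apply_standard(Xtest, p)   # test data reuses the training μ and σ
```

Because the test data is scaled with the training statistics, its columns will generally not have mean 0 and variance 1 themselves.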
FixedRangeScaler
FixedRangeScaler is used with the functions fit(), transform() and fit_transform()
to scale data in a Matrix X or DataFrame to a fixed range [lower, upper].
After a FixedRangeScaler has been fitted to one data set, it can be used to apply the same
transformation to a new set of data, e.g. fit the FixedRangeScaler to your training
data and then apply the scaling to the test data at a later stage (see the examples below).
fit(FixedRangeScaler, X[, lower, upper; obsdim, operate_on])
fit_transform(FixedRangeScaler, X[, lower, upper; obsdim, operate_on])
X          : Data of type Matrix or DataFrame.
lower      : (Scalar) Lower limit of the new range.
             Defaults to 0.
upper      : (Scalar) Upper limit of the new range.
             Defaults to 1.
obsdim     : Specify which axis corresponds to observations.
             Defaults to obsdim=2 (observations are columns of the matrix).
             For DataFrames obsdim is ignored and rescaling occurs
             column-wise.
operate_on : Specify the indices of the columns or rows to be rescaled.
             Defaults to all columns/rows.
             For DataFrames this must be a vector of symbols, not indices.
             E.g. operate_on=[1,3] will rescale only the columns
             with index 1 and 3 (if obsdim=1, else rows 1 and 3).
Note on DataFrames:
Columns containing missing values are skipped.
Columns containing non-numeric elements are skipped.
Examples:
Xtrain = rand(100, 4)
Xtest = rand(10, 4)
x = rand(10)
D = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10])
scaler = fit(FixedRangeScaler, Xtrain)
scaler = fit(FixedRangeScaler, Xtrain, -1, 1)
scaler = fit(FixedRangeScaler, Xtrain, -1, 1, obsdim=1)
scaler = fit(FixedRangeScaler, Xtrain, -1, 1, obsdim=1, operate_on=[1,3])
scaler = fit(FixedRangeScaler, D, -1, 1, operate_on=[:A,:B])
Xscaled = transform(Xtest, scaler)
transform!(Xtest, scaler)
Xscaled, scaler = fit_transform(FixedRangeScaler, Xtrain, -1, 1, obsdim=1, operate_on=[1,2,4])
scaler = fit_transform!(FixedRangeScaler, Xtrain, -1, 1, obsdim=1, operate_on=[1,2,4])
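For reference, fixed-range scaling is commonly computed as a min-max rescaling. The sketch below assumes that formula and uses plain Julia only; rescale_columns is a hypothetical helper, not a function of this package, and observations are assumed to be rows.

```julia
# Hypothetical helper: min-max rescaling of each column to [lower, upper],
# using column minima and maxima supplied by the caller.
function rescale_columns(X::AbstractMatrix, xmin, xmax; lower=0.0, upper=1.0)
    return lower .+ (X .- xmin) ./ (xmax .- xmin) .* (upper - lower)
end

Xtrain = [0.0 10.0; 5.0 20.0; 10.0 30.0]
xmin = minimum(Xtrain; dims=1)   # "fit": record the training minima ...
xmax = maximum(Xtrain; dims=1)   # ... and maxima per column

Xscaled = rescale_columns(Xtrain, xmin, xmax; lower=-1.0, upper=1.0)

# New data reuses the training minima and maxima, so values outside the
# training range land outside [lower, upper].
Xnew = rescale_columns([15.0 20.0], xmin, xmax; lower=-1.0, upper=1.0)
```

A constant column would make xmax - xmin zero in this sketch, so a real implementation needs to handle that case.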
Lower Level Functions
The lower level functions on which StandardScaler and FixedRangeScaler are built can also
be used separately.
center!()
μ = center!(X[, μ; obsdim, operate_on])
Shift X or D along obsdim by μ according to X = X - μ,
where X is of type Matrix or Vector and D of type DataFrame.
fixedrange!()
lower, upper, xmin, xmax = fixedrange!(X[, lower, upper, xmin, xmax; obsdim, operate_on])
Normalize X or D along obsdim to the interval [lower, upper],
where X is of type Matrix or Vector and D of type DataFrame.
If lower and upper are omitted, the default range is [0, 1].
standardize!()
μ, σ = standardize!(X[, μ, σ; obsdim, operate_on])
Standardize X along obsdim according to X = (X - μ) / σ.
If μ and σ are omitted, they are computed from X so that the variables have a mean of zero and a variance of one.
Polynomial Basis Expansion
M = expand_poly(x[, degree=5, obsdim])
Perform a polynomial basis expansion of the given degree for the vector x.
julia> expand_poly(1:5, degree=3)
3×5 Array{Float64,2}:
1.0 2.0 3.0 4.0 5.0
1.0 4.0 9.0 16.0 25.0
1.0 8.0 27.0 64.0 125.0
julia> expand_poly(1:5, degree=3, obsdim=1)
5×3 Array{Float64,2}:
1.0 1.0 1.0
2.0 4.0 8.0
3.0 9.0 27.0
4.0 16.0 64.0
5.0 25.0 125.0
julia> expand_poly(1:5, 3, ObsDim.First()); # same but type-stable
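The expansion shown above can be reproduced with a plain array comprehension. This is a sketch for illustration, not the package's implementation, and poly_basis is a hypothetical name. Note that, like the output above, it starts at degree 1 and has no constant (degree 0) row.

```julia
# Hypothetical stand-in for illustration: row d holds the elements of x
# raised to the power d, for d = 1, ..., degree.
poly_basis(x::AbstractVector, degree::Integer) =
    [float(xi)^d for d in 1:degree, xi in x]

M  = poly_basis(1:5, 3)   # 3×5 matrix, one column per observation
Mt = permutedims(M)       # 5×3 matrix, one row per observation
```

M corresponds to the default layout in the first REPL example, and Mt to the obsdim=1 layout in the second.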