Support feature selection
Feature selection
Feature selection is the process of selecting a subset of influential features from a larger set of features. It is an important technique for improving results, shortening training time, and making features human-understandable.
Currently, the following is the temporary I/F:
Candidates for internal selection methods
- chi2 (for non-negative data only)
- SNR
- ~~mRMR~~
Common
[UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>
Input
| `X::array<number>` | `Y::array<number>` |
|---|---|
| a row of matrix | a row of matrix |
Output
| `array<array<double>>` |
|---|
| dot(X.T, Y), shape = (X.#cols, Y.#cols) |
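For intuition, here is a minimal numpy model of what the aggregate computes (illustrative only, not the Hivemall implementation): each input row contributes the outer product of its two arrays, and the final result is the sum over all rows, i.e., dot(X.T, Y).

```python
import numpy as np

# Hypothetical batch model of the transpose_and_dot UDAF (not Hivemall code).
def transpose_and_dot(X, Y):
    X = np.asarray(X, dtype=float)  # shape = (#rows, X.#cols)
    Y = np.asarray(Y, dtype=float)  # shape = (#rows, Y.#cols)
    return X.T @ Y                  # shape = (X.#cols, Y.#cols)

# e.g., one-hot labels Y and features X give per-class feature sums:
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Y = np.array([[1, 0], [0, 1], [0, 1]])
print(transpose_and_dot(Y, X))  # [[1. 2.], [8. 10.]]
```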
[UDF] select_k_best(X::array<number>, importance_list::array<number>, k::int)::array<double>
Input
| `X::array<number>` | `importance_list::array<number>` | `k::int` |
|---|---|---|
| a feature vector | the larger, the more important | top-k |
Output
| `array<double>` |
|---|
| the top-k elements from X, based on the indices of importance_list |
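A minimal numpy model of this selection (illustrative only): keep the k entries of X at the positions of the k largest importance scores. Whether the original feature order is preserved is an assumption here.

```python
import numpy as np

# Hypothetical model of select_k_best (not Hivemall code).
def select_k_best(x, importance_list, k):
    x = np.asarray(x)
    top_k = np.argsort(importance_list)[::-1][:k]  # indices of the k largest scores
    return x[np.sort(top_k)]  # original feature order assumed to be preserved

print(select_k_best([10.0, 20.0, 30.0, 40.0], [3.6, 10.8, 116.2, 67.2], 2))
# -> [30. 40.] (features with the two largest scores)
```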
/***********************************************************************
Note
- The current implementation expects that _`importance_list` and `k` are
equal in ALL rows_. This may be confusing.
- Future workaround: add an option indicating the use of a common
`importance_list` and `k`
***********************************************************************/
chi2
[UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>
Input
both observed and expected, shape = (#classes, #features)
| `observed::array<array<number>>` | `expected::array<array<number>>` |
|---|---|
| observed features | expected features, dot(class_prob.T, feature_count) |
Output
| `struct<array<double>, array<double>>` |
|---|
| chi2-values and p-values for each feature, each with shape = (1, #features) |
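For each feature $j$ over $C$ classes, this presumably corresponds to the standard one-way chi-squared test (also the definition scikit-learn uses):

$$\chi^2_j = \sum_{c=1}^{C} \frac{\left(O_{c,j} - E_{c,j}\right)^2}{E_{c,j}}, \qquad p_j = \Pr\!\left[\chi^2_{C-1} \ge \chi^2_j\right]$$

where $O$ is observed, $E$ is expected, and $\chi^2_{C-1}$ denotes a chi-squared variable with $C-1$ degrees of freedom.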
Example - chi2
CREATE TABLE input (
X array<double>, -- features
Y array<int> -- binarized label
);
WITH stats AS (
SELECT
-- [UDAF] transpose_and_dot(Y::array<number>, X::array<number>)::array<array<double>>
transpose_and_dot(Y, X) AS observed, -- array<array<double>>, shape = (n_classes, n_features)
array_sum(X) AS feature_count, -- array<double>, shape = (1, n_features)
array_avg(Y) AS class_prob -- array<double>, shape = (1, n_classes)
FROM
input
),
test AS (
SELECT
transpose_and_dot(class_prob, feature_count) AS expected -- array<array<double>>, shape = (n_classes, n_features)
FROM
stats
),
chi2 AS (
SELECT
-- [UDF] chi2(observed::array<array<double>>, expected::array<array<double>>)::struct<array<double>, array<double>>
chi2(observed, expected) AS chi2s -- struct<array<double>, array<double>>, each shape = (1, n_features)
FROM
test CROSS JOIN stats
)
SELECT
-- [UDF] select_k_best(X::array<number>, importance_list::array<number>, k::int)::array<double>
select_k_best(X, chi2s.chi2, 2) -- top-2 feature selection based on chi2 score
FROM
input CROSS JOIN chi2;
SNR
[UDAF] snr(X::array<number>, Y::array<int>)::array<double>
Input
| `X::array<number>` | `Y::array<int>` |
|---|---|
| a row of the feature matrix, overall shape = (#samples, #features) | a row of the one-hot label matrix, overall shape = (#samples, #classes) |
Output
| `array<double>` |
|---|
| SNR values of each feature, shape = (1, #features) |
Note
- One-hot vectorization is not strictly necessary, but the interface is aligned with chi2's
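For reference, one per-feature definition consistent with the validation figures later in this thread (an assumption, not taken from the implementation) sums the two-class SNR over all class pairs:

$$\mathrm{SNR}_j = \sum_{i < i'} \frac{\left|\mu_{i,j} - \mu_{i',j}\right|}{\sigma_{i,j} + \sigma_{i',j}}$$

where $\mu_{i,j}$ and $\sigma_{i,j}$ are the mean and standard deviation of feature $j$ within class $i$; for two classes this reduces to the usual $|\mu_1 - \mu_2| / (\sigma_1 + \sigma_2)$.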
Example - snr
CREATE TABLE input (
X array<double>, -- features
Y array<int> -- binarized label
);
WITH snr AS (
-- [UDAF] snr(features::array<number>, labels::array<int>)::array<double>
SELECT snr(X, Y) AS snr FROM input -- aggregated SNR as array<double>, shape = (1, #features)
)
SELECT select_k_best(X, snr, ${k}) FROM input CROSS JOIN snr;
@amaya382 It depends on the input format of each algorithm. If we can use the same input format for every algorithm, it would be better to provide a single feature_selection function with an algorithm option.
Also, how to apply distributed processing is another issue. mRMR is processed in parallel via MapReduce in some implementations.
@amaya382 we can do better as follows:
X::relation(array<double>) # n_examples * n_features matrix
Y::relation(array<int>) # n_examples * n_classes matrix
observed = dotp(Y.T, X) # n_classes * n_features matrix
UDAF transpose_and_dot(Y, X)::array<array<double>>
feature_count = X.sum # n_features col vector
UDAF array_sum(X)::array<double>
class_prob = Y.mean # n_classes col vector
UDAF array_avg(Y)::array<double>
expected = dotp(class_prob.T, feature_count) # n_classes * n_features matrix
UDAF transpose_and_dot(class_prob, feature_count)::array<array<double>>
chi2, pval = chisquare(observed, expected) # n_classes * n_features => n_features
UDAF chi2(observed::array<array<double>>, expected::array<array<double>>)::struct<array<double>,array<double>>
chi2 # n_features col vector
pval # n_features col vector
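A numpy sketch of the same pipeline (assumed semantics, illustrative only; scipy's chi-squared survival function stands in for the p-value step, and the function name is hypothetical):

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist

# Hypothetical batch version of the pipeline above (not Hivemall code).
def chi2_pipeline(X, Y):
    """X: (n_examples, n_features) features, Y: (n_examples, n_classes) one-hot."""
    observed = Y.T @ X                              # n_classes * n_features
    feature_count = X.sum(axis=0)                   # n_features
    class_prob = Y.mean(axis=0)                     # n_classes
    expected = np.outer(class_prob, feature_count)  # n_classes * n_features
    chi2_vals = ((observed - expected) ** 2 / expected).sum(axis=0)
    p_vals = chi2_dist.sf(chi2_vals, df=Y.shape[1] - 1)
    return chi2_vals, p_vals
```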
create table input (
X array<double>, -- features
Y array<int> -- binarized label
);
WITH stats AS (
select
-- UDAF transpose_and_dot(Y::array<double>, X::array<double>)::array<array<double>>
transpose_and_dot(Y, X) as observed, -- array<array<double>> # n_classes * n_features matrix
array_sum(X) as feature_count, -- n_features col vector # array<double>
array_avg(Y) as class_prob -- n_classes col vector # array<double>
from
input
),
test as (
select
observed,
-- UDAF transpose_and_dot(class_prob::array<double>, feature_count::array<double>)::array<array<double>>
transpose_and_dot(class_prob, feature_count) as expected -- array<array<double>> # n_classes * n_features matrix
from
stats
),
chi2 as (
select
-- UDAF chi2(observed::array<array<double>>, expected::array<array<double>>)::struct<array<double>,array<double>>
chi2(observed, expected) as chi2s -- struct<array<double>,array<double>> # each n_features
from
test
)
select
select_k_best(X, T.chi2s.chi2, ${k}) as X, -- feature selection based on chi2 score
Y
from
input
CROSS JOIN chi2 T
;
What you have to develop is just the transpose_and_dot UDAF and the chi2 UDAF.
transpose_and_dot is a bit tricky but can be implemented incrementally.
@myui okay, I'll try this strategy
@amaya382 could you validate your implementation against another chi2 implementation (e.g., scikit-learn) in the unit and system tests?
@myui
chi2 (Iris dataset)
Results from Hivemall via the system test, which actually executes the query:

| | f0 | f1 | f2 | f3 |
|---|---|---|---|---|
| chi2_vals | 10.817820878493995 | 3.5944990176817315 | 116.16984746363957 | 67.24482558215503 |
| p_vals | 0.004476514990225833 | 0.16575416718561453 | 0.0 | 2.55351295663786E-15 |
Results from scikit-learn:

| | f0 | f1 | f2 | f3 |
|---|---|---|---|---|
| chi2_vals | 10.81782088 | 3.59449902 | 116.16984746 | 67.24482759 |
| p_vals | 4.47651499e-03 | 1.65754167e-01 | 5.94344354e-26 | 2.50017968e-15 |
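For reference, the scikit-learn column above can be reproduced with something like:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

X, y = load_iris(return_X_y=True)
chi2_vals, p_vals = chi2(X, y)
print(chi2_vals)  # [ 10.81782088   3.59449902 116.16984746  67.24482759]
print(p_vals)     # [4.47651499e-03 1.65754167e-01 5.94344354e-26 2.50017968e-15]
```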
@myui
SNR (Iris dataset)
Results from Hivemall via the system test, which actually executes the query (incremental algorithm):

| | f0 | f1 | f2 | f3 |
|---|---|---|---|---|
| aggregated SNR | 3.2700426804423826 | 1.9034599450433007 | 11.359577906085278 | 9.794847727147216 |
Results from Python/numpy (batch algorithm):

| | f0 | f1 | f2 | f3 |
|---|---|---|---|---|
| aggregated SNR | 3.27004268 | 1.90345995 | 11.35957791 | 9.79484773 |
Also, this has already been tested on EMR and worked properly.
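Likewise, a batch sketch that reproduces the numpy figures above, assuming the pairwise-sum SNR definition with population (ddof=0) standard deviations (an assumption consistent with the reported values, not taken from the implementation):

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes = np.unique(y)
mu = np.array([X[y == c].mean(axis=0) for c in classes])  # per-class means
sd = np.array([X[y == c].std(axis=0) for c in classes])   # per-class stds (ddof=0)
snr = sum(np.abs(mu[i] - mu[j]) / (sd[i] + sd[j])
          for i in range(len(classes)) for j in range(i + 1, len(classes)))
print(snr)  # approx. [ 3.27  1.90 11.36  9.79]
```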