[WIP] Add the confidence interval computation
This is the current implementation of ratio modeling for feature selection of random effects in Photon. I follow the algorithm described in the original publication, with some adjustments made based on discussion with Yiming and Alex.
- [x] Unit tests all pass.
- [x] Integration tests all pass.
The algorithm as actually implemented (tied to the codebase, not only the mathematical expression) is as follows:

1. Pass in the `featureStatisticSummary`.
2. Identify the binomial columns.
3. Compute the lower bound for the binomial columns based on the t value.
4. Select features based only on the following lower-bound criterion (non-binomial and intercept columns are kept automatically):
```scala
if (t < 1) {
  T_l = 1 / T_u
}
if (T_l > 1D) {
  // select feature
}
```
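For concreteness, here is a minimal sketch of the computation behind this criterion. It assumes a Katz-style log-ratio confidence interval for the ratio of two binomial proportions; the exact variance term should be taken from the paper, and all names are illustrative rather than the actual photon-ml API:

```scala
import scala.math.{exp, sqrt}

// Sketch: lower bound for the ratio T = (x/m) / (y/n).
// x out of m occurrences in the numerator set, y out of n in the
// denominator set, z the normal-quantile hyperparameter.
// Assumes 0 < x and 0 < y < n (edge cases are discussed below).
def ratioLowerBound(x: Double, m: Double, y: Double, n: Double, z: Double): Double = {
  val t = (x / m) / (y / n)                             // point estimate of the ratio
  val se = sqrt(1.0 / x - 1.0 / m + 1.0 / y - 1.0 / n)  // std. error of log(t)
  val tL = t * exp(-z * se)                             // lower confidence bound
  val tU = t * exp(z * se)                              // upper confidence bound
  // Flip when t < 1, so deviation from 1 is measured on the same side.
  if (t < 1) 1.0 / tU else tL
}

// Selection: keep the feature only if the (flipped) lower bound exceeds 1.
def isSelected(lowerBound: Double): Boolean = lowerBound > 1D
```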
As a WIP commit, there are things to polish in the near future, since we are currently focused on the feasibility of this experimental method and on minimizing user-side changes:

- [x] Unit tests do not yet cover all feature selection scenarios. Currently the binomial cases are not selected, and we need to craft data that covers all cases.
- [x] The binomial feature column identification predicate needs to be stronger. The current solution is inherently flawed; we need more computation at the feature summary stage to make it reliable.
- [ ] Hyperparameter interface design: for convenience, the user-side interface for passing in the normal distribution quantile and the lower-bound threshold needs a redesign.
- [x] The relationship with Pearson correlation feature selection: we need another parameter to choose between the feature selection algorithms, or to mix them at a later stage.
- [x] The crafted test data needs some changes; currently some unneeded feature summary entries are not carefully addressed.
- [ ] Further experiment report and benchmark report after regression tests.
- [x] The way we currently keep non-binary and intercept columns is not good for the feature ranking report planned later; it needs a redesign.
@joshvfleming @ashelkovnykov
Fixed the issues, but unit tests are still undone. Early update for pending reviews.
Pushed with all tests passing and issues resolved.
Unit tests and integration tests passed. Rebased #390 onto this commit.
Should we take the case where `x == m && y == n` seriously, as the original paper did, and approximate the result using `x = m - 0.5, y = n - 0.5`? Or should we follow @ashelkovnykov's opinion that this feature is intuitively useless, and output a 0 lower bound?
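To make the two options concrete, a hypothetical comparison reusing the `ratioLowerBound` sketch from the description (all names illustrative):

```scala
// Option 1: the paper's continuity correction for x == m && y == n.
def boundWithCorrection(x: Long, m: Long, y: Long, n: Long, z: Double): Double =
  if (x == m && y == n) ratioLowerBound(m - 0.5, m, n - 0.5, n, z)
  else ratioLowerBound(x, m, y, n, z)

// Option 2: treat such a feature as uninformative and output a 0 lower bound.
def boundWithZero(x: Long, m: Long, y: Long, n: Long, z: Double): Double =
  if (x == m && y == n) 0.0
  else ratioLowerBound(x, m, y, n, z)
```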
@mayiming @joshvfleming
@YazhiGao @joshvfleming @mayiming
I'm set in my position that following the paper 100% is incorrect, due to our specific application:
| X | Y | lower bound |
|---|---|---|
| 0 | 0 | Hardcode to 0 - this feature is never used |
| x | 0 | Impossible since X ⊂ Y |
| m | 0 | Impossible since X ⊂ Y |
| 0 | y | Regular case - needs a special value to avoid a 'divide by zero' error during variance computation |
| x | y | Regular case |
| m | y | Regular case |
| 0 | n | Impossible since X ⊂ Y |
| x | n | Impossible since X ⊂ Y |
| m | n | Hardcode to 0 - the lower bound approaches 1 from below, so the feature will never be selected |
Thus, the code could look something like this:
```scala
if (y == 0 || y == n) {
  0
} else {
  val x = max(x_raw, X_MIN)
  ...
}
```
The X_MIN in the paper is 0.5 - this value seems arbitrary and I wonder whether or not we should decrease it.
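Filling in the elided part of that snippet, a minimal end-to-end sketch under the same assumptions as above (Katz-style variance term; `X_MIN = 0.5` as in the paper; names are illustrative, not the actual photon-ml API):

```scala
import scala.math.{exp, sqrt, max}

val X_MIN = 0.5

def lowerBound(xRaw: Double, m: Double, y: Double, n: Double, z: Double): Double = {
  if (y == 0 || y == n) {
    // The hardcoded rows from the table: (0, 0) is a never-used feature, and
    // (m, n) has a bound approaching 1 from below, so neither is selected.
    0.0
  } else {
    val x = max(xRaw, X_MIN)  // guards against divide-by-zero when x == 0
    val t = (x / m) / (y / n)
    val se = sqrt(1.0 / x - 1.0 / m + 1.0 / y - 1.0 / n)
    t * exp(-z * se)
  }
}
```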
@ashelkovnykov is right. The formulation in the paper is between any two ratios, but in our case one set is a subset of the other. So there are cases that can't arise, and because of this, the code can be much simpler.
Fixed the issues except for the test case redesign; all current tests pass.
Ready for experiments now.