photon-ml icon indicating copy to clipboard operation
photon-ml copied to clipboard

[WIP]add the confidence interval computation

Open YazhiGao opened this issue 7 years ago • 8 comments

this is currently the implementation of ratio modeling for feature selection of random effect in photon. I follow the algorithm described in the original publication but some twists are made according to discussion with yiming and alex.

  • [x] Unit tests all pass.

  • [x] Integration tests all pass.

The algorithm in reality(highly related with codebase instead of only mathematical expression) is as follows:

1.pass in the featureStatisticSummary 2.identify the binomial columns 3.compute the lowerbound for binomial columns based on the t value 4.select the feature based on only the following lowerbound criterion(non-binomial and intercept columns are kept automatically)

if (t < 1) {
  T_l = 1 / T_u
}

if (T_l > 1D) {
  //  select feature
}

As a WIP commit, there are things to polish in near future since we currently focus on the feasibility of this experimental method and try to minimize user-side changes :

  • [x] unit tests not fully covering all scenarios of feature selection. Currently the binomial cases are not selected, we need to craft some data that covering all cases.

  • [x] binomial feature column identification predicate needs to be stronger. Current solution is inherently flawed, we need more computation at feature summary stage to ensure this one.

  • [ ] hyperparameter interface design. for convenience purposes, the user side interface for pass in normal distribution quartile and lowerbound threshold hyperparameter redesign.

  • [x] the relationship with pearson correlation feature selection. We need another parameter to decide on the algorithm of feature selection or mix them in later stage.

  • [x] crafted test data need some change, currently some unneeded feature summary entries are not carefully addressed.

  • [ ] further experiment report and benchmark report after regression tests

  • [x] the way we currently keep non-binary and intercept columns is not good for further feature ranking report planned. need redesign

@joshvfleming @ashelkovnykov

YazhiGao avatar Aug 14 '18 18:08 YazhiGao

fixed issues but unit tests undone. early update for pending reviews.

YazhiGao avatar Aug 23 '18 17:08 YazhiGao

pushed with all tests passed and issues resolved.

YazhiGao avatar Aug 27 '18 22:08 YazhiGao

unit tests and integration tests passed. rebased #390 to this commit.

YazhiGao avatar Sep 07 '18 22:09 YazhiGao

Should we take the case where x==m && y==n seriously like the original paper did to approximate the result using x = m -0.5, y = n - 0.5 or follow @ashelkovnykov 's opinion that this feature intuitively useless and we should output 0 lowerbound. @mayiming @joshvfleming

YazhiGao avatar Sep 07 '18 22:09 YazhiGao

@YazhiGao @joshvfleming @mayiming

I'm set in my position that following the paper 100% is incorrect, due to our specific application:

X Y lower bound
0 0 Hardcode to 0 - this feature is never used
x 0 Impossible since X ⊂ Y
m 0 Impossible since X ⊂ Y
0 y Regular case - need special value so no 'divide by zero' error during variance computation
x y Regular case
m y Regular case
0 n Impossible since X ⊂ Y
x n Impossible since X ⊂ Y
m n Harcode to 0 - the lower bound approaches 1 from the bottom, thus feature will never be selected

Thus, the code can looks something like this:

if (y == 0 || y == n) {
  0
} else {
  val x = max(x_raw, X_MIN)
  ...
}

The X_MIN in the paper is 0.5 - this value seems arbitrary and I wonder whether or not we should decrease it.

ashelkovnykov avatar Sep 10 '18 19:09 ashelkovnykov

@ashelkovnykov is right. The formulation in the paper is between any two ratios, but in our case one set is a subset of the other. So there are cases that can't arise, and because of this, the code can be much simpler.

joshvfleming avatar Sep 10 '18 20:09 joshvfleming

fixed issues except for test cases redesign, all current tests passed.

YazhiGao avatar Sep 17 '18 17:09 YazhiGao

ready for experiment now.

YazhiGao avatar Sep 19 '18 20:09 YazhiGao