[WIP] Add the confidence interval computation
This is the current implementation of ratio modeling for feature selection of random effects in Photon. I follow the algorithm described in the original publication, with some adjustments made based on discussion with Yiming and Alex.
- [x] Unit tests all pass.
- [x] Integration tests all pass.
The algorithm as actually implemented (tied to the codebase, not only the mathematical expression) is as follows:

1. Pass in the `featureStatisticSummary`.
2. Identify the binomial columns.
3. Compute the lower bound for the binomial columns based on the t value.
4. Select features based only on the following lower-bound criterion (non-binomial and intercept columns are kept automatically):
```scala
if (t < 1) {
  T_l = 1 / T_u
}
if (T_l > 1D) {
  // select feature
}
```
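For concreteness, here is a minimal sketch of the computation behind this criterion. It assumes a Katz-style log-ratio confidence interval for the ratio of two binomial proportions; the exact variance term should be taken from the paper, and all names are illustrative rather than the actual photon-ml API:

```scala
import scala.math.{exp, sqrt}

// Sketch: lower bound for the ratio T = (x/m) / (y/n).
// x out of m occurrences in the numerator set, y out of n in the
// denominator set, z the normal-quantile hyperparameter.
// Assumes 0 < x and 0 < y < n (edge cases are discussed below).
def ratioLowerBound(x: Double, m: Double, y: Double, n: Double, z: Double): Double = {
  val t = (x / m) / (y / n)                             // point estimate of the ratio
  val se = sqrt(1.0 / x - 1.0 / m + 1.0 / y - 1.0 / n)  // std. error of log(t)
  val tL = t * exp(-z * se)                             // lower confidence bound
  val tU = t * exp(z * se)                              // upper confidence bound
  // Flip when t < 1, so deviation from 1 is measured on the same side.
  if (t < 1) 1.0 / tU else tL
}

// Selection: keep the feature only if the (flipped) lower bound exceeds 1.
def isSelected(lowerBound: Double): Boolean = lowerBound > 1D
```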
As a WIP commit, there are things to polish in the near future, since we are currently focused on the feasibility of this experimental method and on minimizing user-side changes:

- [x] Unit tests do not yet cover all feature selection scenarios. Currently the binomial cases are not selected, and we need to craft data that covers all cases.
- [x] The binomial feature column identification predicate needs to be stronger. The current solution is inherently flawed; we need more computation at the feature summary stage to make it reliable.
- [ ] Hyperparameter interface design: for convenience, the user-side interface for passing in the normal distribution quantile and the lower-bound threshold needs a redesign.
- [x] The relationship with Pearson correlation feature selection: we need another parameter to choose between the feature selection algorithms, or to mix them at a later stage.
- [x] The crafted test data needs some changes; currently some unneeded feature summary entries are not carefully addressed.
- [ ] Further experiment report and benchmark report after regression tests.
- [x] The way we currently keep non-binary and intercept columns is not good for the feature ranking report planned later; it needs a redesign.
@joshvfleming @ashelkovnykov
Fixed the issues, but unit tests are still undone. Early update for pending reviews.
Pushed with all tests passing and issues resolved.
Unit tests and integration tests passed. Rebased #390 onto this commit.
Should we take the case where `x == m && y == n` seriously, as the original paper did, and approximate the result using `x = m - 0.5, y = n - 0.5`? Or should we follow @ashelkovnykov's opinion that this feature is intuitively useless, and output a 0 lower bound?
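To make the two options concrete, a hypothetical comparison reusing the `ratioLowerBound` sketch from the description (all names illustrative):

```scala
// Option 1: the paper's continuity correction for x == m && y == n.
def boundWithCorrection(x: Long, m: Long, y: Long, n: Long, z: Double): Double =
  if (x == m && y == n) ratioLowerBound(m - 0.5, m, n - 0.5, n, z)
  else ratioLowerBound(x, m, y, n, z)

// Option 2: treat such a feature as uninformative and output a 0 lower bound.
def boundWithZero(x: Long, m: Long, y: Long, n: Long, z: Double): Double =
  if (x == m && y == n) 0.0
  else ratioLowerBound(x, m, y, n, z)
```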
@mayiming @joshvfleming
@YazhiGao @joshvfleming @mayiming
I'm set in my position that following the paper 100% is incorrect, due to our specific application:
| X | Y | lower bound |
|---|---|---|
| 0 | 0 | Hardcode to 0 - this feature is never used |
| x | 0 | Impossible since X ⊂ Y |
| m | 0 | Impossible since X ⊂ Y |
| 0 | y | Regular case - needs a special value to avoid a 'divide by zero' error during variance computation |
| x | y | Regular case |
| m | y | Regular case |
| 0 | n | Impossible since X ⊂ Y |
| x | n | Impossible since X ⊂ Y |
| m | n | Hardcode to 0 - the lower bound approaches 1 from below, so the feature will never be selected |
Thus, the code could look something like this:
```scala
if (y == 0 || y == n) {
  0
} else {
  val x = max(x_raw, X_MIN)
  ...
}
```
The X_MIN in the paper is 0.5 - this value seems arbitrary and I wonder whether or not we should decrease it.
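Filling in the elided part of that snippet, a minimal end-to-end sketch under the same assumptions as above (Katz-style variance term; `X_MIN = 0.5` as in the paper; names are illustrative, not the actual photon-ml API):

```scala
import scala.math.{exp, sqrt, max}

val X_MIN = 0.5

def lowerBound(xRaw: Double, m: Double, y: Double, n: Double, z: Double): Double = {
  if (y == 0 || y == n) {
    // The hardcoded rows from the table: (0, 0) is a never-used feature, and
    // (m, n) has a bound approaching 1 from below, so neither is selected.
    0.0
  } else {
    val x = max(xRaw, X_MIN)  // guards against divide-by-zero when x == 0
    val t = (x / m) / (y / n)
    val se = sqrt(1.0 / x - 1.0 / m + 1.0 / y - 1.0 / n)
    t * exp(-z * se)
  }
}
```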
@ashelkovnykov is right. The formulation in the paper is between any two ratios, but in our case one set is a subset of the other. So there are cases that can't arise, and because of this, the code can be much simpler.
Fixed the issues except for the test case redesign; all current tests pass.
Ready for experiments now.