VariantSpark
AIR - unbiased Gini-based importance score - Algorithm
This procedure provides a Gini-based variable importance method that corrects the bias caused by differing numbers of categories (the minor-allele-frequency bias in GWAS) and also shows promising results on correlation issues.
The idea is to create a pseudo-variable for each variable in the dataset by permuting the variable's values and adding it to the model. Then run the random forest model and subtract the importance of each pseudo-variable from that of its original variable (thereby subtracting the bias).
The addition of variables is done only conceptually: in practice no variables are added to the model, which saves runtime and memory.
A link to the paper: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty373/4994791
The procedure is implemented in R in ranger. Here is an example of how to use it: ranger(data = data, dependent.variable.name = "y", importance = "impurity_corrected")
The "impurity_corrected" option selects the bias-corrected impurity importance.
The procedure works as follows:
- Before fitting the RF, a single random reordering of the sample IDs is performed
- instead of sampling mTry variables from {1,...,p}, we sample mTry variables from {1,...,2p}
- at a given node in a tree we choose the splitting variable and split value in the regular manner, except that if the index i sampled (in the subset from step 2) satisfies 1<=i<=p we use variable X_(i) as usual, and if p+1<=i<=2p we use the pseudo-variable X_(i)* = X_(i-p) with the sample IDs permuted as in step 1 (i.e. the labels are permuted).
- Calculate the mean Gini decrease (Gini importance) I_G as usual for each of the 2p variables. This results in 2p importance scores.
- Calculate the new importance score of X_(i) as AIR(X_(i)) = I_G(X_(i)) - I_G(X_(i)*)
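To make the steps concrete, here is a toy sketch that replaces the random forest with a single split per binary variable (all names are illustrative, not VariantSpark's or ranger's API). The informative variable keeps a large score after subtracting its pseudo-variable's score, while the noise variable does not:

```python
import random

def gini(labels):
    # Gini impurity of a list of 0/1 labels
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def gini_decrease(x, y):
    # Impurity decrease of a split on a binary feature x:
    # parent Gini minus the weighted Gini of the two child nodes.
    n = len(y)
    left = [y[i] for i in range(n) if x[i] == 0]
    right = [y[i] for i in range(n) if x[i] == 1]
    child = (len(left) * gini(left) + len(right) * gini(right)) / n
    return gini(y) - child

rng = random.Random(0)
n = 200
y = [0] * (n // 2) + [1] * (n // 2)
x_info = y[:]                        # a perfectly informative variable
x_noise = [i % 2 for i in range(n)]  # an uninformative variable

# Step 1: one random reordering of the sample IDs, shared by all pseudo-variables.
perm = list(range(n))
rng.shuffle(perm)

def air(x):
    # AIR(X_i) = I_G(X_i) - I_G(X_i*), where X_i* uses the permuted sample IDs.
    x_pseudo = [x[perm[i]] for i in range(n)]
    return gini_decrease(x, y) - gini_decrease(x_pseudo, y)

print(air(x_info) > air(x_noise))  # True
```

A full forest averages these decreases over many nodes and trees; the single-stump version only illustrates the subtraction.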
Now we want to produce a null distribution for computing p-values. For that we take all the negative AIR importances and mirror them, creating a set consisting of the negative scores together with their absolute values. We use this set of importance scores to compute an empirical cumulative distribution.
We then check where each variable's importance falls in this distribution; that is the p-value for that score. (In simpler words: we have a list of scores made of the negatives and the absolute values of the negatives; we order this list, and the p-value of an importance score is its rank in this list divided by the length of the list.)
Since the list resulting from the mirroring (counting all the variables that were not given a score as 0, which can be added to this list) should be very large (roughly the number of variables in the model), we can extract very small p-values from it, which should be enough. If it is still not enough, it might be worth running the procedure twice, thereby doubling the number of scores in the estimated null distribution (in which case the importance scores of the two runs could also be averaged to make the results more stable).
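A minimal sketch of the mirrored-null p-value computation described above (the helper name is hypothetical, and the plain rank/length estimator is used, without the +1 correction some implementations prefer):

```python
import bisect

def air_p_values(scores):
    # Empirical null by mirroring: every non-positive AIR score contributes
    # both itself and its absolute value.
    nonpos = [s for s in scores if s <= 0]
    null = sorted(nonpos + [-s for s in nonpos])
    n = len(null)
    # Upper-tail empirical p-value: fraction of null scores >= the observed score.
    return [(n - bisect.bisect_left(null, s)) / n for s in scores]

scores = [0.9, 0.05, -0.05, 0.0, -0.2, 0.21]
pvals = air_p_values(scores)
```

Note that a score larger than every null value gets a p-value of exactly 0 with this estimator; a (rank+1)/(n+1) variant avoids that if needed.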
I'm happy to answer any questions about it, as I figure my explanation might not be the best :)
I have implemented a basic version of AIR (with the -ic option).
In addition, the -icsr option can be used to set the random seed for the label permutation so that importance values from multiple runs can be combined (averaging).
Some notes:
- the permutation of labels is really a technical shorthand for permuting variables without actually storing the permuted versions (all variables are permuted in the same way). However, this introduces extra complexity in split calculation, as the samples in the current node need to be re-mapped to the permuted samples. Given this, it might actually be better for VariantSpark to store the permuted values of the variables, to speed up execution.
- the distribution of AIR importances from VS seems (at least for around 2000 variables) less symmetrical than the one from Ranger. I suspect this is due to some non-uniformity introduced by the way VS selects variables for splits. It should be less of a problem for larger numbers of variables, but caution is needed before using the Janitza method. (More investigation is needed here.)
- computing AIR with different permutations to model the null distribution better seems like an interesting idea. The exact procedure is that all the non-positive values from all runs should be combined (as they are) to estimate the empirical null distribution, and then the averaged values should be used to estimate confidence levels (possibly only for variables that were never reported with negative importance).
Hi Piotr. Good to hear that you implemented AIR as part of variantSpark. Can't wait to hear of some results. Some responses to your notes:
- I reckon the technical solution of permuting the labels alone is crucial to the way AIR works, firstly and mainly because it saves a lot of storage space (as you mentioned), which is a major issue in GWAS, but also because it gives better results: if only the labels are permuted for the pseudo-variables, the correlation between the pseudo-variables is similar to the correlation between the original ones, and therefore artifacts caused by it are reduced as well. I didn't understand why it requires a mapping for every node rather than one main mapping for all the variables (from a variable to its pseudo-variable).
- I don't know exactly how VS chooses variables for splits, but if it results in the importance distribution not being symmetric, doesn't that imply some VS-specific bias?
- regarding computing AIR multiple times: one of the major points of AIR is that it needs to be executed only once, which saves runtime. However, I agree that, when possible, executing it a few times makes the results more reliable. That said, I would use it a bit differently (and if it works it could even be written up in a publication): I would average the different executions only for those variables that have a positive average, and for the negative importances, instead of averaging them, I would use all of them to create the null distribution. This way we gain the possibility of inspecting p-values that are three times smaller than with the original null distribution.
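This pooling scheme could be sketched as follows (hypothetical helper, assuming one list of per-variable AIR scores per run): negatives from all runs go into one mirrored null, while the reported scores are per-variable averages restricted to positive values.

```python
def pooled_air(runs):
    # runs: one list of AIR scores per repeated execution.
    k, nvars = len(runs), len(runs[0])
    # Pool every non-positive score from every run into one larger null set...
    nonpos = [s for run in runs for s in run if s <= 0]
    null = sorted(nonpos + [-s for s in nonpos])
    # ...and average the scores across runs, keeping only the variables
    # whose average is positive.
    avg = [sum(run[i] for run in runs) / k for i in range(nvars)]
    positives = {i: a for i, a in enumerate(avg) if a > 0}
    return positives, null

runs = [
    [0.8, -0.1, -0.02],
    [0.7, 0.05, -0.03],
    [0.9, -0.2, 0.01],
]
positives, null = pooled_air(runs)
```

With three runs the pooled null holds roughly three times as many scores as a single run, which is what allows the smaller p-values mentioned above.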
Great to hear from you guys again, Amnon
On Fri, Aug 2, 2019 at 3:48 AM piotrszul [email protected] wrote:
Hi Amnon, good to hear from you and thanks for your comments :)
- the main feature of AIR is that all the variables are permuted in the same way (using the same reordering) to create pseudo-variables with a correlation structure similar to the real variables (as you mentioned). That could also be achieved by creating permuted copies of each variable and then building a normal forest on the extended variable set. Whether or not that is feasible for GWAS is debatable: the memory size may be a limitation for a standalone machine but not so much for a cluster. Even if we were to fit, say, 10K samples by 10M variants in RAM, that is about 100GB of data at one byte per variant, and it's easy to get machines with 512GB of RAM; for a cluster it's easy to get terabytes of RAM if needed.
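For reference, the arithmetic behind the ~100GB figure:

```python
# One byte per genotype call, 10K samples by 10M variants:
samples, variants = 10_000, 10_000_000
total_bytes = samples * variants
print(total_bytes / 10**9)  # 100.0 (GB, decimal)
```

Doubling the variable set with stored pseudo-variables would double this to ~200GB.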
- permutation of labels is a technical trick to achieve the same as above but without actually instantiating the pseudo-variables (to save space). But there is an extra computational cost: accessing the variable data (not only the labels) now needs to be dereferenced through the ordering. This is how ranger does it:
```cpp
class DataDouble: public Data {
  double get(size_t row, size_t col) const override {
    // Use permuted data for corrected impurity importance
    size_t col_permuted = col;
    if (col >= num_cols) {
      col = getUnpermutedVarID(col);
      row = getPermutedSampleID(row);
    }
    if (col < num_cols_no_snp) {
      return data[col * num_rows + row];
    } else {
      return getSnp(row, col, col_permuted);
    }
  }
};
```
an optimization here is to do this dereferencing once per node, as otherwise it will be done for every tested variable that needs it.
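The once-per-node remapping could look roughly like this (Python sketch with made-up names, not the actual VariantSpark or ranger code):

```python
def node_columns(node_samples, candidate_vars, data, perm, p):
    # data[j] holds the values of original variable j (0 <= j < p);
    # indices p..2p-1 denote the permuted pseudo-variables.
    permuted = [perm[s] for s in node_samples]  # dereference once per node
    cols = {}
    for j in candidate_vars:
        if j < p:
            cols[j] = [data[j][s] for s in node_samples]
        else:
            # pseudo-variable: values of variable j - p at permuted sample IDs
            cols[j] = [data[j - p][s] for s in permuted]
    return cols

data = [[10, 20, 30], [1, 2, 3]]
perm = [2, 0, 1]
cols = node_columns([0, 2], [0, 2], data, perm, p=2)
```

Here the permutation is applied once to the node's sample list instead of once per (variable, sample) lookup, which is the saving described above.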
- yes, I agree that running AIR many times can be a good way to get a more precise estimation of the null distribution, and that all non-positive values from all runs should be used there as they are (without averaging). So there is no difference between your approach and what I described (perhaps imprecisely).
@amnonbleich I am very curious whether you have any thoughts on how well AIR deals with correlation.
As I understand, it removes bias due to different types of variables or differences in MAF, but to what extent can it deal with variables that vary (significantly) in their correlation structure, say as defined by the number of variables they are strongly correlated with?
Intuitively I would think that for informative variables with the same effect, the importance of variables with fewer correlates should be higher than that of variables with more correlates, and thus with a single null distribution they would present with different p-values. And I am not sure that they should.