machine-learning
Median absolute deviation feature selection
@gwaygenomics presented evidence that median absolute deviation (MAD) feature selection (selecting genes with the highest MADs) can eliminate most features without hurting performance: https://github.com/cognoma/machine-learning/pull/18#issuecomment-236265506. In fact, it appears that performance increased with the feature selection, which could make sense if the selection enriched for predictive features, increasing the signal-to-noise ratio.
Therefore, I think we should investigate this method of feature selection further. Specifically, I'm curious whether:
- @gwaygenomics' findings hold true for outcomes other than RAS?
- MAD is better than MAD / median? I suspect raw MAD is biased against genes that are lowly expressed but still variable.
- MAD outperforms random selection of the same feature set size?
- MAD performs well for other algorithms besides logistic regression?
I'm labeling this issue a task, so please investigate if you feel inclined.
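To make the comparison concrete, here is a minimal numpy sketch of the three selection strategies raised above (raw MAD, the normalized MAD / median variant, and a random baseline of the same size). The synthetic lognormal matrix stands in for an expression dataset; none of these names come from the repo.

```python
import numpy as np

def mad(X):
    """Median absolute deviation of each column (gene)."""
    return np.median(np.abs(X - np.median(X, axis=0)), axis=0)

rng = np.random.default_rng(0)
X = rng.lognormal(size=(100, 2000))  # synthetic stand-in for expression data

k = 500
mad_scores = mad(X)
top_mad = np.argsort(mad_scores)[-k:]  # indices of the k highest-MAD genes

# Normalized variant: MAD / median, guarding against zero medians
medians = np.median(X, axis=0)
norm_scores = mad_scores / np.where(medians == 0, 1.0, medians)
top_norm = np.argsort(norm_scores)[-k:]

# Random baseline with the same feature-set size
random_k = rng.choice(X.shape[1], size=k, replace=False)
```

Fitting the same classifier on each of the three index sets would answer the MAD-vs-random and MAD-vs-normalized questions directly.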
In 34225ccdefa191287ca153fc14c73bb4eaa6706d -- an example for classifying TP53 mutation -- we did not apply MAD feature selection (notebook). In a8ae61147897aed4a3883853563b357644cbc5f3 (pull request #25), @yl565 selected the top 500 MAD genes (notebook).
Before MAD feature selection, training AUROC was 95.9% and testing AUROC was 93.5%. After MAD feature selection, training AUROC was 89.9% and testing AUROC was 87.9%. @yl565, did anything else change in your pull request that would negatively affect performance? If not, I think we may have an example of 500 MAD genes being detrimental. See @gwaygenomics' analysis for benchmarking on RAS mutations: 500 genes appears to be borderline dangerous.
Since a pipeline has been used, only X_train is used for feature selection and standardization. This decreases AUROC, but I think it better reflects reality: we want the classifier to predict whether the gene is mutated for a single patient, so X_test in reality is only 1 sample. Using the entire dataset X for feature selection and standardization would cause overfitting. This figure compares the differences in testing AUROC with varying numbers of features selected by MAD.
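A minimal scikit-learn sketch of the train-only setup described above: a custom transformer (MADSelector is a hypothetical helper, not part of the repo) learns the top-k MAD genes from X_train inside a Pipeline, so feature selection and scaling never see X_test.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

class MADSelector(BaseEstimator, TransformerMixin):
    """Keep the k columns with the highest median absolute deviation,
    learned from whatever data is passed to fit (i.e. the training split)."""
    def __init__(self, k=500):
        self.k = k

    def fit(self, X, y=None):
        med = np.median(X, axis=0)
        mad = np.median(np.abs(X - med), axis=0)
        self.keep_ = np.argsort(mad)[-self.k:]
        return self

    def transform(self, X):
        return X[:, self.keep_]

pipe = Pipeline([
    ('select', MADSelector(k=500)),
    ('scale', StandardScaler()),
    ('clf', LogisticRegression()),
])
# pipe.fit(X_train, y_train) learns selection and scaling from X_train only;
# pipe.predict(X_test) then applies those same learned transforms.
```

Because everything is learned inside the Pipeline, cross-validation over `pipe` also keeps each fold's selection honest.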
@yl565, really informative analysis. Can you share the source code? Check out GitHub gists if you want a quick way to host a single notebook. Also, I'd love to see the graph extended to all ~20,000 genes.
I'm having some trouble comprehending why performance drops off when you feature select and scale on X_train only. I wouldn't think our unsupervised selection and scaling would cause overfitting, and X_test is only 10% of the samples. Do you have any insight?
Because there are differences in distribution between the training and testing sets. This figure shows the genes with the largest differences between the training and testing data. I guess ~7,000 samples are not enough to represent the gene variation of the population.
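One way such a figure could be produced (an assumption on my part, since the plotting code isn't shown here) is to rank genes by the shift in their median between the two splits:

```python
import numpy as np

def train_test_shift(X_train, X_test):
    """Rank gene indices by absolute difference in medians between
    the training and testing splits, most-shifted first."""
    shift = np.abs(np.median(X_train, axis=0) - np.median(X_test, axis=0))
    return np.argsort(shift)[::-1]
```

Plotting the distributions of the top-ranked genes from `train_test_shift` would reproduce the kind of train/test mismatch described above.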
Here is the code: https://gist.github.com/yl565/1a978e358a00dea573590e0456dfc1b2#file-1-tcga-mlexample-effectoffeaturenumbers-ipynb