machine-learning
Multiple comparisons problems
I'm still working my way through the paper published by @gwaygenomics, @allaway and @cgreene, but it made me think of an issue that I believe we should try to deal with in our final product. In the paper they had a specific hypothesis that they tested; however, we are going to provide people with the ability to test out hypotheses on thousands of different mutations.
There are some problems with this ability, such as selective reporting (much like publication bias). Many genes are bound to give uninteresting results (AUROC ≈ 0.5) that people will tend to gloss over. I can very easily imagine a scenario where someone iterates through many different genes until they reach one where a model appears to do a good job at predicting a mutation, even though that result could be due to chance alone.
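To make the concern concrete, here is a minimal simulation sketch. It is not based on our data or pipeline: the "genes" are hypothetical, the labels are a balanced synthetic mutation status, and the classifier scores are pure noise, so every model is uninformative by construction. The sample size and gene count are arbitrary assumptions.

```python
import random


def auroc(scores, labels):
    """AUROC as the probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count pairwise wins; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def null_aurocs(n_genes=1000, n_samples=100, seed=0):
    """AUROCs for n_genes hypothetical genes whose scores are pure noise."""
    rng = random.Random(seed)
    # Balanced synthetic mutation labels (an assumption for illustration).
    labels = [1] * (n_samples // 2) + [0] * (n_samples - n_samples // 2)
    results = []
    for _ in range(n_genes):
        scores = [rng.random() for _ in labels]  # scores unrelated to labels
        results.append(auroc(scores, labels))
    return results


aurocs = null_aurocs()
mean_auroc = sum(aurocs) / len(aurocs)
best_auroc = max(aurocs)
print(f"mean AUROC: {mean_auroc:.3f}, best AUROC: {best_auroc:.3f}")
```

The mean sits near 0.5, as expected, but the best of the 1000 noise-only "models" lands well above it. Someone who browses genes until one looks good would effectively be reporting that maximum.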
We could approach this issue in a few different ways:
- hold out some data for validation -- only to be used for publication
- apply some sort of correction (e.g. Bonferroni)
- place strong emphasis on effect sizes
- list a clear disclaimer
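As a rough sketch of what the correction option could look like, the snippet below applies a Bonferroni adjustment. It assumes we can attach a p-value to each gene's model (for example from a permutation test on the AUROC), which we have not decided on. The p-values and the count of 1000 tests are made up for illustration, and `bonferroni` is a hypothetical helper, not part of any existing code.

```python
def bonferroni(p_values, n_tests=None):
    """Return Bonferroni-adjusted p-values, capped at 1.0.

    n_tests should count every hypothesis examined, including the
    uninteresting ones that were never reported.
    """
    if n_tests is None:
        n_tests = len(p_values)
    return {gene: min(1.0, p * n_tests) for gene, p in p_values.items()}


# Illustrative (made-up) p-values for three of 1000 genes a user looked at.
raw = {"TP53": 0.00001, "KRAS": 0.004, "GENE_X": 0.03}
adjusted = bonferroni(raw, n_tests=1000)

alpha = 0.05
significant = [gene for gene, p in adjusted.items() if p < alpha]
print(adjusted)
print("significant after correction:", significant)
```

All three raw p-values fall below 0.05, but only one survives once the 1000 tests are accounted for. Bonferroni is conservative, so a less strict procedure such as Benjamini-Hochberg may be worth discussing as well. In either case we would need to track how many genes a user has actually queried.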
I wanted to open this issue up so we can discuss the importance of the problem and possible solutions.