Multiple comparisons problems

Open: patrick-miller opened this issue 7 years ago • 1 comment

I'm still working my way through the paper published by @gwaygenomics, @allaway and @cgreene, but it made me think of an issue that I believe we should try to deal with in our final product. In the paper they had a specific hypothesis that they tested; however, we are going to provide people with the ability to test out hypotheses on thousands of different mutations.

This ability comes with some problems, such as non-response bias. Many genes are bound to give uninteresting results (AUROC near 0.5, i.e. no better than chance), and people will tend to gloss over them. I can very easily imagine a scenario where someone iterates through many different genes until they reach one where a model does a good job of predicting a mutation, without accounting for how many genes they tried along the way.
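To make the concern concrete, here is a minimal simulation sketch. It assumes 500 hypothetical genes, 100 samples each, and classifier scores that are pure noise, so every true AUROC is 0.5. The function names, sample sizes, and seed are illustrative assumptions, not anything from the project.

```python
import random

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def simulate_null_aurocs(n_genes=500, n_samples=100, seed=0):
    """AUROCs for hypothetical genes whose scores carry no signal at all."""
    rng = random.Random(seed)
    # Half mutated (1), half wild-type (0); under the null the exact labels do not matter.
    labels = [1] * (n_samples // 2) + [0] * (n_samples - n_samples // 2)
    results = []
    for _ in range(n_genes):
        scores = [rng.random() for _ in range(n_samples)]  # noise, unrelated to labels
        results.append(auroc(scores, labels))
    return results

aurocs = simulate_null_aurocs()
print(f"mean AUROC over all genes: {sum(aurocs) / len(aurocs):.3f}")
print(f"best AUROC found by searching: {max(aurocs):.3f}")
```

The average sits near 0.5, but the best of the 500 is noticeably higher, which is what someone iterating through genes would end up reporting.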

We could approach this issue in a few different ways:

  1. hold out some data for validation -- only to be used for publication
  2. apply some sort of correction (e.g. Bonferroni)
  3. place strong emphasis on effect sizes
  4. list a clear disclaimer
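As a rough illustration of option 2, a Bonferroni correction multiplies each p-value by the number of tests performed (capped at 1) before comparing it to the significance level. The sketch below assumes we already have one p-value per gene; the gene names and p-values are made up for illustration, and how those p-values would be obtained for a classifier is left open.

```python
def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-adjusted p-values and per-test reject decisions."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]  # scale by number of tests, cap at 1
    reject = [p_adj <= alpha for p_adj in adjusted]
    return adjusted, reject

# Hypothetical p-values for three genes (illustrative only).
genes = ["GENE_A", "GENE_B", "GENE_C"]
raw_p = [0.001, 0.02, 0.04]

adjusted, reject = bonferroni(raw_p, alpha=0.05)
for gene, p, p_adj, r in zip(genes, raw_p, adjusted, reject):
    print(f"{gene}: raw p = {p:.3f}, adjusted p = {p_adj:.3f}, significant = {r}")
```

All three genes would pass an uncorrected 0.05 threshold, but only the first survives the correction. With thousands of mutations Bonferroni becomes very conservative, so a less strict procedure may be worth discussing as well.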

I wanted to open this issue up so we can discuss the importance of the problem and possible solutions.

patrick-miller, Feb 14 '17 02:02