
Deciding which genes to retain for clustering

Open: agitter opened this issue 7 years ago • 1 comment

Following up from #33, in this issue we can discuss how to edit the vignette to provide guidance about which genes to retain for clustering and which genes to remove because they are not differentially expressed over time.

@JohnWSteill wrote in #33:

For my application, I have expression data for all 20k genes, and I'm looking for an "interesting" subset to feed to this algorithm. Filtering out the zero variance helps, but 12k is still too many to play with comfortably.

  • My first filter was to take the highest-variance genes, but that was equivalent to just taking the highest expressors, even if the relative change was small.
  • My second was to take the highest CVs (sd/mean), as you suggested, Thevaa. But that just picked out the genes with the smallest nonzero counts, which have noisy, large fold changes.
  • My "goldilocks" solution was to pick the genes with the highest sd^2/mean. It seems to pick out the genes with non-noisy, largish moves. But there are probably many functions with this balance, and I have no argument that it's the best one.
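
The three filters described above can be compared side by side on a toy example. The following is a minimal sketch, not code from LPWC: the function names `gene_scores` and `top_genes` and the toy profiles are hypothetical, and it assumes non-negative expression values with one value per timepoint. The sd^2/mean score is the quantity usually called the index of dispersion (or Fano factor).

```python
from statistics import mean, variance


def gene_scores(profile):
    """Return the three candidate filter scores for one gene's time series.

    Returns None for genes with zero variance or a non-positive mean,
    mirroring the zero-variance filter applied before ranking.
    """
    m = mean(profile)
    var = variance(profile)  # sample variance (n - 1 denominator)
    if var == 0 or m <= 0:
        return None
    return {
        "variance": var,          # first filter: raw variance
        "cv": var ** 0.5 / m,     # second filter: sd / mean
        "dispersion": var / m,    # "goldilocks" filter: sd^2 / mean
    }


def top_genes(expression, score, top_n):
    """Rank genes by the chosen score and return the top_n gene names."""
    scored = {}
    for gene, profile in expression.items():
        scores = gene_scores(profile)
        if scores is not None:
            scored[gene] = scores[score]
    return sorted(scored, key=scored.get, reverse=True)[:top_n]


# Hypothetical toy data: gene name -> expression at four timepoints.
toy = {
    "high_expressor": [1000, 1100, 900, 1000],  # large counts, small relative change
    "low_count_noisy": [0, 1, 0, 3],            # tiny counts, large fold changes
    "moderate_mover": [50, 100, 200, 50],       # moderate counts, clear change
    "constant": [5, 5, 5, 5],                   # zero variance, always dropped
}

for score in ("variance", "cv", "dispersion"):
    print(score, top_genes(toy, score, top_n=1))
```

On this toy data each score ranks a different gene first, which matches the behavior described in the list: variance favors the high expressor, CV favors the low-count gene, and sd^2/mean favors the gene with a moderate mean and a clear change.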

Our typical approach with LPWC is to use a statistical test for each gene instead of filtering on functions of the mean and variance. We have used edge, EBSeqHMM, and other packages that support time series designs. We could list these packages or recommend a few specific options in the vignette.
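
To illustrate the general idea of testing each gene rather than ranking by mean and variance, here is a minimal sketch. It does not reproduce edge or EBSeqHMM, whose models are designed for time series and differ from this one. As a stand-in it uses a generic permutation test on the one-way ANOVA F statistic followed by a Benjamini-Hochberg adjustment, and it assumes replicate measurements at every timepoint. All function names, the toy data, and the 0.05 threshold are hypothetical.

```python
import random
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F statistic; groups holds the replicates per timepoint."""
    values = [v for g in groups for v in g]
    grand = mean(values)
    k, n = len(groups), len(values)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups) / (k - 1)
    within = sum((v - mean(g)) ** 2 for g in groups for v in g) / (n - k)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def permutation_pvalue(groups, n_perm=999, seed=0):
    """P-value for 'expression differs across timepoints' by shuffling labels."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    sizes = [len(g) for g in groups]
    pooled = [v for g in groups for v in g]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        # Small tolerance so ties are not lost to floating-point rounding.
        if f_statistic(shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical toy data: gene -> three timepoints with three replicates each.
genes = {
    "changing": [[1, 2, 1], [10, 11, 12], [20, 21, 19]],
    "flat": [[5, 6, 5], [6, 5, 6], [5, 5, 6]],
}
pvals = [permutation_pvalue(groups) for groups in genes.values()]
qvals = benjamini_hochberg(pvals)
retained = [name for name, q in zip(genes, qvals) if q < 0.05]
print(retained)
```

Unlike the mean and variance filters, the cutoff here is a false discovery rate rather than an arbitrary number of top-ranked genes, so the size of the retained set follows from the data.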

agitter, Jul 24 '18 12:07

#45 added a placeholder in the vignette for suggestions about how to select differentially expressed genes. We'll follow up with suggestions for specific packages.

agitter, Aug 07 '18 16:08