scanpy

highly deviant genes implementation

Open ktpolanski opened this issue 3 years ago • 2 comments

An implementation of highly deviant gene identification from the 2019 GLMPCA paper. I'm rather fond of the method, as it's a straightforward statistical measure, and comes with significance testing as a form of data-driven cutoff.
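
For context, the feature-selection statistic in the GLM-PCA paper is a per-gene deviance computed against a null in which each gene takes a constant share of every cell's counts. Below is a minimal sketch of the binomial version, assuming a dense cells × genes count matrix. `binomial_deviance` is a hypothetical helper for illustration, not the code in this PR, and the significance-testing step is left out because its null distribution is not spelled out in this thread.

```python
import numpy as np


def binomial_deviance(counts):
    """Per-gene binomial deviance for a dense cells x genes count matrix.

    Hypothetical helper for illustration; not the function added in this PR.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1, keepdims=True)  # total counts per cell, shape (cells, 1)
    pi = counts.sum(axis=0) / n.sum()      # pooled fraction per gene, shape (genes,)
    expected = n * pi                      # expected counts under the null

    def x_log_ratio(x, mu):
        # x * log(x / mu), with the convention 0 * log(0) = 0
        out = np.zeros_like(x)
        mask = x > 0
        out[mask] = x[mask] * np.log(x[mask] / mu[mask])
        return out

    observed_term = x_log_ratio(counts, expected)
    remainder_term = x_log_ratio(n - counts, n - expected)
    return 2.0 * (observed_term + remainder_term).sum(axis=0)


# Toy matrix: gene 0 has a constant share of every cell, genes 1 and 2 alternate.
toy = np.array([[5, 5, 0], [5, 0, 5], [5, 5, 0], [5, 0, 5]])
deviance = binomial_deviance(toy)
print(deviance)
```

Genes whose share of counts never changes across cells score zero, and larger values mean a stronger departure from the constant-proportion null; ranking genes by this value gives the "highly deviant" ordering.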

I put it in a new highly_deviant_genes() function, as:

  • it comes with a number of unique parameters, and there are only so many different algorithms highly_variable_genes() can house
  • the paper argues that highly deviant is different from highly variable

I acknowledge that there are no tests; I'm hoping to get some assistance with that if possible.

ktpolanski avatar Mar 26 '21 10:03 ktpolanski

Codecov Report

Merging #1765 (3569f57) into master (560bd5d) will decrease coverage by 0.38%. The diff coverage is 19.27%.

@@            Coverage Diff             @@
##           master    #1765      +/-   ##
==========================================
- Coverage   71.18%   70.80%   -0.39%     
==========================================
  Files          92       93       +1     
  Lines       11190    11273      +83     
==========================================
+ Hits         7966     7982      +16     
- Misses       3224     3291      +67     
Impacted Files                                  Coverage Δ
scanpy/preprocessing/_highly_deviant_genes.py   18.29% <18.29%> (ø)
scanpy/preprocessing/__init__.py                100.00% <100.00%> (ø)

codecov[bot] avatar Mar 26 '21 10:03 codecov[bot]

I like that this method is fairly simple and could have a meaningful cutoff, but I'd like more evidence of its usefulness before thinking about including it.

I have two main points of concern:

  • Are there examples of this method being used outside of the glmPCA paper? I would at least like to know that reasonable results can be found downstream of this.
  • In the glmPCA paper, the identified genes are highly correlated (~1) with highly expressed genes, and weakly correlated (~0.3) with highly variable gene selection. While I'm not sure which highly variable gene method they compared against, should the low correlation with common practice give us pause?
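
One way to check the second point on a dataset of one's own is to rank-correlate the per-gene scores from each selection criterion. The sketch below assumes three per-gene score vectors are already available. `rank_correlation` is a hypothetical helper, the numbers are made-up toy values, and the choice of a rank correlation is an assumption, since the thread does not say which statistic the paper used.

```python
import numpy as np


def rank_correlation(a, b):
    """Spearman-style correlation of two per-gene score vectors.

    Ties are broken by position rather than averaged, so this is only
    a rough check when many genes share the same score.
    """
    ranks_a = np.argsort(np.argsort(np.asarray(a, dtype=float)))
    ranks_b = np.argsort(np.argsort(np.asarray(b, dtype=float)))
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])


# Toy per-gene scores, invented purely for illustration.
deviance = np.array([120.0, 15.0, 80.0, 3.0, 45.0])
mean_expression = np.array([9.0, 1.5, 6.0, 0.2, 4.0])
dispersion = np.array([0.4, 2.1, 0.3, 1.7, 0.9])

corr_with_expression = rank_correlation(deviance, mean_expression)
corr_with_dispersion = rank_correlation(deviance, dispersion)
print(corr_with_expression, corr_with_dispersion)
```

A correlation near 1 with mean expression would suggest that deviance mostly re-ranks genes by abundance on that dataset, which is the behaviour questioned above.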

@giovp

ivirshup avatar Mar 30 '21 03:03 ivirshup