highly deviant genes implementation
An implementation of highly deviant gene identification from the 2019 GLMPCA paper. I'm rather fond of the method, as it's a straightforward statistical measure, and comes with significance testing as a form of data-driven cutoff.
I put it in a new `highly_deviant_genes()` function, as:

- it comes with a number of unique parameters, and there are only so many different algorithms `highly_variable_genes()` can house
- the paper argues that highly deviant is different from highly variable
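For context, a minimal NumPy/SciPy sketch of the binomial deviance statistic from the GLM-PCA paper, including the chi-squared approximation that gives the data-driven cutoff mentioned above. This is an illustration of the method, not the code in this PR; the function name, signature, and use of a dense array are assumptions.

```python
import numpy as np
from scipy import stats


def binomial_deviance(counts):
    """Per-gene binomial deviance under a null model of a constant
    gene proportion across cells (sketch of the GLM-PCA paper's statistic).

    counts : (n_cells, n_genes) array of raw counts (dense, for illustration).
    Returns (deviance, p_values), each of length n_genes.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1, keepdims=True)   # total counts per cell
    pi = counts.sum(axis=0) / n.sum()       # null per-gene proportions
    mu = n * pi                             # expected counts under the null
    # deviance terms 2*[y*log(y/mu) + (n-y)*log((n-y)/(n-mu))],
    # with the convention 0*log(0) == 0
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = counts * np.log(counts / mu)
        t2 = (n - counts) * np.log((n - counts) / (n - mu))
    dev = 2.0 * (np.nan_to_num(t1) + np.nan_to_num(t2)).sum(axis=0)
    # approximate null distribution: chi-squared with (n_cells - 1) df,
    # which is what allows a significance-based cutoff instead of a fixed n_top
    pvals = stats.chi2.sf(dev, df=counts.shape[0] - 1)
    return dev, pvals
```

A gene whose proportion of each cell's counts is constant gets a deviance near zero; genes whose counts deviate from the constant-proportion null score higher and can be thresholded by p-value rather than by an arbitrary number of top genes.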
I acknowledge that there are no tests, I'm hoping to get some assistance with that if possible.
Codecov Report
Merging #1765 (3569f57) into master (560bd5d) will decrease coverage by 0.38%. The diff coverage is 19.27%.
```diff
@@            Coverage Diff             @@
##           master    #1765      +/-   ##
==========================================
- Coverage   71.18%   70.80%   -0.39%
==========================================
  Files          92       93       +1
  Lines       11190    11273      +83
==========================================
+ Hits         7966     7982      +16
- Misses       3224     3291      +67
```
| Impacted Files | Coverage Δ |
|---|---|
| scanpy/preprocessing/_highly_deviant_genes.py | 18.29% <18.29%> (ø) |
| scanpy/preprocessing/__init__.py | 100.00% <100.00%> (ø) |
I like that this method is fairly simple and could provide a meaningful cutoff, but I think I'd like more evidence of its usefulness before thinking about including it.
I have two main points of concern:
- Are there examples of this method being used outside of the glmPCA paper? I would at least like to know that reasonable results can be found downstream of this.
- In the glmPCA paper, the identified genes are highly correlated (~1) with highly expressed genes, but only weakly correlated (~0.3) with highly variable gene selection. While I'm not sure which highly variable gene method they compared against, should the low correlation with common practice give us pause?

@giovp