glmpca-py
Feature selection using deviance
To help users pre-filter features and remove uninformative ones (which speeds up glmpca), provide a function that computes the deviance of each feature and ranks features from highest to lowest deviance. The current R implementation is here:
https://github.com/willtownes/scrna2019/blob/master/util/functions_genefilter.R
but the functionality is spread across too many functions; it needs to be more modular, with better documentation and tests.
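For reference, here is a minimal NumPy sketch of what such a function could look like. It assumes the binomial deviance against a constant-proportion null model, with observations as rows and features as columns. The names `binomial_deviance` and `rank_by_deviance` are placeholders, not part of glmpca-py or the R code linked above.

```python
import numpy as np


def _xlogx_ratio(x, m):
    """Return x * log(x / m) elementwise, using the convention 0 * log(0) = 0."""
    m = np.broadcast_to(m, x.shape)
    out = np.zeros_like(x, dtype=float)
    mask = x > 0
    out[mask] = x[mask] * np.log(x[mask] / m[mask])
    return out


def binomial_deviance(counts):
    """Per-feature binomial deviance for an (observations x features) count matrix.

    Null model (an assumption of this sketch): each feature takes the same
    proportion pi_j of the total counts in every observation.
    """
    y = np.asarray(counts, dtype=float)
    n = y.sum(axis=1, keepdims=True)            # total counts per observation
    pi = y.sum(axis=0, keepdims=True) / n.sum()  # null proportion per feature
    t1 = _xlogx_ratio(y, n * pi)
    t2 = _xlogx_ratio(n - y, n * (1.0 - pi))
    return 2.0 * (t1 + t2).sum(axis=0)


def rank_by_deviance(counts):
    """Return feature indices ordered from highest to lowest deviance."""
    return np.argsort(-binomial_deviance(counts))


# Toy data: features 0 and 1 vary across observations, feature 2 is constant.
toy = np.array([[5, 0, 5], [0, 5, 5], [5, 0, 5], [0, 5, 5]])
dev = binomial_deviance(toy)
ranked = rank_by_deviance(toy)
```

A feature whose share of counts is identical in every observation gets a deviance of zero, so it ends up at the bottom of the ranking.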
If anyone is interested in this, please note that the deviance functions are now available in the scry Bioconductor package. I'm leaving the issue open to remind myself to port them over to Python.
I've been fiddling around with implementing the deviances in Python, and I think I've landed on a relatively solid foundation: I split the count space into chunks to avoid creating gigantic dense arrays. It ran on a 110k cell by 19k gene space in about a minute, so not too shabby. Given that this repository exists, would it make sense to open a pull request here? However, I've been writing this with Scanpy integration in mind, so I've got observations as rows and features as columns.
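The chunking idea can be sketched roughly as follows. This is not the code from the implementation described above, just an illustration under stated assumptions: binomial deviance, observations as rows, and an input that supports row slicing (a NumPy array, or a sparse matrix whose row slices expose `.toarray()`). The name `chunked_binomial_deviance` and the `chunk_size` parameter are hypothetical.

```python
import numpy as np


def _dense(block):
    """Densify one row block; sparse blocks are assumed to expose .toarray()."""
    if hasattr(block, "toarray"):
        block = block.toarray()
    return np.asarray(block, dtype=float)


def chunked_binomial_deviance(X, chunk_size=1000):
    """Per-feature binomial deviance, densifying only chunk_size rows at a time."""
    n_obs, n_feat = X.shape

    # Pass 1: feature totals, needed for the null proportions pi_j.
    feature_totals = np.zeros(n_feat)
    for start in range(0, n_obs, chunk_size):
        feature_totals += _dense(X[start:start + chunk_size]).sum(axis=0)
    pi = feature_totals / feature_totals.sum()

    # Pass 2: accumulate each chunk's contribution to the deviance.
    deviance = np.zeros(n_feat)
    for start in range(0, n_obs, chunk_size):
        y = _dense(X[start:start + chunk_size])
        n = y.sum(axis=1, keepdims=True)  # total counts per observation
        mu = n * pi                       # expected counts under the null
        with np.errstate(divide="ignore", invalid="ignore"):
            # np.where evaluates both branches, so silence the 0 * log(0) warnings.
            t1 = np.where(y > 0, y * np.log(y / mu), 0.0)
            t2 = np.where(n - y > 0, (n - y) * np.log((n - y) / (n - mu)), 0.0)
        deviance += 2.0 * (t1 + t2).sum(axis=0)
    return deviance


rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(50, 7))
dev_chunked = chunked_binomial_deviance(counts, chunk_size=8)
dev_single = chunked_binomial_deviance(counts, chunk_size=50)
```

Because the deviance is a sum over observations, the chunk size only affects peak memory, not the result.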
Awesome! As I mentioned in #1, I want to change everything to be row-oriented instead of column-oriented as well. Also, the optimizer is out of date (in the R version we now use Avagrad, and it's much faster and more numerically stable). Anyway, the reality is I won't have time to do any of this for several months at best. I don't want to keep your exciting implementation from reaching users sooner. I suggest you just go ahead and put it directly into scanpy, and don't worry too much about integrating with this repository, which is increasingly stale. Please do post a link or something here so I can check it out. I may be able to link to it in the readme as well, if appropriate.
Hi, I was wondering if there was an update on this. Is it possible to get the full matrix of deviance residuals in the Python implementation? Essentially, the output of scry::devianceFeatureSelection from the R package.
I put in a PR to scanpy, to little interest.
https://github.com/ktpolanski/scanpy/blob/deviantgenes/scanpy/preprocessing/_highly_deviant_genes.py
Thanks for asking. Unfortunately, I haven't had the bandwidth to make much progress on this repository. I would recommend the method of @jlause et al., which is very similar to the residuals approximation to GLM-PCA. It uses Pearson rather than deviance residuals, but the difference should not be drastic, and both are equally appropriate asymptotically. Here is a scanpy tutorial demonstrating its use.
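For anyone who wants to see the idea without pulling in scanpy, below is a rough NumPy sketch of Pearson residuals under a negative binomial null model. It is not the scanpy implementation. The function name `pearson_residuals`, the fixed overdispersion `theta=100`, and clipping at the square root of the number of observations are assumptions of this sketch, so check them against the tutorial before relying on them.

```python
import numpy as np


def pearson_residuals(counts, theta=100.0):
    """Pearson residuals for an (observations x features) count matrix.

    Expected counts come from a rank-one null model:
    mu_ij = (row total i) * (column total j) / (grand total).
    The variance is negative binomial: mu + mu**2 / theta.
    """
    y = np.asarray(counts, dtype=float)
    row_totals = y.sum(axis=1, keepdims=True)
    col_totals = y.sum(axis=0, keepdims=True)
    mu = row_totals * col_totals / y.sum()

    with np.errstate(divide="ignore", invalid="ignore"):
        z = (y - mu) / np.sqrt(mu + mu**2 / theta)
    z = np.where(mu > 0, z, 0.0)  # all-zero rows or columns carry no signal

    clip = np.sqrt(y.shape[0])    # assumed clipping threshold
    return np.clip(z, -clip, clip)


rng = np.random.default_rng(1)
counts = rng.poisson(3.0, size=(40, 6))
residuals = pearson_residuals(counts)

# Features with the largest residual variance are candidates for selection.
ranked = np.argsort(-residuals.var(axis=0))
```

As with deviance, a count matrix that exactly matches the null model (every feature a constant share of each observation's total) gives residuals of zero everywhere.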