cell_ranger flavor of highly_variable_genes expects non-logarithmized data?
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of scanpy.
- [x] (optional) I have confirmed this bug exists on the master branch of scanpy.
I believe this may be a bug in the documentation. It says that scanpy.pp.highly_variable_genes expects logarithmized data, except when flavor='seurat_v3'. However, after reading the Zheng17 reference for the Cell Ranger method (in particular, Supplementary Figure 5c), it appears that non-logarithmized data was used for calculating the dispersion. Examining the highly_variable_genes source code, I also note that for flavor='seurat' the data is transformed back out of log space using X = np.expm1(X) before dispersions are computed, but this is not done when flavor='cell_ranger'.
My conclusion is that the documentation should be updated to reflect that when flavor='cell_ranger', non-logarithmized data is expected. But I would very much appreciate clarification on the issue, as it has been a long-standing source of confusion in our lab. Thank you.
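The difference between computing dispersion in log space and in count space can be illustrated with a minimal NumPy sketch. This is not scanpy's implementation: the toy matrix and the `dispersion` helper are made up for illustration, and the helper only computes a plain variance-to-mean ratio per gene.

```python
import numpy as np

# Toy library-size-normalized counts: 4 cells x 2 genes (made-up numbers).
counts = np.array([[0.0, 5.0],
                   [1.0, 5.0],
                   [3.0, 6.0],
                   [8.0, 4.0]])

def dispersion(X):
    """Per-gene variance-to-mean ratio (hypothetical helper, not scanpy's code)."""
    mean = X.mean(axis=0)
    var = X.var(axis=0, ddof=1)
    return var / mean

# What the data looks like after a log1p transform.
logged = np.log1p(counts)

# Dispersion computed directly on the log-transformed values.
disp_log = dispersion(logged)

# Dispersion computed after mapping back to count space with expm1.
disp_counts = dispersion(np.expm1(logged))

print("log space:  ", disp_log)
print("count space:", disp_counts)
```

The two results differ, so whether the input is logarithmized matters whenever the function does not undo the log transform itself.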
I'm confused too. The documentation says that flavor='seurat' and flavor='cell_ranger' both need logarithmized data. Why is the data transformed back out of log space using X = np.expm1(X) if flavor='seurat'? Doesn't expm1(log1p(X)) simply give back X?
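A quick NumPy check of the round trip asked about here, as a sketch with made-up values rather than anything taken from scanpy: np.expm1 is the inverse of np.log1p, so applying it to log1p-transformed input recovers the values from before the log step, which are not the same as the log-transformed values the function was given.

```python
import numpy as np

X = np.array([0.0, 1.0, 10.0, 250.0])  # made-up normalized counts
X_log = np.log1p(X)                    # log-transformed values, as passed in
X_back = np.expm1(X_log)               # values after undoing the log transform

# expm1 undoes log1p up to floating-point error.
roundtrip_ok = np.allclose(X_back, X)

# The recovered values are not the log-transformed input.
differs_from_log = not np.allclose(X_back, X_log)

print(roundtrip_ok, differs_from_log)
```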
Hi, same confusion here.
According to: https://github.com/scverse/scanpy/issues/969#issuecomment-629667682
If I set flavor='cell_ranger', does this mean I should not call sc.pp.log1p(adata), so that the library-size-normalized counts (not log-transformed) are used?