cell_ranger flavor of highly_variable_genes expects non-logarithmized data?

Open mjhubisz opened this issue 5 years ago • 1 comments

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of scanpy.
  • [x] (optional) I have confirmed this bug exists on the master branch of scanpy.

I believe this may be a bug in the documentation. It says that scanpy.pp.highly_variable_genes expects logarithmized data, except when flavor='seurat_v3'. However, after reading the reference Zheng17 for the Cell Ranger method (in particular, Supplementary Figure 5c), it appears that non-logarithmized data was used for calculating the dispersion. Examining the highly_variable_genes source code, I also note that for flavor='seurat', the data is transformed back out of log space using X=np.expm1(X) before computing dispersions, but this is not done when flavor='cell_ranger'.
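To make the distinction concrete, here is a minimal sketch of why the scale matters. It uses only the standard library, the counts are made up, and the `dispersion` helper is a hypothetical variance-to-mean ratio written for illustration, not scanpy's implementation:

```python
import math
import statistics


def dispersion(values):
    """Variance-to-mean ratio of one gene across cells (sample variance)."""
    return statistics.variance(values) / statistics.fmean(values)


# Hypothetical normalized counts for a single gene across six cells.
counts = [0.0, 1.0, 2.0, 4.0, 8.0, 16.0]

# The same gene after a log1p transform.
logged = [math.log1p(c) for c in counts]

disp_counts = dispersion(counts)  # dispersion on the count scale
disp_logged = dispersion(logged)  # dispersion on the log scale

print(f"count scale: {disp_counts:.3f}, log scale: {disp_logged:.3f}")
```

The two values differ substantially, so whether the input is logarithmized changes the dispersions and, potentially, which genes are ranked as highly variable.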

My conclusion is that the documentation should be updated to state that non-logarithmized data is expected when flavor='cell_ranger'. I would very much appreciate clarification, though, as this has been a long-standing source of confusion in our lab. Thank you.

mjhubisz avatar Dec 15 '20 17:12 mjhubisz

I'm confused too. The documentation says that flavor='seurat' and flavor='cell_ranger' both need logarithmized data. Why is the data transformed back out of log space using X=np.expm1(X) if flavor='seurat'? Doesn't expm1(log1p(X)) just return X, so that the log transform has no effect?
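The round trip in question can be checked numerically. This is a minimal sketch using only the standard library, with made-up values; it is not scanpy's code. It shows that expm1 inverts log1p, and that a mean taken on the count scale differs from one taken on the log scale and then transformed back:

```python
import math
import statistics

# Hypothetical normalized counts for one gene across five cells.
counts = [0.0, 1.0, 2.5, 10.0, 250.0]

# log1p followed by expm1 recovers the original values (up to rounding).
logged = [math.log1p(c) for c in counts]
recovered = [math.expm1(v) for v in logged]
round_trip_ok = all(
    math.isclose(c, r, rel_tol=1e-12, abs_tol=1e-12)
    for c, r in zip(counts, recovered)
)

# Averaging after the back-transform is not the same as
# back-transforming an average taken in log space.
mean_on_counts = statistics.fmean(recovered)
mean_via_log = math.expm1(statistics.fmean(logged))

print(round_trip_ok, mean_on_counts, mean_via_log)
```

So the back-transform does undo the log for the purpose of computing the statistics, which would then be on the count scale even though the input was logarithmized.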

huang-sh avatar Aug 22 '22 08:08 huang-sh

Hi, same confusion here. According to https://github.com/scverse/scanpy/issues/969#issuecomment-629667682, if I set flavor='cell_ranger', does it mean I should not call sc.pp.log1p(adata), so that the "library size normalized counts" (not logged) are used?

dyinboisry4u avatar Apr 10 '23 06:04 dyinboisry4u