
Add counts layer automatically

Open mbuttner opened this issue 2 years ago • 8 comments

  • [ ] Additional function parameters / changed functionality / changed defaults?

Hi there, it would be great to have the counts layer added automatically to the AnnData object when the data is normalised. Could this be added to the sc.pp.normalize_total function?

...

mbuttner avatar May 23 '22 11:05 mbuttner

Before normalization, you can do adata.layers['counts'] = adata.X.copy() to store the counts of all genes in the layers.
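For illustration, here is a minimal sketch of that workflow using a plain NumPy array as a stand-in for adata.X and a dict as a stand-in for adata.layers (so it runs without anndata or scanpy installed); the in-place row normalization mimics what total-count normalization does to .X:

```python
import numpy as np

# Stand-in for adata.X: a small raw counts matrix.
X = np.array([[1.0, 3.0], [2.0, 2.0]])
layers = {}  # stand-in for adata.layers

# Snapshot the raw counts BEFORE normalizing, with an explicit copy.
layers["counts"] = X.copy()

# In-place total-count normalization (each row sums to 1 afterwards).
X /= X.sum(axis=1, keepdims=True)

print(layers["counts"][0, 0])  # raw counts preserved
print(X[0, 0])                 # normalized value
```

Without the explicit `.copy()`, `layers["counts"]` would reference the same array as `X` and the in-place division would overwrite the stored counts too.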

wubaosheng avatar Jun 12 '22 01:06 wubaosheng

@gtca has wanted this too (e.g. https://github.com/scverse/anndata/issues/706)

@gtca, we've talked about implementing this, and what the API could look like. Did we write more than the referenced issue?

ivirshup avatar Jun 15 '22 21:06 ivirshup

@ivirshup I don't think so, unless there's work towards https://github.com/scverse/anndata/issues/244.

To follow the ideas in https://github.com/scverse/anndata/issues/706, seems like the steps would be:

  • [ ] add an attribute ._X_layer to store which layer .X references;
  • [ ] use .X to reference .layers[._X_layer];
  • [ ] add in_layer= and out_layer= arguments to scanpy's .pp functions;
  • [ ] these functions will also alter ._X_layer.

The second-to-last point can actually be implemented irrespective of the AnnData change, since in_layer=None would mean taking .X. The question is, should we consider changing the defaults right away, e.g. in_layer="counts", out_layer="lognorm"?
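The proposed in_layer=/out_layer= behaviour could be sketched as below. This is purely hypothetical (not scanpy's actual signature), and a plain dict stands in for an AnnData object:

```python
import numpy as np

def normalize_total(adata, in_layer=None, out_layer=None, target_sum=1.0):
    """Hypothetical sketch of the proposed in_layer=/out_layer= API."""
    # in_layer=None means: read from .X, as scanpy's .pp functions do today.
    X = adata["X"] if in_layer is None else adata["layers"][in_layer]
    norm = X / X.sum(axis=1, keepdims=True) * target_sum
    # out_layer=None means: write the result back to .X.
    if out_layer is None:
        adata["X"] = norm
    else:
        adata["layers"][out_layer] = norm

adata = {"X": np.array([[1.0, 3.0], [2.0, 2.0]]), "layers": {}}
adata["layers"]["counts"] = adata["X"].copy()

# Read from the "counts" layer, write to a named output layer;
# the raw counts are left untouched.
normalize_total(adata, in_layer="counts", out_layer="norm", target_sum=4.0)
```

With this shape of API, the raw counts survive normalization by construction, instead of relying on every user remembering to copy .X first.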

gtca avatar Jun 15 '22 23:06 gtca

Related question - is it necessary to do

adata.layers['counts']=adata.X.copy()

or is:

adata.layers['counts']=adata.X

sufficient?

carmensandoval avatar Feb 02 '23 06:02 carmensandoval

@carmensandoval, there's no implicit copying, so one should make a copy explicitly.

Please see the following code for more details:

import numpy as np
from anndata import AnnData
from jax import random

adata = AnnData(X=np.array(random.normal(random.PRNGKey(0), (100, 10))))

print(id(adata.X))
# => 5393766064

adata.layers["X"] = adata.X
adata.layers["X_copy"] = adata.X.copy()

for layer in ("X", "X_copy"):
    print(f"{layer}: ", id(adata.layers[layer]))
# => X:  5393766064
# => X_copy:  5393773552

print(adata.X[0, 0])
# => -1.5721827

adata.X[0, 0] = 0.0
for layer in ("X", "X_copy"):
    print(f"{layer}: ", adata.layers[layer][0, 0])
# => X:  0.0
# => X_copy:  -1.5721827

gtca avatar Feb 02 '23 13:02 gtca

As far as I can tell, any further downstream operations also act on layers... so it is not useful to store raw counts there, since they will just be modified by count normalization, log normalization, etc. Storing things in layers sequentially, I just end up with a bunch of layers that all are identically fully processed, rather than preserving the raw state of the counts matrix. Not sure if this is new behavior, but it is super frustrating.

cdpolt avatar Apr 22 '24 18:04 cdpolt

@cdpolt, is there a specific change ("new behavior") you're referring to?

Storing things in layers sequentially, I just end up with a bunch of layers that all are identically fully processed

Would the code in the previous message be helpful for understanding why that happens and how to fix it?

gtca avatar Apr 22 '24 23:04 gtca

@gtca Yes, I now see the point about how explicit copying prevents further modification. Thanks, that's perfect!

cdpolt avatar Apr 22 '24 23:04 cdpolt