scanpy
Add counts layer automatically
- [ ] Additional function parameters / changed functionality / changed defaults?
Hi there,
it would be great to have the counts layer added automatically to the AnnData object when data is normalised. Could this be added to the sc.pp.normalize_total function?
...
Before normalization, you can run adata.layers['counts'] = adata.X.copy() to store the counts of all genes in a layer.
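A minimal sketch of that workflow, written with plain NumPy so it runs without scanpy. The `layers` dict stands in for `adata.layers`, and `normalize_total_inplace` is a hypothetical helper that only imitates in-place total-count scaling; it is not scanpy's implementation.

```python
import numpy as np


def normalize_total_inplace(X, target_sum=1e4):
    """Hypothetical stand-in: scale each row (cell) in place so it sums to target_sum."""
    totals = X.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid dividing by zero for empty cells
    X *= target_sum / totals
    return X


X = np.array([[1.0, 3.0], [2.0, 2.0], [0.0, 5.0]])
layers = {}  # stand-in for adata.layers

# Save the raw counts first; the explicit copy keeps them independent of X.
layers["counts"] = X.copy()

normalize_total_inplace(X, target_sum=10.0)

print(X.sum(axis=1))     # every row now sums to 10
print(layers["counts"])  # still the original counts
```

Without the `.copy()`, `layers["counts"]` would be the same array as `X` and would be scaled along with it.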
@gtca has wanted this too (e.g. https://github.com/scverse/anndata/issues/706)
@gtca, we've talked about implementing this, and what the API could look like. Did we write more than the referenced issue?
@ivirshup I don't think so, unless there's work towards https://github.com/scverse/anndata/issues/244.
To follow the ideas in https://github.com/scverse/anndata/issues/706, seems like the steps would be:
- [ ] add an attribute ._X_layer to store which layer .X references;
- [ ] use .X to reference .layers[._X_layer];
- [ ] add in_layer= and out_layer= arguments to scanpy's .pp functions;
- [ ] these functions will also alter ._X_layer.
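The first two steps could look roughly like the sketch below. `LayeredData` is a hypothetical toy container, not AnnData; it only illustrates the idea of `.X` being a reference into `.layers` chosen by `._X_layer`.

```python
import numpy as np


class LayeredData:
    """Hypothetical container (not AnnData): .X resolves to one named layer."""

    def __init__(self, X, layer_name="counts"):
        self.layers = {layer_name: X}
        self._X_layer = layer_name  # which layer .X currently references

    @property
    def X(self):
        return self.layers[self._X_layer]

    @X.setter
    def X(self, value):
        # Assigning to .X replaces the layer it currently points at.
        self.layers[self._X_layer] = value


data = LayeredData(np.array([[1.0, 2.0], [3.0, 4.0]]))

# A processing step writes a new layer and re-points .X at it.
data.layers["lognorm"] = np.log1p(data.layers["counts"])
data._X_layer = "lognorm"

print(data.X is data.layers["lognorm"])
print(data.layers["counts"])  # raw counts are still available
```

With this design the raw counts are never overwritten; a function only changes which layer `.X` resolves to.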
The second to last point can actually be implemented irrespective of the AnnData change, as in_layer=None will mean taking .X.
The question is, should we consider changing the defaults right away, e.g. in_layer="counts", out_layer="lognorm"?
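To make the `in_layer=` / `out_layer=` idea concrete, here is a hedged sketch. `log1p_layer` is a hypothetical function, not part of scanpy, and a `SimpleNamespace` with `.X` and `.layers` stands in for an AnnData object. `None` falls back to `.X`, as described above.

```python
import numpy as np
from types import SimpleNamespace


def log1p_layer(data, in_layer=None, out_layer=None):
    """Hypothetical: read from in_layer (or .X if None), write to out_layer (or .X if None)."""
    src = data.X if in_layer is None else data.layers[in_layer]
    result = np.log1p(src)  # returns a new array, so the source is untouched
    if out_layer is None:
        data.X = result
    else:
        data.layers[out_layer] = result
    return data


counts = np.array([[0.0, 1.0], [3.0, 7.0]])
data = SimpleNamespace(X=counts, layers={"counts": counts.copy()})

# Both arguments None: read .X and write the result back to .X.
log1p_layer(data)

# Explicit layers: read the raw counts, store the result under a new name.
log1p_layer(data, in_layer="counts", out_layer="lognorm")

print(sorted(data.layers))
```

Changing the defaults to `in_layer="counts", out_layer="lognorm"` would make the second call the implicit behavior, which is the trade-off the question raises.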
Related question: is it necessary to do adata.layers['counts']=adata.X.copy(), or is adata.layers['counts']=adata.X sufficient?
@carmensandoval, there's no implicit copying so one should make a copy explicitly.
Please see the following code for more details:
import numpy as np
from anndata import AnnData
from jax import random
adata = AnnData(X=np.array(random.normal(random.PRNGKey(0), (100, 10))))
print(id(adata.X))
# => 5393766064
adata.layers["X"] = adata.X
adata.layers["X_copy"] = adata.X.copy()
for layer in ("X", "X_copy"):
    print(f"{layer}: ", id(adata.layers[layer]))
# => X: 5393766064
# => X_copy: 5393773552
print(adata.X[0, 0])
# => -1.5721827
adata.X[0, 0] = 0.0
for layer in ("X", "X_copy"):
    print(f"{layer}: ", adata.layers[layer][0, 0])
# => X: 0.0
# => X_copy: -1.5721827
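Comparing `id()` values detects when two names refer to the same array object, but a view onto the same buffer has a different `id()` and is still modified together with the original. As a complementary check, `np.shares_memory` looks at the underlying buffer. A small NumPy-only sketch, with a plain dict standing in for `adata.layers`:

```python
import numpy as np

X = np.arange(6, dtype=float).reshape(2, 3)

# Plain dict standing in for adata.layers (an assumption for illustration).
layers = {"alias": X, "view": X[:], "copy": X.copy()}

for name, arr in layers.items():
    print(name, "| same object:", arr is X, "| shares memory:", np.shares_memory(arr, X))
# alias | same object: True | shares memory: True
# view | same object: False | shares memory: True
# copy | same object: False | shares memory: False

X[0, 0] = -1.0  # in-place edit of X

for name, arr in layers.items():
    print(name, arr[0, 0])
# alias -1.0
# view -1.0
# copy 0.0
```

Only the explicit copy is protected from later in-place changes to `X`.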
So far as I can tell, any further downstream operations also act on layers, so it is not useful to store raw counts there since they will just be modified by counts normalization, log normalization, etc. Storing things in layers sequentially, I just end up with a bunch of layers that all are identically fully processed rather than preserving the rawer state of the counts matrix. Not sure if this is new behavior, but it is super frustrating.
@cdpolt, is there a specific change ("new behavior") you're referring to?
> Storing things in layers sequentially, I just end up with a bunch of layers that all are identically fully processed
Would the code in the previous message be helpful to understand why that happens and how to fix that?
@gtca Yes, I now see the point about explicit copying preventing further modification. Thanks, that's perfect.
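The "identical layers" symptom described in this thread can be reproduced with a small NumPy-only sketch. It assumes every processing step modifies the matrix in place; whether a given scanpy function does so depends on the function and its arguments. `snapshot_pipeline` is a hypothetical helper, and the dict stands in for `adata.layers`.

```python
import numpy as np


def snapshot_pipeline(X, copy):
    """Store X in a dict before/after each in-place step; `copy` toggles explicit copying."""
    layers = {}
    layers["counts"] = X.copy() if copy else X
    X /= X.sum(axis=1, keepdims=True)  # in-place normalization to row fractions
    layers["norm"] = X.copy() if copy else X
    np.log1p(X, out=X)  # in-place log transform
    layers["lognorm"] = X.copy() if copy else X
    return layers


raw = np.array([[1.0, 3.0], [2.0, 2.0]])

aliased = snapshot_pipeline(raw.copy(), copy=False)
copied = snapshot_pipeline(raw.copy(), copy=True)

# Without copies, every layer is the same fully processed array.
print(aliased["counts"] is aliased["lognorm"])

# With copies, each layer keeps the state it had when it was stored.
print(np.array_equal(copied["counts"], raw))
```

The only difference between the two runs is the `.copy()` at the moment each layer is stored.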