
Modifying a subset of AnnData using the .iloc/.loc method does not make a new copy, and the original object is modified

Open crazyxiaoj opened this issue 11 months ago • 10 comments

Please make sure these conditions are met

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of anndata.
  • [ ] (optional) I have confirmed this bug exists on the master branch of anndata.

Report

When using the .iloc or .loc methods to modify a subset of an AnnData object, it seems that no new copy is created; instead, the original AnnData object is directly modified.

Code:

from anndata import AnnData
import numpy as np

a = AnnData(X=np.arange(16).reshape(4,4), var=list('ABCD'), obs=list('abcd'))
b = a[:2,:2]
b.obs.iloc[:,:] = 0  # the same results using the .loc method.
b
# View of AnnData object with n_obs × n_vars = 2 × 2
#     obs: 0
#     var: 0
a.obs
#    0
# 0  0
# 1  0
# 2  c
# 3  d

As a beginner, I'm not sure if this behavior is a bug or by design. Could someone clarify whether this is intentional, and if so, could you please explain why it functions this way? Thanks for your assistance!

Versions

| Package | Version |
| ------- | ------- |
| pandas  | 2.2.3   |
| anndata | 0.11.3  |
| numpy   | 2.1.3   |

| Dependency         | Version     |
| ------------------ | ----------- |
| Pygments           | 2.18.0      |
| matplotlib         | 3.9.3       |
| defusedxml         | 0.7.1       |
| traitlets          | 5.14.3      |
| stack_data         | 0.6.3       |
| decorator          | 5.1.1       |
| jaraco.text        | 3.12.1      |
| six                | 1.17.0      |
| charset-normalizer | 3.4.0       |
| scipy              | 1.14.1      |
| pillow             | 11.0.0      |
| pyparsing          | 3.2.0       |
| session-info2      | 0.1.2       |
| platformdirs       | 4.3.6       |
| packaging          | 24.2        |
| h5py               | 3.12.1      |
| jaraco.collections | 5.1.0       |
| jaraco.context     | 5.3.0       |
| setuptools         | 75.6.0      |
| natsort            | 8.4.0       |
| cycler             | 0.12.1      |
| asttokens          | 3.0.0       |
| parso              | 0.8.4       |
| python-dateutil    | 2.9.0.post0 |
| kiwisolver         | 1.4.7       |
| jedi               | 0.19.2      |
| prompt_toolkit     | 3.0.48      |
| ipython            | 8.30.0      |
| pytz               | 2024.1      |
| pure_eval          | 0.2.3       |
| more-itertools     | 10.3.0      |
| pickleshare        | 0.7.5       |
| jaraco.functools   | 4.0.1       |
| wcwidth            | 0.2.13      |
| executing          | 2.1.0       |

| Component | Info                                                                            |
| --------- | ------------------------------------------------------------------------------- |
| Python    | 3.13.1 \| packaged by conda-forge \| (main, Dec  5 2024, 21:23:54) [GCC 13.3.0] |
| OS        | Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.31                   |
| Updated   | 2025-01-27 11:47                                                                |

crazyxiaoj avatar Jan 27 '25 11:01 crazyxiaoj

Hello, I am also trying to subset an AnnData object using some obs values. The dataset is the "Peaks_RNA.loom" found here. Specifically, I have an AnnData object called data and I want to subset it according to the obs columns "Method" and "Tissue". Here is the code:

import os
import scanpy as sc
path = ...  # path to loom file
data = sc.read_loom(path)
print(data.shape)  # returns 526094 × 59480
subset_data = data[data.obs["Method"] == "rnaXatac"]
# Note: the next line indexes `data` again, so it replaces the "Method" subset above rather than refining it.
subset_data = data[data.obs["Tissue"].isin(["Cerebellum", "Brain"])]
print(subset_data)  # View of AnnData object with n_obs × n_vars = 44333 × 59480 ...

However, I want it to be a proper AnnData object so that I can save it to an h5ad file. How can I do it? I am using Python 3.9.6. The environment is described below:

anndata==0.10.8
annoy==1.17.3
array_api_compat==1.9.1
bbknn==1.6.0
cellrank==2.0.6
click==8.1.7
contourpy==1.3.0
cycler==0.12.1
Cython==3.0.11
dnspython==2.7.0
docrep==0.3.2
et_xmlfile==2.0.0
exceptiongroup==1.2.2
fcsparser==0.2.8
filelock==3.16.1
fonttools==4.54.1
fsspec==2024.12.0
future==1.0.0
get-annotations==0.1.2
h5py==3.12.1
harmonypy==0.0.10
hyperopt==0.1.2
igraph==0.11.8
importlib_metadata==8.5.0
importlib_resources==6.4.5
jax==0.4.30
jaxlib==0.4.30
jaxopt==0.8.3
Jinja2==3.0.3
joblib==1.4.2
kiwisolver==1.4.7
legacy-api-wrap==1.4
leidenalg==0.10.2
llvmlite==0.43.0
loompy==3.0.7
louvain==0.8.2
markdown-it-py==3.0.0
MarkupSafe==3.0.2
matplotlib==3.9.2
mdurl==0.1.2
mellon==1.5.0
ml_dtypes==0.5.1
mofapy2==0.7.2
mpmath==1.3.0
mudata==0.2.4
muon==0.1.6
natsort==8.4.0
networkx==3.2.1
numba==0.60.0
numpy==1.26.4
numpy-groupies==0.11.2
openpyxl==3.1.5
opt_einsum==3.4.0
packaging==24.1
palantir==1.3.6
pandas==2.2.3
patsy==0.5.6
petsc==3.22.0
petsc4py==3.22.0
pillow==11.0.0
progressbar2==4.5.0
protobuf==5.29.0
pygam==0.9.1
Pygments==2.19.1
pygpcca==1.0.4
pymongo==4.10.1
pynndescent==0.5.13
pyparsing==3.2.0
pysam==0.22.1
python-dateutil==2.9.0.post0
python-utils==3.9.0
pytz==2024.2
rich==13.9.4
scanpy==1.10.3
scikit-learn==1.5.2
scikit-misc==0.3.1
scipy==1.11.4
scvelo @ git+https://github.com/theislab/scvelo@22b6e7e6cdb3c321c5a1be4ab2f29486ba01ab4f
scvi==0.6.8
scvi-colab==0.12.0
seaborn==0.13.2
session-info==1.0.0
six==1.16.0
slepc==3.22.1
slepc4py==3.22.1
statsmodels==0.14.4
stdlib-list==0.11.0
sympy==1.13.1
texttable==1.7.0
threadpoolctl==3.5.0
torch==2.5.1
tqdm==4.66.6
typing_extensions==4.12.2
tzdata==2024.2
umap-learn==0.5.7
wrapt==1.16.0
xlrd==2.0.1
zipp==3.20.2

AlessiaLeclercq avatar Jan 28 '25 16:01 AlessiaLeclercq

> When using the .iloc or .loc methods to modify a subset of an AnnData object, it seems that no new copy is created; instead, the original AnnData object is directly modified.

@crazyxiaoj as far as I can tell, this behavior is totally expected. A view is just that, a view. So if you edit the view, you'll edit the actual object. It might be worth disallowing this completely, but there are probably cases where the behavior is desirable.

> However, I want it to be a proper AnnData object so that I can save it to an h5ad file.

@AlessiaLeclercq You may be able to write the view directly with the object you have (this is possible). If not, you can certainly create a copy first via copy, i.e., adata.copy(): https://anndata.readthedocs.io/en/latest/generated/anndata.AnnData.copy.html

import anndata as ad
import numpy as np

adata = ad.AnnData(X=np.array([[1, 2], [3, 4]]))
adata[:1,:].write_h5ad("foo.h5ad") # works, but also `.copy` is fine

ilan-gold avatar Jan 30 '25 16:01 ilan-gold

> > When using the .iloc or .loc methods to modify a subset of an AnnData object, it seems that no new copy is created; instead, the original AnnData object is directly modified.
>
> @crazyxiaoj as far as I can tell, this behavior is totally expected. A view is just that, a view. So if you edit the view, you'll edit the actual object. It might be worth disallowing this completely, but there are probably cases where the behavior is desirable.

Your explanation is a bit unclear to me. I referred to the content on the following webpage: https://anndata.readthedocs.io/en/stable/generated/anndata.AnnData.html.

Here’s the relevant excerpt:

> Copying a view causes an equivalent “real” AnnData object to be generated. Attempting to modify a view (at any attribute except X) is handled in a copy-on-modify manner, meaning the object is initialized in place.

Based on the paragraph above, it appears that modifying properties like obs results in the creation of a new AnnData object. Additionally, I noticed that performing an assignment directly using [], rather than the iloc method, also triggers the creation of a new object.

crazyxiaoj avatar Jan 30 '25 16:01 crazyxiaoj

> Based on the paragraph above, it appears that modifying properties like obs results in the creation of a new AnnData object. Additionally, I noticed that performing an assignment directly using [], rather than the iloc method, also triggers the creation of a new object.

Thanks for sharing this. The issue here would be wrapping every single dataframe method. I'm not sure why this wasn't done initially since only drop was wrapped. I was aware of the "copy-on-write" paradigm but I thought the promise was more shallow than this i.e., affecting things only like columns or keys. We should compile a list of things here, I suppose:

  1. set_index (although this one is very bad for other reasons)
  2. loc
  3. iloc
  4. insert
  5. pop
  6. drop_duplicates
  7. rename_axis

and many more. This might be why it wasn't done, so it's possible we should carve out an exception for pandas.

ilan-gold avatar Jan 30 '25 16:01 ilan-gold

Thank you for your clarification. I think I'm beginning to understand.

Do you still believe it's necessary to open this issue? If you feel it is no longer needed, we can consider closing this issue.

crazyxiaoj avatar Jan 30 '25 16:01 crazyxiaoj

> Do you still believe it's necessary to open this issue? If you feel it is no longer needed, we can consider closing this issue.

Well, it is certainly an inconsistency, so it seems we should either edit the docs or add the feature set. I've asked @ivirshup to weigh in.

ilan-gold avatar Jan 30 '25 16:01 ilan-gold

Thank you!

AlessiaLeclercq avatar Jan 31 '25 13:01 AlessiaLeclercq

One possible solution after discussion:

Disallow all "view" operations, i.e., stop subclassing views from the actual object and just wrap the object along with the needed index. This would stop people from using iloc on a view at all, forcing them to bring the data into memory.
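To make the proposal concrete, here is a rough sketch of what "wrap the object along with the needed index" could look like for a single DataFrame. The class `LazySubset` and its `to_memory` method are hypothetical names invented for this illustration; nothing here is anndata's actual design or API.

```python
import pandas as pd


class LazySubset:
    """Hypothetical wrapper: holds the parent frame and a row index, nothing else."""

    def __init__(self, parent: pd.DataFrame, rows: slice):
        self._parent = parent
        self._rows = rows

    def to_memory(self) -> pd.DataFrame:
        # Materialize an independent copy; edits to it cannot reach the parent.
        return self._parent.iloc[self._rows].copy()


obs = pd.DataFrame({"label": list("abcd")})
subset = LazySubset(obs, slice(0, 2))

# The wrapper is not a DataFrame subclass, so it exposes no .iloc/.loc to misuse.
exposes_iloc = hasattr(subset, "iloc")

in_memory = subset.to_memory()
in_memory.iloc[:, :] = "z"

print(exposes_iloc, obs["label"].tolist(), in_memory["label"].tolist())
```

Because the wrapper does not inherit from `pd.DataFrame`, the only route to editing goes through an explicit in-memory copy.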

ilan-gold avatar Feb 27 '25 17:02 ilan-gold

loc and iloc are both properties, which complicates things quite a bit. They return objects that are not part of the public pandas API, so we would have to wrap those objects' __setitem__ methods or, more likely, override the object itself.
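The point about properties can be inspected directly in pandas. The sketch below only looks at a plain DataFrame and does not involve anndata's view classes; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"label": list("abcd")})

# On the class, `loc` and `iloc` are property objects rather than methods.
loc_is_property = isinstance(pd.DataFrame.loc, property)
iloc_is_property = isinstance(pd.DataFrame.iloc, property)

# Accessing them on an instance returns an indexer object whose class is
# private to pandas (its name starts with an underscore).
loc_type = type(df.loc).__name__
iloc_type = type(df.iloc).__name__

# An assignment like `df.iloc[0, 0] = ...` is handled by the indexer's
# __setitem__, not by DataFrame.__setitem__.
indexer_has_setitem = hasattr(type(df.iloc), "__setitem__")

print(loc_is_property, iloc_is_property)
print(loc_type, iloc_type, indexer_has_setitem)
```

This is why wrapping `DataFrame.__setitem__` alone does not intercept `.loc`/`.iloc` assignments: the write happens on a separate indexer object.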

ilan-gold avatar Apr 08 '25 08:04 ilan-gold

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!

github-actions[bot] avatar Jun 09 '25 03:06 github-actions[bot]