Reading and writing anndata in chunks
Hi, I have been trying to create an AnnData h5ad file with the anndata package from multiple 10X experiments. I have separate, very large files for each experiment and I would like to merge them into one single file. It's the same as concatenating different h5ads, but the problem is that I can't load all of the `X` matrices into memory. Is there a way to read the data from h5, csv, or h5ad files in chunks and keep writing each chunk to a file on disk before loading the next one, just like how h5py's resize works? That is, keeping the h5ad in append mode, resizing it, and writing it to disk, so that data which has already been written to the file doesn't need to stay in memory. Is there an option like this? Please help me.
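For anyone skimming: the plain-h5py pattern the question alludes to (a resizable dataset that grows as chunks are appended) looks roughly like the sketch below. This is a minimal dense example with made-up file and dataset names; it doesn't handle sparse matrices or the h5ad layout, which is what the rest of the thread addresses:

```python
import h5py
import numpy as np

n_cols = 100  # assumed: all chunks share the same number of columns
with h5py.File("merged.h5", "w") as f:
    # maxshape=(None, n_cols) makes the row axis resizable
    dset = f.create_dataset(
        "X", shape=(0, n_cols), maxshape=(None, n_cols), dtype="float32"
    )
    # stand-in for reading real chunks from separate files
    for chunk in (np.random.rand(10, n_cols) for _ in range(3)):
        dset.resize(dset.shape[0] + chunk.shape[0], axis=0)
        dset[-chunk.shape[0]:] = chunk  # only this chunk is in memory
```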
@davidsebfischer has a solution for that. It wasn’t generally applicable, but I forgot why. In any case, he could probably outline to you how to do it with a few lines of code here.
Thank you @flying-sheep for your reply. @davidsebfischer, please let me know your suggestions on this problem. Thank you.
@LisaSikkema and @cdedonno this may be relevant for you guys soon as well.
@flying-sheep do you want this as an anndata functionality (PR) or as a code snippet (example code)?
For now as a code snippet here to help @veda11391 fast, but if we can include it in anndata in a generic enough way, it would be great to have it there in the long run.
Thank you for the reply, @davidsebfischer and @flying-sheep. @davidsebfischer, could you please guide me on how I can achieve this? I was going through the scripts, and any guidance would be really helpful.
@flying-sheep, I think we've already got the building blocks for this – out-of-core concatenation and array-format conversion. I've made some progress on adding this to `concatenate`, but have other work which is taking priority for now. Maybe in a month or so?
Sure, but can we have some code block in this issue that helps someone stumbling upon this issue?
Sure, I had thought @davidsebfischer had something for that? Pending that:
@veda11391, what data do you need to concatenate? Is it just AnnData objects with `X`, `obs`, and `var`?
@veda11391, this should work, but has some conditions:
- I'm assuming all of your data has the same `var_names`
- I'm assuming all of your `X`s are in `csr` format
- This will work now, but everything in private modules is subject to change.
Generating some example data:
```python
from pathlib import Path

import anndata as ad
import h5py
import pandas as pd
from scipy import sparse

tmp_dir = Path("tmp_h5ads/")
tmp_dir.mkdir(exist_ok=True)

# Write five small example AnnDatas to disk, then reopen them in
# backed mode so each X stays on disk
backed_adatas = []
for prefix in list("abcde"):
    pth = tmp_dir / f"{prefix}.h5ad"
    adata = ad.AnnData(
        sparse.random(100, 100, density=0.3, format="csr"),
        obs=pd.DataFrame(index=[f"{prefix}-cell{i}" for i in range(100)])
    )
    adata.write(pth)
    backed_adatas.append(ad.read_h5ad(pth, backed="r"))
```
And the actual logic:
```python
with h5py.File("concatenated.h5ad", "w") as f:
    # Write an empty CSR matrix with the right number of columns,
    # then wrap it so we can append to it on disk
    ad._io.h5ad.write_attribute(f, "X", sparse.csr_matrix((0, 100)))
    base_dset = ad._core.sparse_dataset.SparseDataset(f["X"])
    # Concatenate adatas: append each backed X without fully loading it
    for adata in backed_adatas:
        base_dset.append(adata.X)
    # obs is the concatenation of all obs; var is shared, so take the first
    ad._io.h5ad.write_attribute(f, "obs", pd.concat((a.obs for a in backed_adatas)))
    ad._io.h5ad.write_attribute(f, "var", backed_adatas[0].var)
```
Hey @ivirshup, did this ever make its way into anndata? If not, is the internal `_core` API stable enough to depend upon?
Thank you @ivirshup for the solution. Just wanted to add that if you are trying this with anndata version 0.9.1, you need to replace `write_attribute` with `write_elem`.
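For readers arriving later: newer anndata releases (0.10+, if I remember correctly) ship an experimental out-of-core concatenation that covers this use case directly, so the private-module workaround above may no longer be necessary. A hedged sketch, assuming `tmp_h5ads/` holds the files to merge:

```python
from pathlib import Path

import anndata as ad

paths = sorted(Path("tmp_h5ads/").glob("*.h5ad"))
# Concatenates on-disk AnnDatas without loading every X into memory
ad.experimental.concat_on_disk(paths, "concatenated.h5ad")
```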