Reading and writing anndata in chunks

Open veda11391 opened this issue 5 years ago • 12 comments

Hi, I have been trying to create an AnnData h5ad file with the anndata package from multiple 10X experiments. I have a separate, huge file for each experiment and would like to merge them into one single file, which is the same as concatenating different h5ads. The problem is that I can't load all of the .X matrices into memory. Is there a way to read the data from h5, csv, or h5ad in chunks, write each chunk to a file on disk, then load the next chunk and write it too, just like resizing an h5 dataset with h5py? That is, keep the h5ad open in append mode, resize it, and write to the file on disk, so that data already written does not need to stay in memory. Is there an option like this? Please help me.

veda11391 avatar Jan 30 '20 07:01 veda11391

@davidsebfischer has a solution for that. It wasn’t generally applicable, but I forgot why. In any case, he could probably outline to you how to do it with a few lines of code here.

flying-sheep avatar Jan 30 '20 08:01 flying-sheep

Thank you @flying-sheep for your reply. @davidsebfischer, please let me know your suggestions for this problem. Thank you.

veda11391 avatar Jan 30 '20 08:01 veda11391

@LisaSikkema and @cdedonno this may be relevant for you guys soon as well.

LuckyMD avatar Jan 30 '20 09:01 LuckyMD

@flying-sheep do you want this as an anndata functionality (PR) or as a code snippet (example code)?

davidsebfischer avatar Feb 03 '20 12:02 davidsebfischer

For now as a code snippet here to help @veda11391 fast, but if we can include it in anndata in a generic enough way, it would be great to have it there in the long run.

flying-sheep avatar Feb 03 '20 13:02 flying-sheep

Thank you for the reply, @davidsebfischer and @flying-sheep. @davidsebfischer, could you please guide me on how I can achieve this? I was going through the scripts, and any guidance would be really helpful.

veda11391 avatar Feb 06 '20 12:02 veda11391

@flying-sheep, I think we've already got the building blocks for this – out-of-core concatenation and array-format conversion. I've made some progress on adding this to concatenate but have other work which is taking priority for now. Maybe in a month or so?

ivirshup avatar Feb 09 '20 05:02 ivirshup

Sure, but can we have some code block in this issue that helps someone stumbling upon this issue?

flying-sheep avatar Feb 11 '20 12:02 flying-sheep

Sure, I had thought @davidsebfischer had something for that? Pending that:

@veda11391, what data do you need to concatenate? Is it just AnnData objects with X, obs, and var?

ivirshup avatar Feb 12 '20 01:02 ivirshup

@veda11391, this should work, but has some conditions:

  • I'm assuming all of your data has the same var_names
  • I'm assuming all of your Xs are in csr format
  • This will work now, but everything in private modules is subject to change.

Generating some example data:

from pathlib import Path

import anndata as ad
import h5py
import pandas as pd
from scipy import sparse

tmp_dir = Path("tmp_h5ads/")
tmp_dir.mkdir(exist_ok=True)
backed_adatas = []

for prefix in list("abcde"):
    pth = tmp_dir / f"{prefix}.h5ad"
    adata = ad.AnnData(
        sparse.random(100, 100, density=0.3, format="csr"),
        obs=pd.DataFrame(index=[f"{prefix}-cell{i}" for i in range(100)])
    )
    adata.write(pth)
    backed_adatas.append(ad.read_h5ad(pth, backed="r"))

And the actual logic:

with h5py.File("concatenated.h5ad", "w") as f:
    ad._io.h5ad.write_attribute(f, "X", sparse.csr_matrix((0, 100)))
    base_dset = ad._core.sparse_dataset.SparseDataset(f["X"])

    # Concatenate adatas
    for adata in backed_adatas:
        base_dset.append(adata.X)

    ad._io.h5ad.write_attribute(f, "obs", pd.concat((a.obs for a in backed_adatas)))
    ad._io.h5ad.write_attribute(f, "var", backed_adatas[0].var)
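
For anyone wondering why the csr condition matters: appending rows to a CSR matrix only needs the new values and column indices added to the end of the existing arrays, plus the new row pointers shifted by the number of values already stored. Nothing already written has to be read back. Below is a minimal pure-Python sketch of that idea. It is not anndata's actual implementation, and the helper names (empty_csr, append_csr, to_dense) and the dict layout are made up for illustration.

```python
def empty_csr(n_cols):
    """Hypothetical helper: an empty CSR matrix with 0 rows and n_cols columns."""
    return {"data": [], "indices": [], "indptr": [0], "shape": (0, n_cols)}


def append_csr(base, other):
    """Append the rows of `other` to `base` in place (illustrative only)."""
    if base["shape"][1] != other["shape"][1]:
        raise ValueError("both matrices must have the same number of columns")
    # Number of values already stored in `base`.
    offset = base["indptr"][-1]
    # Values and column indices are simply added at the end.
    base["data"].extend(other["data"])
    base["indices"].extend(other["indices"])
    # Drop other's leading 0 and shift its row pointers by the existing count.
    base["indptr"].extend(offset + p for p in other["indptr"][1:])
    base["shape"] = (base["shape"][0] + other["shape"][0], base["shape"][1])


def to_dense(csr):
    """Expand to a list of rows, only to check the result on small inputs."""
    n_rows, n_cols = csr["shape"]
    rows = []
    for r in range(n_rows):
        row = [0] * n_cols
        for k in range(csr["indptr"][r], csr["indptr"][r + 1]):
            row[csr["indices"][k]] = csr["data"][k]
        rows.append(row)
    return rows


# Two small chunks: [[1, 0, 2], [0, 0, 3]] and [[0, 4, 0]].
chunk_a = {"data": [1, 2, 3], "indices": [0, 2, 2], "indptr": [0, 2, 3], "shape": (2, 3)}
chunk_b = {"data": [4], "indices": [1], "indptr": [0, 1], "shape": (1, 3)}

out = empty_csr(3)
for chunk in (chunk_a, chunk_b):
    append_csr(out, chunk)
```

On disk the same pattern means growing three one-dimensional datasets, which is why only one chunk needs to be in memory at a time.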

ivirshup avatar Feb 13 '20 03:02 ivirshup

Hey @ivirshup, did this ever make its way into anndata? If not, is the internal _core API stable enough to depend upon?

jacobkimmel avatar Feb 23 '21 23:02 jacobkimmel

Thank you, ivirshup, for the solution. Just wanted to add that if you are trying this with anndata version 0.9.1, you need to replace write_attribute with write_elem.

Niklas1225 avatar Jul 01 '23 12:07 Niklas1225