Concat on disk

Open joshchiou opened this issue 8 months ago • 3 comments

I would like to move to MuData for multiomic data, but one roadblock so far has been concatenating mudata objects that are too large to fit in memory. Are there any plans to implement some of the functions from AnnData e.g. concat_on_disk that could get around memory limitations?
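For context, here is a minimal sketch of how AnnData's existing `anndata.experimental.concat_on_disk` could be applied one modality at a time as a stopgap. It assumes each sample has already been written as one `.h5ad` file per modality under a hypothetical layout `<sample_dir>/<modality>.h5ad`. The helper names `plan_modality_concat` and `run_plan` are illustrative and are not part of mudata or anndata.

```python
from pathlib import Path


def plan_modality_concat(sample_dirs, modalities, out_dir):
    """Map each modality to (input files keyed by sample name, output file).

    Assumes the hypothetical layout <sample_dir>/<modality>.h5ad.
    """
    out_dir = Path(out_dir)
    plan = {}
    for mod in modalities:
        # Key each input file by its sample directory name.
        in_files = {Path(d).name: Path(d) / f"{mod}.h5ad" for d in sample_dirs}
        plan[mod] = (in_files, out_dir / f"{mod}_concat.h5ad")
    return plan


def run_plan(plan):
    """Concatenate each modality on disk without loading full matrices."""
    # Imported here so that building the plan does not require anndata.
    from anndata.experimental import concat_on_disk

    for mod, (in_files, out_file) in plan.items():
        out_file.parent.mkdir(parents=True, exist_ok=True)
        # Observations are stacked along axis 0. Because in_files is a
        # mapping, its keys (sample names) are recorded in .obs["sample"]
        # and appended to the obs names to keep them unique.
        # The default join="inner" keeps only features shared by all inputs.
        concat_on_disk(
            in_files,
            out_file,
            axis=0,
            label="sample",
            index_unique="-",
        )
```

Calling `run_plan(plan_modality_concat(sample_dirs, ["rna", "atac"], "merged"))` would write `merged/rna_concat.h5ad` and `merged/atac_concat.h5ad`. This yields one concatenated file per modality rather than a single `.h5mu` file, so it only partly covers the request above.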

Apr 15 '25 15:04 joshchiou

Thanks for raising this, @joshchiou!

@flying-sheep, @ilan-gold — is concat_on_disk going to graduate from .experimental or rather be deprecated in favor of the new lazy approaches?

@joshchiou, would you mind sharing some description of your workflow that would require object concatenation? It will help to better understand how we can support it!

Apr 22 '25 22:04 gtca

Sure, MultiVI (starting from v.1.2.2) supports using MuData objects instead of stacked AnnData objects. I have been concatenating multimodal (RNA + ATAC) AnnData across hundreds of samples, which is too large to fit into memory, so I was curious if there were plans to implement something similar for MuData objects.

Alternatively (and probably a better path forward), I saw that MappedCollection was planned for a future release of scVI. I'm not sure if this is already compatible with MuData objects, but if not I would love to see it be a feature.

Here is one example of MappedCollection in action with AnnData.

Apr 22 '25 23:04 joshchiou

> is concat_on_disk going to graduate from .experimental or rather be deprecated in favor of the new lazy approaches?

There are some open issues that I think we should resolve before graduating. But it definitely will not be replaced by the lazy approach: there are good reasons to concatenate a bunch of AnnData objects on disk.

The lazy approach is useful for creating an in-memory representation to explore concatenated objects; concatenating objects this way should be both fast and easy. See the last code cell of https://anndata.readthedocs.io/en/latest/generated/anndata.experimental.read_lazy.html.

But for deep learning, it is almost certainly faster to create a concatenated, shuffled, on-disk representation and then use that directly than to go through a mapped collection or similar. More to come on that soon.

Apr 23 '25 08:04 ilan-gold