SnapATAC2

AnnDataSet docs/ comparison with MuData?

Open · ivirshup opened this issue 2 years ago · 1 comment

Are there any docs available for the AnnDataSet class? I couldn't find a page in the docs.

AFAICT, it looks quite like a MuData object, in that it has top level obs, var, etc. and a collection of AnnData objects.

One advantage of your implementation here is the out-of-core support. Are there any other advantages to the structure you've used here?

ivirshup · Apr 26 '22 10:04

I will add the docs later.

AnnDataSet is less powerful than MuData, because AnnDataSet assumes the underlying anndatas come from homogeneous sources, e.g., the same type of experiment, with exactly the same var (at least for now). AnnDataSet is more similar to AnnDataCollection, as the goal of both is to lazily concatenate multiple anndatas. Because the semantics of AnnDataCollection don't take full advantage of the out-of-core nature of my anndata Rust implementation, I implemented AnnDataSet to replace AnnDataCollection. Here are the key features:

  1. AnnDataSet has its own annotations (obs, obsm, etc.). At the same time, it provides lazy (and partial) access to the individual anndatas as well as to the concatenated view of the anndatas (currently limited to row-concatenation), including X, obs, obsm, and obsp. AnnDataCollection, by contrast, has lazy access to X but copies obs and obsm.
  2. AnnDataSet is simply stored as an h5ad file. This is deliberate, so that it can be shared as an AnnData object when you don't need access to the underlying anndatas.
  3. Unlike MuData and similar to AnnDataCollection, AnnDataSet doesn't copy the AnnData objects. It stores links to the individual anndata files. This comes with pros and cons. On the plus side, it saves disk space and IO, and storing anndata objects as individual files makes it easier to do parallel computing. However, this strategy does make AnnDataSet a bit fragile: if someone moves or replaces the files, the AnnDataSet becomes invalid. To mitigate this, I allow users to update the anndata file locations when opening an AnnDataSet object.
  4. Loading an AnnDataSet is fast when you use no_check = True, allowing you to instantly open a large number of AnnData objects and read elements into memory only when necessary. This provides a better user experience (you don't need to wait a long time for the files to be opened and copied into memory) and may be useful if you want to serve an AnnDataSet from a web server.
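
To make points 1, 3 and 4 concrete, here is a toy sketch of the general idea in plain Python. It is not the SnapATAC2 API: the class `LazyRowConcat`, its `relocate` method, the `check` flag and the JSON files are all hypothetical stand-ins chosen so the example runs with the standard library only. It stores links to files rather than copies, resolves a global row index to the file that owns it, reads that file only when a row is requested, and lets the links be updated after a file has moved.

```python
import json
import tempfile
from bisect import bisect_right
from pathlib import Path


class LazyRowConcat:
    """Toy illustration only; hypothetical, not SnapATAC2's AnnDataSet."""

    def __init__(self, links, n_rows, check=True):
        # links: name -> file path; n_rows: name -> number of rows in that file.
        self.links = {name: Path(path) for name, path in links.items()}
        self.names = list(self.links)
        # Cumulative row ends, e.g. sizes 2 and 3 give [2, 5].
        self.offsets = []
        total = 0
        for name in self.names:
            total += n_rows[name]
            self.offsets.append(total)
        self.n_obs = total
        self.reads = 0  # counts how many file reads actually happened
        if check:
            # Optional up-front validation; skipping it makes opening instant.
            missing = [n for n, p in self.links.items() if not p.exists()]
            if missing:
                raise FileNotFoundError(f"missing files for: {missing}")

    def relocate(self, updates):
        # Point existing names at new file locations.
        for name, path in updates.items():
            self.links[name] = Path(path)

    def row(self, i):
        if not 0 <= i < self.n_obs:
            raise IndexError(i)
        k = bisect_right(self.offsets, i)           # which file owns row i
        start = self.offsets[k - 1] if k else 0     # first global row of that file
        with open(self.links[self.names[k]]) as fh:  # read only on demand
            rows = json.load(fh)
        self.reads += 1
        return rows[i - start]


# Two small "datasets" with the same columns, stored as separate files.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.json").write_text(json.dumps([[1, 0], [0, 2]]))
(tmp / "b.json").write_text(json.dumps([[3, 3], [0, 0], [5, 1]]))

ds = LazyRowConcat(
    {"a": tmp / "a.json", "b": tmp / "b.json"},
    {"a": 2, "b": 3},
    check=False,  # nothing is opened or validated at construction time
)
first = ds.row(0)   # served from a.json
fourth = ds.row(3)  # served from b.json (local row 1)

# Simulate a moved file, then repair the link instead of rebuilding the set.
moved = tmp / "moved_b.json"
(tmp / "b.json").rename(moved)
ds.relocate({"b": moved})
last = ds.row(4)
```

The trade-off from point 3 shows up directly: without the `relocate` call, the final `ds.row(4)` would fail because the stored link no longer points at a file.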

kaizhang · Apr 26 '22 17:04