SnapATAC2
AnnDataSet docs/ comparison with MuData?
Are there any docs available for the AnnDataSet class? I couldn't find a page in the docs.
AFAICT, it looks quite like a MuData object, in that it has top-level obs, var, etc. and a collection of AnnData objects.
One advantage of your implementation here is the out-of-core support. Are there any other advantages to the structure you've used here?
I will add the docs later.
AnnDataSet is less powerful than MuData, as AnnDataSet assumes the underlying anndatas come from homogeneous sources, e.g., the same type of experiment, with exactly the same var (at least for now). AnnDataSet is more similar to AnnDataCollection, as the goal of both is to lazily concatenate multiple anndatas. Because the semantics of AnnDataCollection don't take full advantage of the out-of-core nature of my anndata Rust implementation, I implemented AnnDataSet to replace AnnDataCollection. Here are the key features:
- AnnDataSet has its own annotations (obs, obsm, etc.). At the same time, it provides lazy (and partial) access to individual anndatas as well as to the concatenated view (currently limited to row-concatenation) of the anndatas, including X, obs, obsm, and obsp. AnnDataCollection has lazy access to X, while copying obs and obsm.
- AnnDataSet is simply stored as an h5ad format file. This is deliberate, so it can be shared as an AnnData object when you don't need access to the underlying anndatas.
- Unlike MuData and similar to AnnDataCollection, AnnDataSet doesn't copy AnnData objects. It stores links to the individual anndata files. This comes with pros and cons. First, it saves disk space and IO. Storing anndata objects as individual files also makes it easier to do parallel computing. However, this strategy does make AnnDataSet a bit fragile, as one may move or replace the files, which makes the AnnDataSet invalid. To mitigate this, I allow users to update the anndata file locations when opening an AnnDataSet object.
- Loading an AnnDataSet is fast when you use no_check = True, allowing you to instantly open a large number of AnnData objects and read elements into memory only when necessary. This provides a better user experience (you don't need to wait a long time for the files to be opened and copied to memory) and may be useful if you want to serve an AnnDataSet using a web server.
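To make the linking and lazy-access ideas above a bit more concrete, here is a toy sketch in plain Python. It is not SnapATAC2 code and does not use its API: LazyRowConcat, relocate, row, and the check flag are hypothetical names made up for illustration, and small JSON files stand in for the individual h5ad files. It only shows the general pattern of storing links, mapping a row of the concatenated view to the file that holds it, and repairing a link after a file has moved.

```python
import json
import os
import tempfile
from bisect import bisect_right


class LazyRowConcat:
    """Toy stand-in for a lazily row-concatenated dataset (hypothetical, not SnapATAC2)."""

    def __init__(self, links, n_rows, check=True):
        # links: {sample name -> file path}; n_rows: {sample name -> row count}.
        self.links = dict(links)
        self.names = list(links)
        # Cumulative row counts map a global row to (sample, local row).
        self.offsets = []
        total = 0
        for name in self.names:
            total += n_rows[name]
            self.offsets.append(total)
        self.reads = 0  # number of times a linked file was actually opened
        if check:
            # Eager validation touches every linked file up front.
            for path in self.links.values():
                if not os.path.exists(path):
                    raise FileNotFoundError(path)

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def relocate(self, updates):
        """Point existing sample names at new file locations."""
        for name, path in updates.items():
            if name not in self.links:
                raise KeyError(name)
            self.links[name] = path

    def row(self, i):
        """Read one row of the concatenated view, opening only the file that holds it."""
        if not 0 <= i < len(self):
            raise IndexError(i)
        k = bisect_right(self.offsets, i)
        start = self.offsets[k - 1] if k else 0
        with open(self.links[self.names[k]]) as fh:
            self.reads += 1
            return json.load(fh)[i - start]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "a.json")
        b = os.path.join(tmp, "b.json")
        with open(a, "w") as fh:
            json.dump([[1, 0], [0, 2]], fh)
        with open(b, "w") as fh:
            json.dump([[3, 3], [4, 0], [0, 5]], fh)

        # check=False skips validation, so no file is touched at open time.
        ds = LazyRowConcat({"a": a, "b": b}, {"a": 2, "b": 3}, check=False)
        opened_at_start = ds.reads
        third = ds.row(2)  # global row 2 is the first row of sample "b"

        # Moving a linked file invalidates the link until it is updated.
        moved = os.path.join(tmp, "b_moved.json")
        os.rename(b, moved)
        try:
            ds.row(4)
            broken = False
        except FileNotFoundError:
            broken = True
        ds.relocate({"b": moved})
        last = ds.row(4)
        return len(ds), opened_at_start, third, broken, last
```

Calling demo() builds two small files, opens the toy dataset without reading either of them, and then shows a lookup failing after one file is renamed and succeeding again once the link is updated. The real AnnDataSet works on h5ad files and exposes much more (obs, obsm, obsp, etc.), so treat this purely as a picture of the trade-off described above.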