Support training from `AnnCollection`
**Is your feature request related to a problem? Please describe.**
- Currently, `scvi-tools` does not support training from the new `anndata.experimental.AnnCollection` API.
- `AnnCollection` is great! For teams training models on large datasets, it's a game changer.
**Describe the solution you'd like**
- Support for training from `AnnCollection` objects through the existing API would be great.
- Ideally, users would not be required to make substantial changes to the existing `scvi-tools` workflow.
**Question for maintainers**
- I've implemented a solution that achieves the desired outcome without modifying any existing `scvi-tools` code.
- In brief, I wrote a set of wrappers for `AnnCollection` that mimic the `anndata.AnnData` API in all the ways `scvi-tools` expects. We've successfully trained simple models with this solution.
- In practice, users wrap their collection objects (`wrapped_collection = Wrapper(collection)`) and then proceed with the `scvi-tools` workflow as normal (`setup_anndata(wrapped_collection, ...)`, etc.).
- Would the team be interested in incorporating this interface into the main `scvi-tools` package? If so, I can send in a PR. I'd imagine it living as a separate module under `.data`.
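For readers, the wrapper idea above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation from the issue: the class name `AnnFaker` and the specific patched attributes (`.uns`, `is_view`) are assumptions about what `scvi-tools` touches on an `AnnData`-like object.

```python
class AnnFaker:
    """Hypothetical sketch: delegate attribute access to a wrapped
    AnnCollection and patch in the pieces of the anndata.AnnData API
    that scvi-tools expects but AnnCollection lacks."""

    def __init__(self, collection):
        self._collection = collection
        # scvi-tools stores setup metadata in .uns; provide one.
        self.uns = {}

    def __getattr__(self, name):
        # Forward everything the collection already provides
        # (X, obs, var, shape, layers, ...).
        return getattr(self._collection, name)

    @property
    def is_view(self):
        # scvi-tools refuses to set up views; always report a non-view.
        return False

    def __len__(self):
        return self._collection.shape[0]
```

Delegation via `__getattr__` keeps the wrapper thin: anything the collection already exposes passes through unchanged, and only the missing pieces are patched in.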
Hi, thanks for the suggestion. We are currently looking into supporting `MappedCollection` from lamindb. However, `AnnCollection` works with `setup_anndata`, while `MappedCollection` requires custom dataloaders. Do you use `AnnCollection` in disk-backed mode, or are the datasets loaded into memory? Could you provide the code? We can then discuss within scverse, together with the AnnData developers, how to enable this and how stable `AnnCollection` is. We could have a function similar to, e.g., the organize-multiome function in MultiVI: https://github.com/scverse/scvi-tools/blob/c53efe06379c866e36e549afbb8158a120b82d14/src/scvi/data/_preprocessing.py#L283
> However, `AnnCollection` works with `setup_anndata`, while `MappedCollection` requires custom dataloaders. Do you use `AnnCollection` in disk-backed mode, or are the datasets loaded into memory?
I set it up to use `AnnCollection` with backed `AnnData` objects. I don't really see an advantage to using `AnnCollection` if everything is in memory anyway -- the overhead of `anndata.concat(...)` is pretty minimal.
Here's a sample snippet of how I created the objects.
```python
import gdown
import scanpy as sc
import scvi
from anndata.experimental import AnnCollection

# get some data
gdown.download(
    url="https://drive.google.com/uc?id=1X5N9rOaIqiGxZRyr1fyZ6NpDPeATXoaC",
    output="pbmc_seurat_v4.h5ad",
    quiet=False,
)
gdown.download(
    url="https://drive.google.com/uc?id=1JgaXNwNeoEqX7zJL-jJD3cfXDGurMrq9",
    output="covid_cite.h5ad",
    quiet=False,
)

# load in backed mode
covid = sc.read("covid_cite.h5ad", backed="r")
pbmc = sc.read("pbmc_seurat_v4.h5ad", backed="r")

# make a collection
collection = AnnCollection(
    [covid, pbmc], join_vars="inner", join_obs="inner", label="dataset"
)

# use the wrapper
wrapped_collection = AnnFaker(collection)

# register the wrapped collection and train a model
scvi.model.SCVI.setup_anndata(
    wrapped_collection,
    layer="test",
    batch_key="dataset",
)
model = scvi.model.SCVI(wrapped_collection, n_latent=10)
model.train(max_epochs=20)
# training completes, latent matches expectations
```
> Could you provide the code? We can then discuss within scverse, together with the AnnData developers, how to enable this and how stable `AnnCollection` is.
Sure, here's a minimal implementation in Colab: https://colab.research.google.com/drive/1v9B62IfLM8qBfgmvDYnCs3GZaaUvnG26?usp=sharing
Hi Jacob, the current AnnData developers have no plan to continue developing `AnnCollection`; the idea is to focus on the dask integration instead, which should handle this use case similarly. I'm a bit reluctant to build on an experimental feature that is not actively developed. We are still working on `MappedCollection` support, which is an on-disk concatenation as well. I hope one of these covers your use case. Please reopen if you think it's critical to support `AnnCollection` over the other two solutions.
To follow up on this: `MappedCollection` support is now production ready (we still need to fix the tests there): https://github.com/scverse/scvi-tools/pull/3193/files. `MappedCollection` has several advantages over `AnnCollection` in terms of IO, but it requires setting up a lamindb instance. We will post a small tutorial on this in the near future; `test_lamindb_dataloader_scvi_scanvi` contains the code to set it up.