Feature Request: Structured multi-assay support (counts, data, scale) — beyond SCT, as a general object model
🚀 Feature Request: Structured Multi-Assay Support in AnnData
Summary
Propose native support in AnnData for a structured multi-assay architecture, where each assay (e.g. RNA, ADT, ATAC) can contain standard substructures like counts, data, and scale, similar to Seurat and SingleCellExperiment (SCE). This would extend AnnData’s flexibility and expressiveness for diverse workflows including normalization pipelines, multimodal data, and modular preprocessing strategies.
📌 Motivation
In Seurat and SCE:
- A single object supports multiple assays.
- Each assay has structured slots (`counts`, `data`, `scale.data`, etc.).
- Switching between assays and their layers is clean and reproducible.
In AnnData today:
- Only one `.X` is supported.
- Additional views (e.g., normalized, scaled) are stored ad hoc in `.layers`.
- Multi-modal data or multi-normalization workflows require naming hacks:
adata.layers["rna_counts"]
adata.layers["sct_data"]
adata.obsm["adt"]
There is no grouping of layers, metadata, or parameters by assay.
This creates fragile, non-standard, and hard-to-read pipelines.
💡 Proposal
🔹 Add a formal .assays slot
Allow structured access to multiple assays within a single AnnData object:
adata.assays["RNA"]
adata.assays["SCT"]
adata.assays["ADT"]
🔹 Each Assay would include:
adata.assays["RNA"].layers["counts"] # raw counts
adata.assays["RNA"].layers["data"] # normalized
adata.assays["RNA"].layers["scale"] # scaled
adata.assays["RNA"].X # Optional: shortcut for default layer (e.g., 'data')
adata.assays["RNA"].var # Assay-specific gene metadata
adata.assays["RNA"].uns # Assay-specific parameters/config
🔹 Global utility functions
adata.set_current_assay("RNA")
adata.get_active_layer() # Returns counts / data / scale based on context
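To make the proposed shape concrete, here is a minimal sketch in plain Python. It is not AnnData code: `Assay`, `MultiAssayData`, `default_layer`, and `current_assay` are hypothetical names, nested lists stand in for matrices, and only `set_current_assay` and `get_active_layer` come from the proposal above.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

Matrix = list[list[float]]  # stand-in for a dense or sparse matrix


@dataclass
class Assay:
    """Hypothetical per-assay container; not part of AnnData."""
    layers: dict[str, Matrix] = field(default_factory=dict)
    var: dict[str, list[Any]] = field(default_factory=dict)  # per-feature metadata
    uns: dict[str, Any] = field(default_factory=dict)        # parameters/config
    default_layer: str = "data"

    @property
    def X(self) -> Matrix:
        # Shortcut for the default layer, as described in the proposal.
        return self.layers[self.default_layer]


@dataclass
class MultiAssayData:
    """Hypothetical object holding several assays over the same observations."""
    assays: dict[str, Assay] = field(default_factory=dict)
    current_assay: Optional[str] = None

    def set_current_assay(self, name: str) -> None:
        if name not in self.assays:
            raise KeyError(f"unknown assay: {name!r}")
        self.current_assay = name

    def get_active_layer(self) -> Matrix:
        if self.current_assay is None:
            raise ValueError("no current assay set")
        return self.assays[self.current_assay].X


rna = Assay(
    layers={"counts": [[3, 0], [1, 2]], "data": [[1.39, 0.0], [0.69, 1.10]]},
    var={"gene": ["CD3E", "MS4A1"]},
    uns={"normalization": "log1p"},
)
obj = MultiAssayData(assays={"RNA": rna})
obj.set_current_assay("RNA")
active = obj.get_active_layer()  # the "data" layer of the RNA assay
```

The sketch leaves open how observations would be shared or aligned across assays, which is the part a real implementation would have to settle.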
✅ Benefits
- Clean, standardized support for multiple assays and multiple representations.
- Better alignment with Seurat and SCE, facilitating interoperability.
- Enables complex workflows such as:
  - SCTransform (Pearson residuals + raw counts)
  - CLR / log-normalization comparisons
  - RNA + ADT + ATAC integration
  - Denoising / imputation benchmarking
- Easier switching between data views
- Clearer pipelines, less risk of user error
🔄 Compatibility
- Backwards-compatible: `.X` can point to the default layer of a default assay.
- Could gracefully promote `.layers` into structured sub-objects, preserving existing behavior while offering more structure.
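One way to picture the "promotion" of flat layers is a helper that groups prefixed keys by assay. The sketch below is only an illustration under assumptions: `promote_layers` is a hypothetical function, the `<assay>_<layer>` key convention is inferred from the `rna_counts` / `sct_data` examples above, and plain strings stand in for matrices.

```python
from collections import defaultdict


def promote_layers(layers, known_assays, default_assay="rna"):
    """Group flat layer keys such as "rna_counts" into {assay: {layer: matrix}}.

    Hypothetical helper: keys without a known assay prefix are kept under
    `default_assay`, so an existing flat layout still maps to something valid.
    """
    grouped = defaultdict(dict)
    for key, matrix in layers.items():
        prefix, sep, rest = key.partition("_")
        if sep and prefix in known_assays:
            grouped[prefix][rest] = matrix
        else:
            grouped[default_assay][key] = matrix
    return dict(grouped)


flat = {"rna_counts": "M1", "sct_data": "M2", "scaled": "M3"}
nested = promote_layers(flat, known_assays={"rna", "sct"})
```

Relying on key prefixes is fragile (a layer legitimately named `rna_like` would be misfiled if `rna` is a known assay), which is part of the argument for an explicit structure over naming conventions.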
🔍 Related Projects
| Framework | Multi-Assay Support | Layer Structuring | Notes |
|---|---|---|---|
| Seurat (R) | ✅ | ✅ (counts, data, scale.data) | Well-defined Assay class |
| SCE (R) | ✅ | ✅ | Widely used in Bioconductor |
| MuData (Python) | ✅ | ❌ (overkill for mono-assay) | Designed for multi-modal omics |
| AnnData (current) | ❌ | ❌ | Flat structure with ungrouped layers |
❓ Open Questions
- Would you consider integrating this natively into AnnData?
- Should this be part of AnnData core, or exist as a well-supported extension?
- Would `.assays` be required for future workflows, or remain optional?
- Would you be open to a community-driven prototype or API proposal?
🙏 Thanks
Thank you for your hard work maintaining this essential tool.
AnnData is already an amazing foundation, and this enhancement would further align it with the evolving needs of single-cell analysis workflows across modalities, species, and platforms.
Happy to contribute or help prototype.
Best regards,
Benjamin
@BenjaminDEMAILLE Have you seen https://github.com/scverse/mudata? I think this is probably our go-to for multi-modal integration and has a lot of what you are asking for at first glance.
Thanks for the pointer to MuData — it's a great tool for structured multi-modal integration, and I agree it's powerful for managing distinct omics types.
That said, my proposal here is a bit different in scope: it's about adding lightweight, structured multi-assay support directly within a single AnnData object, similar to what Seurat and SingleCellExperiment offer. The motivation is to improve clarity and flexibility for workflows that involve:
- Multiple normalization strategies (e.g., raw counts, log, CLR, SCT)
- Combined modalities like RNA + ADT, without relying heavily on `.obsm`
- Modular preprocessing pipelines that benefit from assay-specific metadata, parameters, and layers
The idea is not to replace MuData, but to complement it — by making AnnData itself more expressive and ergonomic for many common, non-overlapping use cases.
A structure like:
adata.assays["RNA"].layers["counts"]
adata.assays["RNA"].layers["lognorm"]
adata.assays["RNA"].X # Default view
adata.assays["RNA"].var
…would greatly improve reproducibility and make pipelines more readable and modular.
If there’s interest, I’d be happy to help sketch out a prototype or open a formal API proposal. Thanks again for maintaining these tools and for engaging with the community!
Would you like help drafting a minimal working prototype or mock API for .assays?
> The idea is not to replace MuData, but to complement it, by making AnnData itself more expressive and ergonomic for many common, non-overlapping use cases.
Could you maybe flesh this out a bit more? I would not want something literally called .assays here because AnnData is agnostic to the type of data (simply that it follows the obs-var paradigm). For example, looking at what you've posted so far:
> Multiple normalization strategies (e.g., raw counts, log, CLR, SCT)
I'm not sure how an .assays feature supports this. The examples you give (from "A structure like:") seem to all use layers. It seems like what you might be looking for is a standardized location for the sorts of things that scanpy is known to spit out. This is something we could investigate in scanpy. Perhaps we could make something available there to validate that raw counts are both stored in and accessible at a certain location (and the same for log-normalized data, etc.). In general, we are moving away from X as an ambiguous default (see scverse/anndata#244).
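The kind of validation mentioned here could look roughly like the sketch below. It is not a scanpy function: `find_raw_counts` is a hypothetical name, the `"counts"` key is an assumed convention, and the layers are a plain mapping of names to nested lists so that the example stays self-contained.

```python
def find_raw_counts(layers, key="counts"):
    """Return the matrix stored under `key` after checking it looks like raw counts.

    `layers` is any mapping of layer name -> row-major matrix (lists of numbers).
    The key name "counts" is an assumed convention, not something scanpy enforces.
    """
    if key not in layers:
        raise KeyError(f"no layer named {key!r}; available: {sorted(layers)}")
    matrix = layers[key]
    for row in matrix:
        for value in row:
            # Raw counts should be non-negative whole numbers.
            if value < 0 or value != int(value):
                raise ValueError(f"layer {key!r} holds non-count value {value!r}")
    return matrix


layers = {
    "counts": [[3, 0], [1, 2]],
    "lognorm": [[1.39, 0.0], [0.69, 1.10]],
}
counts = find_raw_counts(layers)
```

A check like this only confirms that the values are plausible counts; it cannot tell whether they were filtered or subsampled upstream.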
> Combined modalities like RNA + ADT, without relying heavily on .obsm
In this case, I'm not sure I see how what you're proposing is substantially different from what MuData offers, except it is missing the ability to match observation spaces (which MuData offers). I agree that what is here is "lightweight" for the user in this specific use-case, but it becomes a heavyweight developer burden (since it would be a large refactor that we would then have to maintain) as well as a user burden for any breaking on-disk changes that might have to occur.
> Modular preprocessing pipelines that benefit from assay-specific metadata, parameters, and layers
This seems like a mix of the above two things.
Another question is whether .assays would be the default (i.e., whether you would always need a "labeled" assay). Again, we want to be agnostic to the type of data being used here, not just having it be biological/sequencing focused.
If you could distinguish a bit more between the above points about MuData/scanpy and what you are proposing, that would help me understand the proposal better. Thanks for the issue!
Thanks for the thoughtful reply!
To clarify: the idea is not to replace MuData, but to complement it by making AnnData a bit more ergonomic for common workflows — especially those that don’t require full multi-modal support, but would still benefit from a bit more structure.
I’m mainly taking inspiration from Seurat’s data structure, where multiple assays (RNA, ADT, etc.) are organized in a consistent way, each with their own set of normalized data, scaling parameters, variable features, etc. This structure has proven very practical for modular pipelines, while still keeping things explicit and user-accessible.
I agree that the examples I gave could be implemented using layers, but the point is about standardizing their organization — not the mechanics. In Scanpy, things like log1p, scale, highly_variable_genes, and raw are scattered across attributes (X, layers, uns, etc.) with no clear way to track multiple variants in parallel. This is not a limitation of AnnData, but rather a question of convention and discoverability.
For example, having a designated .assays container (or something similarly named, not necessarily .assays) could:
- Let users track multiple preprocessing branches (e.g., raw, log, CLR, SCT) in a consistent structure,
- Improve modularity and clarity in pipelines,
- Enable tools to more easily validate assumptions (e.g., “is there a raw count matrix stored and where?”),
- Avoid overloading obsm or layers with opaque keys or modality mixing.
I’m absolutely not suggesting that every AnnData must have such a structure, or that it should enforce biology-specific logic. Rather, it’s about offering an optional, standardized schema — one that could be checked, manipulated, or extended by higher-level tools (like Scanpy or others), but doesn’t constrain the base object.
I’d be happy to work on a small proposal or prototype that shows how this might look in practice — staying light, explicit, and backward-compatible. Thanks again for the discussion!
@BenjaminDEMAILLE It sounds like you have a great idea for a new package here. Built upon anndata and scanpy it could have
- A new container with the desired `assays` slot that is used internally and returned to users
- Wrappers around common scanpy functions that insert things into predefined keys
- A list of such keys available to users as well as accessors on the container itself/sub-`AnnData` objects
https://github.com/scverse/anndata/pull/1870 should make this easier as you could implement adata.standardized_pipeline_package.get_counts directly. Or you could implement the assays feature directly with that, and then have the "default" AnnData object be completely empty (just spitballing, not sure what things like adata.X would do when you have adata.assays["RNA"] - be empty? not exist at all?).
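For readers unfamiliar with the accessor idea, the general pattern can be sketched in plain Python with a descriptor. This is not anndata's registration mechanism and does not reproduce the API from the linked PR: `register_namespace`, `ToyData`, and `PipelineAccessor` are hypothetical, and only the `standardized_pipeline_package.get_counts` spelling is taken from the comment above.

```python
class _Namespace:
    """Descriptor that builds an accessor instance on attribute access."""

    def __init__(self, accessor_cls):
        self._accessor_cls = accessor_cls

    def __get__(self, obj, objtype=None):
        if obj is None:  # accessed on the class rather than an instance
            return self._accessor_cls
        return self._accessor_cls(obj)


def register_namespace(container_cls, name):
    """Return a decorator that attaches an accessor class to `container_cls`."""
    def decorator(accessor_cls):
        setattr(container_cls, name, _Namespace(accessor_cls))
        return accessor_cls
    return decorator


class ToyData:
    """Toy stand-in for an AnnData-like object with a flat `layers` mapping."""

    def __init__(self, layers):
        self.layers = layers


@register_namespace(ToyData, "standardized_pipeline_package")
class PipelineAccessor:
    COUNTS_KEY = "counts"  # assumed convention for where raw counts live

    def __init__(self, data):
        self._data = data

    def get_counts(self):
        return self._data.layers[self.COUNTS_KEY]


toy = ToyData({"counts": [[3, 0], [1, 2]]})
counts = toy.standardized_pipeline_package.get_counts()
```

The appeal of this design is that the conventions live in a separate package, so the base container and its on-disk format stay unchanged.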
I don't want to say "no" because I haven't seen the idea fully laid out, and you never know what can come of that, but at the moment I don't think this feature weighs strongly enough against the idea of rewriting large swaths of the codebase, figuring out the io (which we would then have to be responsible for maintaining ad infinitum) etc.
What do you think?
I feel like this overlaps super strongly with MuData, besides the fact that MuData does not use the term assay and does not have a default modality (assay). I am rather wondering whether any of your ideas could make it into MuData instead.
@gtca
@Zethson agreed - I also pinged @flying-sheep about a possible scanpy 2.0 accessor for the log counts, i.e., adata.scanpy_accessor.get_counts, but this is a ways off, so it's possible to have something in the interim
Thanks, @BenjaminDEMAILLE!
From the last code snippet, this is basically what the MuData/AnnData stack already provides:
mdata["RNA"].layers["counts"]
mdata["RNA"].layers["lognorm"]
mdata["RNA"].X
mdata["RNA"].var
I believe all three points about flexible workflows are addressed by its current design.
And just to mention it here for completeness, there were other discussions about the multi-assay design such as https://github.com/scverse/anndata/issues/237.