scanpy Spin off read/write into its own package?

Open adamgayoso opened this issue 3 years ago • 17 comments

As we are refactoring scvi, I'm wondering what the utility would be to spin off the i/o part of scanpy into its own lightweight package that's more general for reading single cell pipeline outputs into anndata. For example, we'd like to use scanpy's 10x/visium read functions, but don't necessarily want to have scanpy be a full requirement. It's also a bit confusing why something like read_umi_tools is in anndata but not read_10x_h5. The same goes for loom to some extent. I could imagine either moving such functionality to anndata or a standalone package that could be expanded to include support for other technologies like scATAC-seq.

This overall could be a big benefit to methods developers who would like to build on anndata.

Aug 21 '20 17:08 adamgayoso

This makes me think of a few things.

Moving 10x reading functions to `anndata`

Initially the idea was all single cell stuff should go into scanpy. Since we read loom into anndata, I think we'd be okay with putting the 10x readers into anndata, especially if there is broad consensus this would make development of other packages easier. However, they would need to be re-written to use h5py instead of pytables.

`scanpy` as a requirement

Is the requirement of scanpy so bad? If there are pain points here, should we be trying to make a basic scanpy installation lighter weight?

Splitting off new modules

Finally, the idea of having IO functions go into their own package. I think this is a much bigger change, and I'd like to see a more fleshed out case for it. This would add a fair bit of complexity to development, so I'd want to be sure it's worth it. Some general questions I have:

- What are the advantages/ disadvantages of having smaller sub-packages?
  - How does this impact users vs. developers?
- Is IO special, or should more parts go into sub-packages?
- What gets re-exported from "main" modules?
- Who manages the sub-packages?

A more specific question: how modular of a component is IO? In some cases, like reading transcriptomics-only datasets, I'd say very. For more recent developments, like visium, I'm not sure this is the case. What we read in, and how we represent it, is very tightly coupled to the methods we have. Until the data's been around for a bit longer, I think it would make sense to keep visium IO in the same package as methods for it.

Aug 25 '20 08:08 ivirshup

Moving 10x reading functions to anndata

I haven't worked much with h5py or tables, is it time-consuming to refactor these functions? It seems like moving to anndata is the most straightforward solution at least logically to me.

scanpy as a requirement

I like scanpy, but the only thing we really require in scvi is the data loading part. A user could take their scvi outputs and go use Seurat if that makes them happy. And the data loading functions are simple enough that we could just implement them ourselves. I'm sure a lot of people are currently doing this, which inspired the idea to have a standalone package.

Splitting off new modules

Your questions are very valid. I don't really have good answers for them. I could just see a standalone package being widely used and community driven, especially if there is some scanpy backing + maybe optional dependencies/functionality to get your objects ready for R analysis pipelines.

Aug 25 '20 18:08 adamgayoso

I haven't worked much with h5py or tables, is it time-consuming to refactor these functions? It seems like moving to anndata is the most straightforward solution at least logically to me.

In this case, I think it should be fine. It might not happen too soon if we're left to our own devices, so a PR is welcome.

I could just see a standalone package being widely used and community driven

What formats that aren't in anndata would you see in this package? I'm trying to get an idea of the kind of scope you're thinking of here. I think there are formats where there isn't one obvious "right way" to represent them as an AnnData object (e.g. visium), so having a canonical reading/ writing function is difficult.

Aug 31 '20 06:08 ivirshup

pytables is starting to throw warnings, so it may be time for a rewrite and moving the function.

Jul 14 '21 08:07 ivirshup

I would love to see file I/O in Anndata. I imagine this would make things easier for episcanpy as well. That package can then focus more on setting up count tables where they are not nicely provided. Otherwise it becomes a bit difficult for the new user (me) to distinguish data loading and setting up new tables.

Jul 14 '21 08:07 LuckyMD

Splitting off new modules

Finally, the idea of having IO functions go into their own package. I think this is a much bigger change, and I'd like to see a more fleshed out case for it. This would add a fair bit of complexity to development, so I'd want to be sure it's worth it. Some general questions I have:

What are the advantages/ disadvantages of having smaller sub-packages?

method developers would just depend on those instead of (multiple) analysis packages

How does this impact users vs. developers?

user none, as the analysis package would ofc have the IO as dep. developer would be impacted by a leaner dep tree

Is IO special, or should more parts go into sub-packages?

it kind of is imho, it's all about having whatever data there is in an anndata/mudata shape. I must say that I'd also think plotting could be its own separate package but it would probably require a lot of refactoring across packages (thinking about duplication of scanpy/scvelo code)

What gets re-exported from "main" modules?

didn't get this sorry

Who manages the sub-packages?

the IO subpackage? everyone 😅

so beside being in favour, it might also be that other issues arise. For instance, for modality-specific formats we'd have to rely on specific external libraries which would then have to be lazily imported (as pointed out before). Would this create the premise of exponential growing of modality-specific lazy import libraries? probably yes. Is this best practice? I don't know.

Mar 04 '22 17:03 giovp

it kind of is imho, it's all about having whatever data there is in an anndata/mudata shape. I must say that I'd also think plotting could be its own separate package but it would probably require a lot of refactoring across packages (thinking about duplication of scanpy/scvelo code)

@giovp and I are synced on these ideas :)

Mar 04 '22 17:03 adamgayoso

Would this create the premise of exponential growing of modality-specific lazy import libraries? probably yes. Is this best practice? I don't know.

Could be, but even in this case, I think centralization is so so important and this package could receive a lot of community interaction....

Mar 04 '22 17:03 adamgayoso

Here are instances that could have leveraged scio

Mar 04 '22 18:03 adamgayoso

I'm wondering if we can come to some agreement on a slight modification to this proposal.

How does this impact users vs. developers?

user none, as the analysis package would ofc have the IO as dep. developer would be impacted by a leaner dep tree

This seems good.

Who manages the sub-packages?

the IO subpackage? everyone 😅

😅 indeed

For instance, for modality-specific formats we'd have to rely on specific external libraries which would then have to be lazily imported (as pointed out before). Would this create the premise of exponential growing of modality-specific lazy import libraries? probably yes. Is this best practice? I don't know.

I feel like complicated dependency management was what we were trying to avoid here.

Also it's nice when you install a package, call a function, and it works; less nice to have to start mucking around with dependencies.

An alternative: project specific IO

squidpy_io, muon_io

Packages which read in package specific formats with a minimal set of dependencies.

We can keep muon.read_10x_atac, so nothing changes for users.

We skip out on complicated ownership and complicated dependencies. This should be very low overhead.

Mar 04 '22 21:03 ivirshup

Project specific IO is interesting but IMO makes it even more complicated in some ways. The current biggest problem we face is that no one knows where to go to read certain formats.... scanpy? muon? squidpy? Scanpy has read visium but squidpy is the spatial package? I can analyze atac data in scanpy but need to use muon to read the file?

Seurat has basically every reader one would need. This kind of fractured environment is not going to help us gain ground.

Who manages the sub-packages?

Scverse (also it's one package not many). We are talking about 5-15 readers that have been touched a handful of times in 4-5 years. I don't think this is a complicated package to maintain. Agree that one person needs to take the lead on releases (probably very infrequent).

I feel like complicated dependency management was what we were trying to avoid here.

Where is the complicated dependency management? We have a core set of readers (h5, pandas, scipy) and more complex readers (lazy import). We can have a conda env file too for everything if we want. Even anndata lazy imports loom for example. It's a small price to pay for ecosystem synchronization and enhanced user experience.
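The split between core readers and lazily imported ones could look roughly like the sketch below. It is only an illustration under stated assumptions: `lazy_import`, both toy readers, the extra names and the error message are hypothetical, and the standard-library `json` module stands in for a heavier optional dependency so the example runs without installing anything.

```python
import importlib


def lazy_import(module_name, extra):
    """Import an optional dependency only when a reader actually needs it.

    `extra` is the (hypothetical) pip extra that would provide the module.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"Reading this format requires '{module_name}'. "
            f"Install it with: pip install scio[{extra}]"
        ) from err


def read_json_counts(text):
    """Toy reader: the dependency is imported inside the function body,
    so importing the package itself stays cheap."""
    json = lazy_import("json", extra="json")  # stand-in for an optional backend
    return json.loads(text)


def read_exotic_format(path):
    """Toy reader whose backend is assumed not to be installed."""
    backend = lazy_import("surely_not_installed_backend", extra="exotic")
    return backend.read(path)
```

With this pattern a missing backend only surfaces when the corresponding reader is called, and the error tells the user which extra to install.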

Packages which read in package specific formats with a minimal set of dependencies.

It's also unclear to me what package-specific stuff muon has in particular. The way I see it, there's one `read_10x_h5(return_anndata=True, return_mudata=False, gex_only=None)`. I don't think muon is loading any extra information or putting it in any package-specific places?

How does this impact users vs. developers?

Developers: (1) export scio readers into their packages, can contribute improvements to readers, (2) get access to many more practical readers for their packages (scvi-tools has no 10x h5 reader because we don't feel the need to depend on scanpy for one function).

Users: (1) no impact if they continue using the packages they like (e.g., scanpy reader will be completely unchanged). (2) Can go ahead and just use scio and then be on their way (in reality, many people do not feel the need to use scanpy/muon). If there are R converters, this would be a major use case.

What we read in, and how we represent it, is very tightly coupled to the methods we have.

Up for discussion, but read the maximal amount of information by default. If necessary (don't see any particular cases at the moment), package devs use the underlying scio function and reorganize.

Mar 04 '22 21:03 adamgayoso

I am more and more convinced about having a single package for the reasons @adamgayoso mentioned. To address a few concerns from above:

Who manages the sub-packages?

Scverse (also it's one package not many). We are talking about 5-15 readers that have been touched a handful of times in 4-5 years. I don't think this is a complicated package to maintain. Agree that one person needs to take the lead on releases (probably very infrequent).

Scverse core developers could take turns (e.g. every 6 months) in being "lead maintainer", i.e. in charge of releases and first-responders to issues (delegating them to the most appropriate people). This has the additional advantage that everything needs to be documented to a point that there can't be a single point of failure.

Also it's nice when you install a package, call a function, and it works; less nice to have to start mucking around with dependencies.

pip install scio[all]

could be broadly advertised in the README. Packages could still use the slimmer version, e.g. in scirpy, I could depend on scio[vdj].
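As a rough sketch of how extras such as `scio[all]` and `scio[vdj]` could be declared: the dict below has the shape setuptools accepts as `extras_require=` in `setup()`. The groupings and dependency names are illustrative assumptions, not an agreed design for scio.

```python
# Hypothetical extras for a hypothetical `scio` package; the group names and
# the dependencies listed under them are assumptions made for illustration.
EXTRAS = {
    "loom": ["loompy"],
    "vdj": ["airr"],
    "spatial": ["pillow"],
}


def with_all(extras):
    """Return a copy of `extras` plus an 'all' entry that unions every group."""
    combined = sorted({dep for deps in extras.values() for dep in deps})
    return {**extras, "all": combined}


# Pass this as `extras_require=EXTRAS_REQUIRE` to setuptools.setup().
EXTRAS_REQUIRE = with_all(EXTRAS)
```

Deriving `all` from the other groups keeps the two from drifting apart when a new reader adds a dependency.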

I think there are formats where there isn't one obvious "right way" to represent them as an AnnData object (e.g. visium), so having a canonical reading/ writing function is difficult.

I think we should aim at having one obvious "right way" to represent something with AnnData and MuData. A common scio package could be a way to achieve that.

I know squidpy will be changing its representation and I think muon should have changes to the ATAC representation. Also muon and scvi-tools read in different things from 10x atac data.

A solution to that would be versioned schemata. E.g. whatever squidpy uses now is the "spatial schema v1". When we come up with a better way it becomes the "spatial schema v2". Old schemata will be deprecated but can stick around for a while. If a schema is experimental and subject to active changes it can be v0.1.

scio.spatial.read_visium(path, schema="v1")
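A minimal sketch of how such a `schema` argument might dispatch internally, assuming a simple registry of layout functions. Everything here is hypothetical: the registry, the two layouts and the choice of default are made up for illustration, and `raw` stands in for whatever would be parsed from `path`.

```python
import warnings

# Registry mapping schema version -> function that lays out the raw data.
_SPATIAL_SCHEMAS = {}
_DEPRECATED = set()


def register_schema(version, deprecated=False):
    """Decorator that registers a layout function under a schema version."""
    def decorator(func):
        _SPATIAL_SCHEMAS[version] = func
        if deprecated:
            _DEPRECATED.add(version)
        return func
    return decorator


@register_schema("v1", deprecated=True)
def _layout_v1(raw):
    # Hypothetical old layout: everything under a single "spatial" key.
    return {"schema": "v1", "spatial": raw}


@register_schema("v2")
def _layout_v2(raw):
    # Hypothetical newer layout: coordinates and images kept apart.
    return {"schema": "v2", "coords": raw["coords"], "images": raw["images"]}


def read_visium(raw, schema="v2"):
    """Dispatch on `schema`; deprecated versions still work but warn."""
    if schema not in _SPATIAL_SCHEMAS:
        raise ValueError(
            f"Unknown schema {schema!r}; choose from {sorted(_SPATIAL_SCHEMAS)}"
        )
    if schema in _DEPRECATED:
        warnings.warn(
            f"Schema {schema!r} is deprecated", DeprecationWarning, stacklevel=2
        )
    return _SPATIAL_SCHEMAS[schema](raw)
```

Old schemata stay callable for a while, which matches the idea that they are deprecated rather than removed outright.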

Mar 05 '22 09:03 grst

I think we should aim at having one obvious "right way" to represent something with AnnData and MuData. A common scio package could be a way to achieve that.

Agree, also R packages seem to be doing just fine with Seurat/Bioconductor representation?

Also muon and scvi-tools read in different things from 10x atac data.

This is not intentional at all, muon read atac data would work just fine in our package.

A solution to that would be versioned schemata.

That could be good. The schemata also don't have to be versioned; we can just have a few options, and package devs wrap the method with their choice.
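Wrapping a shared reader with a package's preferred option could be as small as the sketch below. The `layout` parameter, its two values and the wrapper name are assumptions made up for illustration; the stand-in reader just returns a dict instead of an AnnData object.

```python
from functools import partial


def read_visium(path, layout="flat"):
    """Stand-in for a shared reader offering a few unversioned layout options."""
    if layout not in ("flat", "nested"):
        raise ValueError(f"Unknown layout: {layout!r}")
    return {"path": path, "layout": layout}


# A downstream package pins its preferred option once and re-exports the result.
package_read_visium = partial(read_visium, layout="nested")
```

Users of the downstream package keep calling one function with one argument, while the shared reader stays the single place where parsing is implemented.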

Mar 05 '22 17:03 adamgayoso

Been meaning to share this here, but haven't gotten around to writing it up, so I'll just write something quick:

I think if we are going to say "here is the way to represent this kind of data" we shouldn't just set that to be whatever we do currently and call it a standard. I think it's worth consulting with the people this affects, like those we need to do interchange with and data repositories.

The Feature Object Matrix (FOM) schema working group is organizing such a group of stakeholders to define these standards. I think it would make a lot of sense to have more scverse participation with this group and to plan to adopt their standards.

Apr 26 '22 09:04 ivirshup

Sure, don't see how that's mutually exclusive with having a package. We have a huge problem in the ecosystem right now that it's not straightforward to load data from non-RNA-seq experiments (no clear guidance on where to go, etc.).

Apr 26 '22 14:04 adamgayoso

I agree with Adam! I'm all for standardization and joining efforts with the FOM group, but this is an effort that may very well take years. We need a temporary solution in the meantime.

Apr 28 '22 06:04 grst

agree with @grst -- also:

I think if we are going to say "here is the way to represent this kind of data" we shouldn't just set that to be whatever we do currently and call it a standard.

I mean this is what we are currently doing explicitly, it's just scattered across a few packages.

We really need to fill the current gap in accessibility. The first hit below takes me to a package that doesn't have functioning API documentation (while it might work it's not clear if I don't know what I'm doing).

Apr 28 '22 15:04 adamgayoso
