
Support for genomic ranges

Open stephenkraemer opened this issue 2 years ago • 24 comments

Hi!

in our lab (Schlesner Lab, recently moved from Heidelberg right into your neighborhood - Augsburg University), we have different use cases where it would be quite helpful if AnnData objects could track genomic coordinates and allow subsetting by them, similar to the subsetByOverlaps and subsetByRow methods offered by SummarizedExperiment, MultiAssayExperiment and their derived classes.
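
For illustration, the kind of call we have in mind might look roughly like this (the function name below is purely hypothetical, nothing like it exists in anndata today):

# Hypothetical API sketch, not an existing anndata function.
# Subset an AnnData to the variables whose genomic interval overlaps a query
# region, analogous to SummarizedExperiment::subsetByOverlaps in R.
adata_sub = subset_by_overlaps(adata, "chr1:1,000,000-2,000,000")
# ...or by overlap with another interval set (e.g. a set of enhancer regions)
adata_sub = subset_by_overlaps(adata, enhancer_ranges)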

If I didn't miss it, this is currently not supported in anndata or its ecosystem (episcanpy, muon, ...), is that right? Are there any plans to support this? We would also be happy to try and come up with an initial implementation (based on pyranges most likely) for further discussion, either as a contribution to anndata, or perhaps to another ecosystem project?

Cheers, Stephen

stephenkraemer avatar Oct 01 '21 13:10 stephenkraemer

We don't support it currently, but this is definitely something I think the ecosystem could use.

I'm unfamiliar with pyranges, how would you propose to integrate it here? Store a pyranges object in varm?

ivirshup avatar Oct 20 '21 16:10 ivirshup

PyRanges is essentially a dataframe on steroids. It could go to varm but maybe there could be issues in pickling it into h5ad files.

dawe avatar Oct 20 '21 16:10 dawe

@dawe As far as I can tell, it mostly uses basic pandas dtypes internally though, right? So it could be saved the same way we do dataframes?

It would complicate things if the objects must be stored split by chromosome and strandedness (from a quick look at pr.data.exons().__dict__ and pr.data.cpg().__dict__).

ivirshup avatar Oct 20 '21 16:10 ivirshup

Another solution would be to save just the dataframe and add pyranges-aware functions (which import the df into a pr object). I'm thinking of functions that perform intersection/sort/subset and return the index of matching vars. Other operations, such as extending/shrinking the intervals, would only change the associated dataframe. In any case, (chrom, start, end, strand) could then be features in adata.var. The most difficult operation would be summarizing intervals according to another interval set (e.g. single cell ATAC peaks => fixed genomic windows) which would create a completely new count matrix.
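
A very rough sketch of what one such helper might look like (illustrative only; it assumes coordinate columns named Chromosome/Start/End in .var and builds the PyRanges object on demand):

import pyranges as pr

def var_names_overlapping(adata, other):
    """Return the var names whose interval overlaps any interval in `other` (a pyranges.PyRanges). Sketch only."""
    # build a PyRanges on demand from the coordinate columns stored in .var,
    # carrying the var names along so hits can be mapped back to the AnnData
    df = adata.var[["Chromosome", "Start", "End"]].copy()
    df["var_name"] = adata.var_names
    gr = pr.PyRanges(df)
    # overlap() keeps the intervals of `gr` that overlap anything in `other`
    return gr.overlap(other).df["var_name"].to_numpy()

# subsetting then stays plain AnnData indexing:
# adata[:, var_names_overlapping(adata, promoter_ranges)]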

dawe avatar Oct 20 '21 17:10 dawe

Another solution would be to save just the dataframe and add pyranges-aware specific functions

Yeah, this would also work and be very easy. You'd really want a low initialization overhead, which I haven't investigated.

The most difficult operation would be summarizing intervals according to another interval set (e.g. single cell ATAC peaks => fixed genomic windows) which would create a completely new count matrix.

Is this in scope for pyranges? Would you want to just stream through the reads via something compiled?

ivirshup avatar Oct 20 '21 20:10 ivirshup

Thanks for your interest! pyranges provides interval operations using a nested containment list (https://github.com/biocore-ntnu/ncls, https://academic.oup.com/bioinformatics/article/23/11/1386/199545). As far as I am aware, pyranges constructs one nested containment list per chromosome, per strand, and all interval operations are performed on these data structures. During construction of a PyRanges object, the input dataframe of genomic ranges is split into one dataframe per chromosome, per strand; this is the most expensive step of PyRanges construction. One can re-obtain a full dataframe with all genomic ranges after some interval manipulation by explicitly requesting its construction via the PyRanges.df property. Construction of the nested containment lists themselves is very quick.

So construction of a PyRanges object could and should be done on demand only. A new PyRanges object would anyway have to be constructed any time the variables in the AnnData object are changed, because the nested containment lists are immutable as far as I am aware, so doing that only on demand is important. It may even be possible to work directly with the underlying ncls library and avoid additional overhead.

It is possible to pickle PyRanges objects, or just the dict of dataframes (per chromosome, per strand) whose construction makes up the major chunk of the PyRanges construction time. I haven't had time to look into the h5ad format, so I am not sure if and how the pyranges object should be added there. Generally construction of PyRanges objects is quite fast for the interval set sizes expected in biological experiments. And automatically loading pyranges objects from h5ad could add quite some memory overhead which may not be needed in the next analysis session.

Besides indexing operations (i.e. subsetting by overlap with another interval set), pyranges/ncls provides very convenient functions for finding overlaps between and within interval sets, for example a join method for joining two sets of genomic intervals and a cluster method (which marks clusters of overlapping intervals within one interval set with integer group ids). pyranges/ncls is Cython-based and highly performant, so finding overlaps between AnnData intervals and a second interval set could be done in compiled code such that only the group assignments are returned to the interpreted level. Then aggregation could happen with standard numpy/scipy.sparse functions, again in compiled code of course. Is that what you were thinking about, @ivirshup? But I agree with @dawe that aggregation by intervals is one of the more complicated methods (as opposed to e.g. using genomic intervals just for indexing into an AnnData). We would need to return a new AnnData instance where all layers are aggregated and where all information which cannot be matched to the new variables is discarded or otherwise handled (e.g. var, varm, adata.raw). One might also want to consider performing an aggregation by genomic intervals across multiple omics layers in a muon object.

The pyranges/ncls developer @endrebak may also be interested in this discussion and in case you do read this please do correct any errors I may have made in my brief description of the pyranges architecture above!

Because the pyranges object is more than a dataframe aligned to the var dimension, it could not go directly into the varm slot. Spontaneously I think it belongs in uns. It is not meant for direct manipulation as an aligned array/dataframe, but would have to be queried with dedicated pyranges methods (and the interface we build on top of them), so uns may be a good place. What do you think?

I could work up some more detailed implementation proposals in the next days if that generally sounds good so far.

stephenkraemer avatar Oct 21 '21 07:10 stephenkraemer

I'm ill, but will be sure to read your wall of text Monday XD.

The most difficult operation would be summarizing intervals according to another interval set (e.g. single cell ATAC peaks => fixed genomic windows) which would create a completely new count matrix.

If I understand you correctly this is just atac_peaks.tile(tile_size).count_overlaps(another_interval_set)?

endrebak avatar Oct 21 '21 07:10 endrebak

I'm ill, but will be sure to read your wall of text Monday XD.

Take care!

If I understand you correctly this is just atac_peaks.tile(tile_size).count_overlaps(another_interval_set)?

Mmm, not quite. As another_interval_set is associated with a count matrix (possibly thousands of values for each interval), one needs (naively) to:

  • create a new matrix of size cells x number of tiles
  • know which intervals fall into each tile (and their indices in the original matrix)
  • sum the values at those indices to fill each new column in the new matrix

I have done this before, but not using PyRanges; I'm a bit concerned about performance: for a human sample you have 600k+ tiles (at 5 kb) and ten times as many elements in another_interval_set. Matrix operations, on the other hand, are quite efficient as they are typically performed on a csr_matrix.
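
Naively, those three steps could be expressed as an overlap join plus a sparse indicator matrix. A rough, untested sketch (it assumes the peak coordinates live in adata.var as Chromosome/Start/End columns and that adata.X is cells x peaks):

import numpy as np
import pyranges as pr
from scipy import sparse

def aggregate_by_tiles(adata, tiles):
    """Sum peak counts into the intervals of `tiles` (a pyranges.PyRanges). Sketch only."""
    # 1) peak coordinates from .var, keeping each peak's column index in X
    peaks_df = adata.var[["Chromosome", "Start", "End"]].copy()
    peaks_df["peak_idx"] = np.arange(adata.n_vars)
    peaks = pr.PyRanges(peaks_df)

    tiles_df = tiles.df.copy()
    tiles_df["tile_idx"] = np.arange(len(tiles_df))
    tiles = pr.PyRanges(tiles_df)

    # 2) which peak falls into which tile: one row per overlapping (peak, tile) pair
    pairs = peaks.join(tiles).df[["peak_idx", "tile_idx"]]

    # 3) peaks x tiles indicator matrix; X @ indicator sums the peak values per tile
    # (a peak spanning several tiles contributes to each of them in this naive version)
    indicator = sparse.csr_matrix(
        (np.ones(len(pairs)), (pairs["peak_idx"], pairs["tile_idx"])),
        shape=(adata.n_vars, len(tiles_df)),
    )
    return adata.X @ indicator  # new cells x tiles count matrix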

dawe avatar Oct 21 '21 09:10 dawe

If you create an example with input data and the desired result, I'll look at it and suggest something.

endrebak avatar Oct 21 '21 10:10 endrebak

I like your summary Stephen. I think the concatenation, not the splitting, is the expensive step though.

endrebak avatar Oct 25 '21 10:10 endrebak

I have also realized that subsetting on one interval can be done in pure numpy, without the NCLS. Should update pyranges and post an example here
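
Roughly something like this, I think (untested sketch; half-open intervals and Chromosome/Start/End columns assumed):

import numpy as np

def single_interval_mask(var_df, chrom, start, end):
    # pure numpy boolean mask, no NCLS needed for a single query interval;
    # two intervals overlap iff each one starts before the other ends
    return (
        (var_df["Chromosome"].to_numpy() == chrom)
        & (var_df["Start"].to_numpy() < end)
        & (var_df["End"].to_numpy() > start)
    )

# e.g. adata[:, single_interval_mask(adata.var, "chr1", 1_000_000, 2_000_000)]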

endrebak avatar Oct 25 '21 10:10 endrebak

Thanks for reading through that wall of text :) It's an interesting point that subsetting could be done in pure numpy. In principle, this could also be done brute-force for multiple intervals, couldn't it? Then the question would be at which point switching to NCLS becomes more performant, considering also that the PyRanges object may have to be created on demand for this operation if it doesn't already exist from an earlier operation. I am only referring to subsetting here - for more complex stuff like groupby-agg based on interval overlap, I think going directly to pyranges/ncls is likely the best option.

stephenkraemer avatar Oct 26 '21 08:10 stephenkraemer

Sorry for the late response on this! Lots going on over here, and I've been trying to get a better idea of how people work with this kind of data. I'd like to loop in @gtca and @mffrank, who have done more work with this kind of data in the muon.atac module.

I think it makes a ton of sense for there to be good integration between anndata and some genomic range formats. I'm not sure I see this as something that would necessarily be added to anndata itself, but rather as a composition of an associated AnnData/MuData object and some ranges. This is a bit like:

Spontaneously i think it belongs into uns.


And automatically loading pyranges objects from h5ad could add quite some memory overhead which may not be needed in the next analysis session.

I think the general direction ArchR has gone in, with an array-based file format, seems like a good idea. Are you doing any work with out-of-core functionality in pyranges?

ivirshup avatar Jan 21 '22 13:01 ivirshup

Hi, great to hear from you! No worries about the delay, I can imagine there is lots going on. I am in principle still very interested in contributing to any effort in this direction. Right now I am also in the middle of something, so it might take me some weeks before I can prioritize this issue again. I am very interested in hearing from the muon guys, maybe there is already something in the works there.

stephenkraemer avatar Jan 24 '22 08:01 stephenkraemer

Updating this issue with a possible alternative: bioframe (docs, github)

Bioframe seems like an appealing way to work with genomic ranges in anndata because we wouldn't actually have to do anything. Instead of using an indexed structure, bioframe just works on pandas dataframes, so long as they have columns that can be mapped to chromosome, start, and stop for each contig. Similar to scverse packages, the API is otherwise largely functions.

Some cool things about this library:

  • Great documentation
  • Developed by open2c, another consortium of people working on omics data in python
  • Doesn't really define any classes to use

I think the biggest downside compared to pyranges is performance. However, I don't think this is so big an issue when we're dealing with use cases similar to RangedSummarizedExperiment, since the number of ranges we work with doesn't get very large. In addition pyranges requires sorted ranges with an index built on top of them.

ivirshup avatar Nov 28 '22 18:11 ivirshup

Super cool find Isaac, I didn't know about it! It'd be great to support it somehow. Then we can make queries like remove intervals overlapping with genomic regions with low mappability etc.

gokceneraslan avatar Nov 29 '22 20:11 gokceneraslan

The great thing is that we don't really need to support it!

Setting up the AnnData with genomic ranges
import scanpy as sc, pandas as pd

pbmc = sc.datasets.pbmc3k_processed().raw.to_adata()
pbmc.var = pbmc.var.rename_axis(index="hgnc_symbol")

gene_positions = (
    sc.queries.biomart_annotations(
        org="hsapiens",
        attrs=["hgnc_symbol", "chromosome_name", "start_position", "end_position"],
        use_cache=True,
    )
    .astype(
        {"chromosome_name": "category", "start_position": pd.Int64Dtype(), "end_position": pd.Int64Dtype()}
    )
)

new_var = (
    pbmc.var
    # Merging
    .join(gene_positions.set_index("hgnc_symbol"), how="left")
    # Dropping duplicated rows
    .reset_index()
    .drop_duplicates("hgnc_symbol")
    .set_index("hgnc_symbol")
    # Setting original order
    .loc[pbmc.var_names]
)
pbmc.var = new_var  # attach the annotated var table back to the AnnData

With an AnnData that has genomic range information like:

pbmc.var.iloc[10:15]
              n_cells chromosome_name  start_position  end_position
hgnc_symbol                                                        
RP11-54O7.11        4             NaN            <NA>          <NA>
ISG15            1206               1         1001138       1014540
AGRN                8               1         1020120       1056118
C1orf159           24               1         1081818       1116361
TNFRSF18           92               1         1203508       1206592

We can do:

import bioframe
bed_column_names = ("chromosome_name", "start_position", "end_position")

query_result = bioframe.select(pbmc.var, "4:0-1000000", cols=bed_column_names)
query_result.head()
             n_cells chromosome_name  start_position  end_position
hgnc_symbol                                                       
ZNF595            58               4           53286         88208
ZNF718            18               4          124501        202303
ZNF141            27               4          337814        384868
ZNF721           112               4          425815        499156
PIGG              63               4          499210        540200

And subset the original AnnData with:

pbmc[:, query_result.index]

And we may be able to make the indexing even faster in the future:

  • open2c/bioframe#128

ivirshup avatar Dec 05 '22 20:12 ivirshup

@gokceneraslan, you may also be interested in some functionality from the (currently in development) chame package by @gtca, specifically the chame.pp.filter_by_ranges function.

ivirshup avatar Dec 11 '22 15:12 ivirshup

I'm late to the party; thanks to @gokceneraslan for pointing me to this thread.

I've been working on bringing Bioconductor data structures to Python, especially those that have strong support for genomics-based analysis. I also opened this post this morning - https://github.com/scverse/anndata/issues/952 - and would love to find ways to collaborate to make GenomicRanges compatible with AnnData!

jkanche avatar Mar 17 '23 02:03 jkanche

We are going to discuss genomic ranges support in the upcoming scverse community meeting (2023-04-18 18:00 CEST). For anyone interested, feel free to chime in, we want to learn about as many use-cases as possible.

Calendar invite + zoom link are available from the meeting notes.

grst avatar Apr 12 '23 06:04 grst