genomic-features What order should results be in?

Description of feature

Continuing from https://github.com/scverse/genomic-features/pull/59#issuecomment-2034149437

In the mentioned PR, we found out that duckdb is inconsistent with what order it returns results in. This by and large seems fine from a duckdb point of view, but would be frustrating for users of this package.

To solve this @thomas-reimonn added an order_by statement here

Right now the order returned is based on the order of columns in the user input (I believe). Is there a better/ more canonical way to do this? Off the top of my head I would assume we'd generally want to sort by chrom and start.

We should also check what the bioc packages do here.

@nvictus, @emdann, @lauradmartens any thoughts?

Apr 03 '24 15:04 ivirshup

bedtools sort does chromosome then start
ensembldb
- Old release notes say "Chrom then start"
- Also offers order.by argument
GenomicFeatures
- Mentions order a bunch of different places in API docs, haven't full explored cases
- Has parameters which control whether strand is considered when ordering exons

Apr 03 '24 15:04 ivirshup

Definitely default to genomic location (chr+start), option to use another column?

Apr 03 '24 15:04 emdann

Some of these tables don't have loci information. E.g. ibis.common.exceptions.IbisTypeError: Column 'chrom' is not found in table. Existing columns: 'gene_id', 'gene_name', 'gene_biotype', 'gene_seq_start', 'gene_seq_end', 'seq_name', 'seq_strand', 'seq_coord_system', 'description', 'gene_id_version', 'canonical_transcript'.

Apr 03 '24 15:04 thomas-reimonn

@thomas-reimonn Ah, I think chrom should be seq_name

@emdann, for something like exons, where "gene_id" or "transcript_id" has been selected, so we still sort by chromosome and start? Or should we sort by the gene/ transcript start?

And then, do we consider strand for sorting exons within a transcript?

Apr 03 '24 16:04 ivirshup

It looks like tx2exon has exon_idx which I believe is the order of the exon inside the transcript

Apr 03 '24 16:04 ivirshup

I also agree by default it should be chrom + start. It seems like this is also what GenomicFeatures does and they have a separate function that returns ranges sorted by another value I think: e.g. transcriptsBy(txdb, "gene") and then the exons are sorted how they would appear in the transcript: In the manual on transcriptsBy/exonsBy:

These functions return a GRangesList object where the ranges within each of the elements are ordered according to the following rule: When using exonsBy or cdsBy with by="tx", the returned exons or CDS parts are ordered by ascending rank for each transcript, that is, by their position in the transcript. In all other cases, the transcriptsBy ranges will be ordered by chromosome, strand, start, and end values.

Apr 03 '24 17:04 lauradmartens

I think the bioc packages go by chrom + start for the "primary" thing being queried (e.g. genes for .genes). But it looks like there is some strand specific behaviour in both if you grab exons within a gene query. Here's where I got to:

Demo

Setup

library(ensembldb)
library(EnsDb.Hsapiens.v86)
edb = EnsDb.Hsapiens.v86

library(GenomicFeatures)

samplefile <- system.file("extdata", "hg19_knownGene_sample.sqlite",
                        package="GenomicFeatures")
txdb <- loadDb(samplefile)

`ensembldb` getting genes but also returning exons.

Only `-` strand:

> head(genes(
    edb,
    columns=c("exon_seq_start", "exon_seq_end", "exon_id"),
    filter=SeqStrandFilter("-")
))

GRanges object with 6 ranges and 4 metadata columns:
                  seqnames      ranges strand | exon_seq_start exon_seq_end         exon_id         gene_id
                     <Rle>   <IRanges>  <Rle> |      <integer>    <integer>     <character>     <character>
  ENSG00000227232        1 14404-29570      - |          29534        29570 ENSE00001890219 ENSG00000227232
  ENSG00000227232        1 14404-29570      - |          24738        24891 ENSE00003507205 ENSG00000227232
  ENSG00000227232        1 14404-29570      - |          18268        18366 ENSE00003477500 ENSG00000227232
  ENSG00000227232        1 14404-29570      - |          17915        18061 ENSE00003565697 ENSG00000227232
  ENSG00000227232        1 14404-29570      - |          17606        17742 ENSE00003475637 ENSG00000227232
  ENSG00000227232        1 14404-29570      - |          17233        17368 ENSE00003502542 ENSG00000227232
  -------
  seqinfo: 308 sequences (1 circular) from GRCh38 genome

`+ strand`

> head(genes(
    edb,
    columns=c("exon_seq_start", "exon_seq_end", "exon_id"),
    filter=SeqStrandFilter("+")
))

GRanges object with 6 ranges and 4 metadata columns:
                  seqnames      ranges strand | exon_seq_start exon_seq_end         exon_id         gene_id
                     <Rle>   <IRanges>  <Rle> |      <integer>    <integer>     <character>     <character>
  ENSG00000223972        1 11869-14409      + |          11869        12227 ENSE00002234944 ENSG00000223972
  ENSG00000223972        1 11869-14409      + |          12613        12721 ENSE00003582793 ENSG00000223972
  ENSG00000223972        1 11869-14409      + |          13221        14409 ENSE00002312635 ENSG00000223972
  ENSG00000223972        1 11869-14409      + |          12010        12057 ENSE00001948541 ENSG00000223972
  ENSG00000223972        1 11869-14409      + |          12179        12227 ENSE00001671638 ENSG00000223972
  ENSG00000223972        1 11869-14409      + |          12613        12697 ENSE00001758273 ENSG00000223972
  -------
  seqinfo: 335 sequences (1 circular) from GRCh38 genome