genomic-features icon indicating copy to clipboard operation
genomic-features copied to clipboard

What order should results be in?

Open ivirshup opened this issue 1 year ago • 7 comments

Description of feature

Continuing from https://github.com/scverse/genomic-features/pull/59#issuecomment-2034149437

In the mentioned PR, we found out that duckdb is inconsistent with what order it returns results in. This by and large seems fine from a duckdb point of view, but would be frustrating for users of this package.

To solve this @thomas-reimonn added an order_by statement here

Right now the order returned is based on the order of columns in the user input (I believe). Is there a better/ more canonical way to do this? Off the top of my head I would assume we'd generally want to sort by chrom and start.

We should also check what the bioc packages do here.

@nvictus, @emdann, @lauradmartens any thoughts?

ivirshup avatar Apr 03 '24 15:04 ivirshup

  • bedtools sort does chromosome then start
  • ensembldb
  • GenomicFeatures
    • Mentions order a bunch of different places in API docs, haven't full explored cases
    • Has parameters which control whether strand is considered when ordering exons

ivirshup avatar Apr 03 '24 15:04 ivirshup

Definitely default to genomic location (chr+start), option to use another column?

emdann avatar Apr 03 '24 15:04 emdann

Some of these tables don't have loci information. E.g. ibis.common.exceptions.IbisTypeError: Column 'chrom' is not found in table. Existing columns: 'gene_id', 'gene_name', 'gene_biotype', 'gene_seq_start', 'gene_seq_end', 'seq_name', 'seq_strand', 'seq_coord_system', 'description', 'gene_id_version', 'canonical_transcript'.

thomas-reimonn avatar Apr 03 '24 15:04 thomas-reimonn

@thomas-reimonn Ah, I think chrom should be seq_name

@emdann, for something like exons, where "gene_id" or "transcript_id" has been selected, so we still sort by chromosome and start? Or should we sort by the gene/ transcript start?

And then, do we consider strand for sorting exons within a transcript?

ivirshup avatar Apr 03 '24 16:04 ivirshup

It looks like tx2exon has exon_idx which I believe is the order of the exon inside the transcript

ivirshup avatar Apr 03 '24 16:04 ivirshup

I also agree by default it should be chrom + start. It seems like this is also what GenomicFeatures does and they have a separate function that returns ranges sorted by another value I think: e.g. transcriptsBy(txdb, "gene") and then the exons are sorted how they would appear in the transcript: In the manual on transcriptsBy/exonsBy:

These functions return a GRangesList object where the ranges within each of the elements are ordered according to the following rule: When using exonsBy or cdsBy with by="tx", the returned exons or CDS parts are ordered by ascending rank for each transcript, that is, by their position in the transcript. In all other cases, the transcriptsBy ranges will be ordered by chromosome, strand, start, and end values.

lauradmartens avatar Apr 03 '24 17:04 lauradmartens

I think the bioc packages go by chrom + start for the "primary" thing being queried (e.g. genes for .genes). But it looks like there is some strand specific behaviour in both if you grab exons within a gene query. Here's where I got to:

Demo

Setup

library(ensembldb)
library(EnsDb.Hsapiens.v86)
edb = EnsDb.Hsapiens.v86

library(GenomicFeatures)

samplefile <- system.file("extdata", "hg19_knownGene_sample.sqlite",
                        package="GenomicFeatures")
txdb <- loadDb(samplefile)

ensembldb getting genes but also returning exons.

Only - strand:

> head(genes(
    edb,
    columns=c("exon_seq_start", "exon_seq_end", "exon_id"),
    filter=SeqStrandFilter("-")
))

GRanges object with 6 ranges and 4 metadata columns:
                  seqnames      ranges strand | exon_seq_start exon_seq_end         exon_id         gene_id
                     <Rle>   <IRanges>  <Rle> |      <integer>    <integer>     <character>     <character>
  ENSG00000227232        1 14404-29570      - |          29534        29570 ENSE00001890219 ENSG00000227232
  ENSG00000227232        1 14404-29570      - |          24738        24891 ENSE00003507205 ENSG00000227232
  ENSG00000227232        1 14404-29570      - |          18268        18366 ENSE00003477500 ENSG00000227232
  ENSG00000227232        1 14404-29570      - |          17915        18061 ENSE00003565697 ENSG00000227232
  ENSG00000227232        1 14404-29570      - |          17606        17742 ENSE00003475637 ENSG00000227232
  ENSG00000227232        1 14404-29570      - |          17233        17368 ENSE00003502542 ENSG00000227232
  -------
  seqinfo: 308 sequences (1 circular) from GRCh38 genome

+ strand

> head(genes(
    edb,
    columns=c("exon_seq_start", "exon_seq_end", "exon_id"),
    filter=SeqStrandFilter("+")
))

GRanges object with 6 ranges and 4 metadata columns:
                  seqnames      ranges strand | exon_seq_start exon_seq_end         exon_id         gene_id
                     <Rle>   <IRanges>  <Rle> |      <integer>    <integer>     <character>     <character>
  ENSG00000223972        1 11869-14409      + |          11869        12227 ENSE00002234944 ENSG00000223972
  ENSG00000223972        1 11869-14409      + |          12613        12721 ENSE00003582793 ENSG00000223972
  ENSG00000223972        1 11869-14409      + |          13221        14409 ENSE00002312635 ENSG00000223972
  ENSG00000223972        1 11869-14409      + |          12010        12057 ENSE00001948541 ENSG00000223972
  ENSG00000223972        1 11869-14409      + |          12179        12227 ENSE00001671638 ENSG00000223972
  ENSG00000223972        1 11869-14409      + |          12613        12697 ENSE00001758273 ENSG00000223972
  -------
  seqinfo: 335 sequences (1 circular) from GRCh38 genome

GenomicFeatures

> head(genes(
    txdb,
    columns=c("exon_start", "exon_end", "exon_id"),
    filter=list(tx_strand=c("-"))
))

  1 gene was dropped because it has exons located on both strands of the same reference sequence or on more than one reference sequence, so cannot be represented by a single genomic
  range.
  Use 'single.strand.genes.only=FALSE' to get all the genes in a GRangesList object, or use suppressMessages() to suppress this message.
GRanges object with 6 ranges and 3 metadata columns:
         seqnames              ranges strand |                        exon_start                          exon_end         exon_id
            <Rle>           <IRanges>  <Rle> |                     <IntegerList>                     <IntegerList>   <IntegerList>
       1    chr19   58858172-58874214      - |    58864770,58864658,58864294,...    58864865,58864693,58864563,... 461,460,459,...
   10186    chr13   39917029-40177356      - |    40177020,40174969,39952565,...    40177356,40175527,39952663,... 347,346,345,...
  131669     chr3 126200008-126236616      - | 126236437,126229507,126228428,... 126236616,126229637,126228521,... 170,169,168,...
  139716     chrX 153903527-153979858      - | 153979229,153944301,153941482,... 153979348,153944604,153941698,... 564,563,561,...
  201725     chr4 159587827-159593407      - |     159592768,159587827,159593149 159593202,159590920,159593407,... 180,179,181,...
  221150    chr13   21727734-21750741      - |    21750514,21746759,21746478,...    21750741,21746820,21746643,... 343,342,341,...
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

ivirshup avatar Apr 03 '24 18:04 ivirshup