Refactor build annotation code

Open rcavalcante opened this issue 5 years ago • 1 comments

Components of this:

  • A function to generate a mapping table of gene IDs, transcript IDs, and gene symbols:
build_mapping_table(
    orgdb,
    columns
)
  • A function to append the annotation metadata:
append_annotation_metadata(
    gr,
    id_maps_df,
    id = NULL,
    type = NULL,
    gene_id_cols = c(gr = 'GENEID', id_maps_df = 'ENTREZID'),
    tx_id_col = 'TXNAME',
    symbol_col = 'SYMBOL'
)
  • A function to build annotations from data downloaded via URL. If tx_id, gene_id, and symbol are to be used, it would be necessary to get the data first, and write some code to create those vectors prior to using this function.
build_annotations_from_url(
    url, 
    id, 
    tx_id, 
    gene_id, 
    symbol, 
    type
)
  • A function to build annotations from AnnotationHub. There will be a function that handles a single accession, and the existing build_ah_annots() will then handle multiple accessions if needed. As above, if tx_id, gene_id, and symbol are to be used, it would be necessary to get the data first and write some code to create those vectors prior to using this function.
build_annotations_from_annotation_hub(
    ah_acc, 
    id, 
    tx_id, 
    gene_id, 
    symbol, 
    type
)
  • A function to build annotations from any TxDb object and a set of id_maps:
build_annotations_from_txdb(
    txdb, 
    id_maps, 
    distal_promoter, 
    distal_start = 4000, 
    distal_end = 1000, 
    proximal_promoter, 
    proximal_upstream = 1000, 
    proximal_downstream = 0,
    CDS, 
    5UTRs, 
    exons, 
    firstexons, 
    introns, 
    intronexonboundaries, 
    exonintronboundaries, 
    3UTRs,
    intergenic
)
  • Special cases for build_annotations_from_txdb() where gene ID columns are not ENSEMBL, REFSEQ, or ENTREZID. There are also cases where .1 is appended to gene names (Fly, I think). In other words, this is full of inconsistencies and edge cases, and we need a general solution to make this work.
    • Arabidopsis uses a TAIR column
    • C. elegans uses a WORMBASE column
    • D. melanogaster uses a FLYBASE column
  • A function to build CpG-type annotations from any base GRanges object. The idea is that shores extend 2000bp out from the edges of islands, shelves extend a further 2000bp out from the shores, and interCGI is the space in between.
build_cpg_annots(
    genome,
    islands_gr,
    islands,
    shores,
    shelves,
    interCGI
)
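The island/shore/shelf/interCGI geometry described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: it works on a plain data.frame of 1-based closed intervals for a single chromosome, assumes the islands are sorted and far enough apart that their flanks never collide, and does not clamp downstream flanks to the chromosome length. The names `flank_band()` and `inter_cgi()` are hypothetical; a real version would operate on GRanges and trim overlaps.

```r
# Hypothetical helper: regions lying between `inner` and `outer` bp away
# from the island edges, on both sides of each island.
flank_band <- function(islands, inner, outer) {
    up <- data.frame(
        start = pmax(islands$start - outer, 1L),  # clamp at chromosome start
        end = islands$start - inner - 1L
    )
    down <- data.frame(
        start = islands$end + inner + 1L,
        end = islands$end + outer
    )
    band <- rbind(up, down)
    band <- band[band$end >= band$start, ]  # drop flanks clamped to nothing
    band <- band[order(band$start), ]
    rownames(band) <- NULL
    band
}

# Hypothetical helper: the space left over once every island has been
# extended by `outer` bp (island + shores + shelves) on both sides.
inter_cgi <- function(islands, outer, chrom_length) {
    foot_start <- pmax(islands$start - outer, 1L)
    foot_end <- pmin(islands$end + outer, chrom_length)
    gaps <- data.frame(
        start = c(1L, foot_end + 1L),
        end = c(foot_start - 1L, chrom_length)
    )
    gaps <- gaps[gaps$end >= gaps$start, ]
    rownames(gaps) <- NULL
    gaps
}

# Made-up coordinates, for illustration only.
islands <- data.frame(start = c(10001L, 50001L), end = c(11000L, 52000L))

shores <- flank_band(islands, inner = 0L, outer = 2000L)
shelves <- flank_band(islands, inner = 2000L, outer = 4000L)
inter <- inter_cgi(islands, outer = 4000L, chrom_length = 100000L)
```

With these inputs the first island gets shores at 8001-10000 and 11001-13000, shelves at 6001-8000 and 13001-15000, and the first interCGI region covers 1-6000.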

When finished, #21 and #30 should implicitly be complete.

rcavalcante · Apr 22 '20 11:04

Something to keep straight in trying to make a general solution is the different usages of TXID, GENEID, and TXNAME in the TxDb objects.

For instance, in creating a TxDb object from a GFF, sometimes the gene symbol ends up in the GENEID column instead of an Entrez ID, which creates problems in the current building code.
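One possible way to guard against these cases is a small set of checks run before the building code. The sketch below is base R only, and every function name and the priority order are assumptions made for illustration. It picks a gene ID column from whatever columns are available (including the TAIR, WORMBASE, and FLYBASE cases listed in the issue), tests heuristically whether GENEID values look like Entrez IDs, and strips a trailing version suffix such as `.1`.

```r
# Hypothetical priority order; the organism-specific entries are the
# special cases named in the issue.
id_col_priority <- c('ENTREZID', 'ENSEMBL', 'REFSEQ',
                     'TAIR', 'WORMBASE', 'FLYBASE')

# Return the first known gene ID column present in `available_cols`.
pick_gene_id_col <- function(available_cols, priority = id_col_priority) {
    hits <- priority[priority %in% available_cols]
    if (length(hits) == 0) {
        stop('No known gene ID column found among: ',
             paste(available_cols, collapse = ', '))
    }
    hits[1]
}

# Heuristic only: Entrez gene IDs are purely numeric, gene symbols are not.
looks_like_entrez <- function(ids) {
    ids <- ids[!is.na(ids)]
    length(ids) > 0 && all(grepl('^[0-9]+$', ids))
}

# Remove a trailing ".<digits>" version suffix, leaving other IDs untouched.
strip_version <- function(ids) {
    sub('\\.[0-9]+$', '', ids)
}

pick_gene_id_col(c('GENENAME', 'SYMBOL', 'TAIR'))  # "TAIR"
looks_like_entrez(c('7157', '672'))                # TRUE
looks_like_entrez(c('TP53', 'BRCA1'))              # FALSE
strip_version(c('GeneA.1', 'GeneB'))               # "GeneA" "GeneB"
```

If `looks_like_entrez()` returns FALSE for a TxDb's GENEID column, the join against the mapping table could switch to the symbol column instead of assuming Entrez IDs. That decision is left open here.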

rcavalcante · Apr 22 '20 11:04