annotatr
Refactor build annotation code
Components of this:
- A function to generate a mapping table of gene IDs, transcript IDs, and gene symbols:
build_mapping_table(
orgdb,
columns
)
- A function to append the annotation metadata:
append_annotation_metadata(
gr,
id_maps_df,
id = NULL,
type = NULL,
gene_id_cols = c(gr = 'GENEID', id_maps_df = 'ENTREZID'),
tx_id_col = 'TXNAME',
symbol_col = 'SYMBOL'
)
- A function to build annotations from data downloaded via URL. If tx_id, gene_id, and symbol are to be used, it would be necessary to get the data first, and write some code to create those vectors prior to using this function.
build_annotations_from_url(
url,
id,
tx_id,
gene_id,
symbol,
type
)
- A function to build annotations from AnnotationHub. There will be a function that does a single accession, and then the existing build_ah_annots() will handle multiple accessions if needed. As above, if tx_id, gene_id, and symbol are to be used, it would be necessary to get the data first, and write some code to create those vectors prior to using this function.
build_annotations_from_annotation_hub(
ah_acc,
id,
tx_id,
gene_id,
symbol,
type
)
- A function to build annotations from any TxDb object and a set of id_maps:
build_annotations_from_txdb(
txdb,
id_maps,
distal_promoter,
distal_start = 4000,
distal_end = 1000,
proximal_promoter,
proximal_upstream = 1000,
proximal_downstream = 0,
CDS,
5UTRs,
exons,
firstexons,
introns,
intronexonboundaries,
exonintronboundaries,
3UTRs,
intergenic
)
- Special cases for build_annotations_from_txdb() where gene ID columns are not ENSEMBL, REFSEQ, or ENTREZID. There are also cases where .1 is appended to gene names (Fly, I think). In other words, this is full of inconsistencies and edge cases, and we need a general solution to make this work.
  - Arabidopsis uses a TAIR column
  - C. elegans uses a WORMBASE column
  - D. melanogaster uses a FLYBASE column
- A function to build CpG-type annotations from any base GRanges object. The idea is that shores extend 2000 bp out from the edges of islands, shelves extend 2000 bp out from the shores, and interCGI is the space in between.
build_cpg_annots(
genome,
islands_gr,
islands,
shores,
shelves,
interCGI
)
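As a rough illustration of the shore/shelf/interCGI logic described for build_cpg_annots(), here is a minimal sketch using GenomicRanges. The helper name sketch_cpg_annots() and its flank_width argument are hypothetical, not the proposed signature: the genome argument and the per-type on/off flags are left out, and strand is ignored. It assumes seqlengths are set on the input, so that trim() and gaps() have chromosome bounds to work with.

```r
library(GenomicRanges)

# Hypothetical helper for illustration only; not part of annotatr.
# Assumes islands_gr carries seqlengths so trim() and gaps() know chromosome ends.
sketch_cpg_annots <- function(islands_gr, flank_width = 2000L) {
  # CpG features are unstranded, so drop strand before any set operations.
  strand(islands_gr) <- "*"
  islands <- reduce(islands_gr)

  # Regions of flank_width on both sides of x, clipped to chromosome bounds.
  both_flanks <- function(x) {
    up <- flank(x, width = flank_width, start = TRUE)
    down <- flank(x, width = flank_width, start = FALSE)
    reduce(trim(suppressWarnings(c(up, down))))
  }

  # Shores: flanks of islands, minus anything that is itself an island.
  shores <- setdiff(both_flanks(islands), islands)

  # Shelves: flanks of shores, minus islands and shores.
  shelves <- setdiff(both_flanks(shores), reduce(c(islands, shores)))

  # interCGI: whatever is left. gaps() also returns whole-chromosome ranges
  # for the "+" and "-" strands, so keep only the unstranded gaps.
  covered <- reduce(c(islands, shores, shelves))
  inter <- gaps(covered)
  inter <- inter[strand(inter) == "*"]

  list(islands = islands, shores = shores, shelves = shelves, interCGI = inter)
}

# Toy input: one island on a 20 kb chromosome.
toy_islands <- GRanges("chr1", IRanges(10001, 11000),
                       seqlengths = c(chr1 = 20000))
toy_annots <- sketch_cpg_annots(toy_islands)
```

Subtracting the inner features at each step keeps the four types disjoint when neighbouring islands sit closer together than the flank width, which is probably the behavior we want but is a design choice worth confirming.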
When finished, #21 and #30 should implicitly be complete.
Something to keep straight in trying to make a general solution is the different usages of TXID, GENEID, and TXNAME in the TxDb objects.
For instance, in creating a TxDb object from a GFF, sometimes the gene symbol ends up in the GENEID column instead of an Entrez ID, which creates problems in the current building code.
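One way to guard against this might be to inspect what the GENEID column actually holds before choosing a join column. The sketch below is base R only; both helper names are hypothetical, and the regular expressions are heuristics (all-digit values are treated as Entrez IDs, and the ENSEMBL pattern only covers IDs of the form ENS...G followed by digits), so TAIR, WORMBASE, FLYBASE, and symbol-like values all fall through to "other".

```r
# Hypothetical helpers for illustration only; not part of annotatr.

# Drop a trailing version suffix such as ".1" from IDs.
strip_version_suffix <- function(ids) {
  sub("\\.[0-9]+$", "", ids)
}

# Heuristically guess what kind of identifier a GENEID column contains.
guess_gene_id_type <- function(ids) {
  ids <- as.character(ids[!is.na(ids)])
  if (length(ids) == 0L) {
    return("unknown")
  }
  # Entrez gene IDs are plain integers.
  if (all(grepl("^[0-9]+$", ids))) {
    return("ENTREZID")
  }
  # Ensembl gene IDs look like ENSG... or ENSMUSG..., possibly versioned.
  if (all(grepl("^ENS[A-Z]*G[0-9]+$", strip_version_suffix(ids)))) {
    return("ENSEMBL")
  }
  # Symbols and organism-specific IDs need case-by-case handling.
  "other"
}

guess_gene_id_type(c("7157", "672"))        # all digits
guess_gene_id_type(c("TP53", "BRCA1"))      # gene symbols from a GFF
guess_gene_id_type("ENSG00000141510.17")    # versioned Ensembl ID
strip_version_suffix("AT1G01010.1")
```

The result could then drive which column of the mapping table is used in the join, instead of assuming GENEID always matches ENTREZID.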