annotatr
annotatr copied to clipboard
Annotate from GTF
I use your annotation package, which is very practical and complete. We're developing a workflow supposed to work with many different species, so I would like to be able to start from any GTF file for gene annotation. I've seen that it is possible to use custom annotations, but I'm a bit lost in the different functions to do so. I though to make my own TxDb using makeTxDbFromGFF, but I don't know how to create the custom annotations then.
I would also need to add annotations of CGIs which I have as a bed file. How do I combine those pieces of information to use build_annotations ?
Thank you in advance for your help.
Magali
Hi,
Thank you for the kind words. You're on the right track to use makeTxDbFromGFF, in fact, I've done that before for custom annotations on novel genomes. Perhaps the code I provide here will help you work that out in your case(s).
The reality is that build_annotations() isn't the only way to get a custom annotation into annotatr. Really any GenomicRanges object with the following extra columns will work:
- id: some sort of internal unique identifier for the feature, you can pick this arbitrarily
- tx_id: some notion of transcript ID associated with the feature, it can be NA (and is for the built-in enhancers, for example)
- gene_id: some notion of gene ID (we often use Entrez Gene IDs, but you could use EMSEMBL gene IDs or FlyBase IDs, etc)
- symbol: some notion of gene symbol
- type: something of the form
{genome}_{annotation_category}_{annotation_subcategory}for downstream use in summarization and plotting
The code I used, and which may work as a template for you:
library(Biostrings)
library(GenomicFeatures)
library(GenomicRanges)
genome = 'ARS-UI-Ramb-v2.0'
# Read in the FASTA to determine chromosome sizes
fasta = readDNAStringSet(filepath = '/nfs/turbo/epicore-active/genomes/Ovis_aries/ARS-UI-Ramb-v2.0/genome.fa')
# Split the scaffold names by space and take the first result, this was peculiar to this FASTA, your mileage may vary
fixed_scaffold_names = sapply(names(fasta), function(s){strsplit(s, ' ', fixed = TRUE)[[1]][1]}, USE.NAMES = FALSE)
# Check lengths are equal
length(fasta) == length(fixed_scaffold_names)
# Replace the names
names(fasta) = fixed_scaffold_names
ovisaries_seqinfo = Seqinfo(
seqnames = names(fasta),
seqlengths = width(fasta),
isCircular = c(rep(FALSE, times = length(fasta) -1), TRUE), # the last is mitochondrial
genome = 'ARS-UI-Ramb-v2.0')
################################################################################
# Create the chrominfo data.frame for use with makeTxDbFromGFF
ovisaries_chrominfo = data.frame(
chrom = seqlevels(ovisaries_seqinfo),
length = seqlengths(ovisaries_seqinfo),
is_circular = isCircular(ovisaries_seqinfo),
stringsAsFactors = F)
# Create txdb object from the GFF file
txdb = makeTxDbFromGFF(
file = '/nfs/turbo/epicore-active/genomes/Ovis_aries/ARS-UI-Ramb-v2.0/GCF_016772045.1_ARS-UI_Ramb_v2.0_genomic.gff',
dataSource = 'https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Ovis_aries/104/-annotation/',
organism = 'Ovis aries',
chrominfo = ovisaries_chrominfo)
########################################
# Create base transcripts
tx_gr = transcripts(txdb, columns = c('TXID','GENEID','TXNAME'))
genome(tx_gr) = genome
########################################
# Create the promoters
promoters_gr = GenomicFeatures::promoters(txdb, upstream = 1000, downstream = 0)
genome(promoters_gr) = genome
promoters_gr = GenomicRanges::trim(promoters_gr)
# Add Entrez ID, symbol, and type
GenomicRanges::mcols(promoters_gr)$gene_id = promoters_gr$tx_name
GenomicRanges::mcols(promoters_gr)$symbol = NA
GenomicRanges::mcols(promoters_gr)$type = sprintf('%s_genes_promoters', genome)
GenomicRanges::mcols(promoters_gr)$id = paste0('promoter:', seq_along(promoters_gr))
GenomicRanges::mcols(promoters_gr) = GenomicRanges::mcols(promoters_gr)[, c('id','tx_name','gene_id','symbol','type')]
colnames(GenomicRanges::mcols(promoters_gr)) = c('id','tx_id','gene_id','symbol','type')
########################################
# Create the exons
exons_grl = GenomicFeatures::exonsBy(txdb, by = 'tx', use.names = TRUE)
genome(exons_grl) = genome
# Create Rle of the tx_names
exons_txname_rle = S4Vectors::Rle(names(exons_grl), S4Vectors::elementNROWS(exons_grl))
exons_txname_vec = as.character(exons_txname_rle)
# Unlist and add the tx_names
exons_gr = unlist(exons_grl, use.names = FALSE)
GenomicRanges::mcols(exons_gr)$tx_name = exons_txname_vec
# Add Entrez ID, symbol, and type
GenomicRanges::mcols(exons_gr)$gene_id = exons_gr$tx_name
GenomicRanges::mcols(exons_gr)$symbol = NA
GenomicRanges::mcols(exons_gr)$type = sprintf('%s_genes_exons', genome)
GenomicRanges::mcols(exons_gr)$id = paste0('exon:', seq_along(exons_gr))
GenomicRanges::mcols(exons_gr) = GenomicRanges::mcols(exons_gr)[, c('id','tx_name','gene_id','symbol','type')]
colnames(GenomicRanges::mcols(exons_gr)) = c('id','tx_id','gene_id','symbol','type')
########################################
# Create the introns
introns_grl = GenomicFeatures::intronsByTranscript(txdb, use.names = TRUE)
genome(introns_grl) = genome
# Create Rle of the tx_names
introns_txname_rle = S4Vectors::Rle(names(introns_grl), S4Vectors::elementNROWS(introns_grl))
introns_txname_vec = as.character(introns_txname_rle)
# Unlist and add the tx_names
introns_gr = unlist(introns_grl, use.names = FALSE)
GenomicRanges::mcols(introns_gr)$tx_name = introns_txname_vec
# NOTE: here we match on the tx_name because the tx_id is not given
GenomicRanges::mcols(introns_gr)$gene_id = introns_gr$tx_name
GenomicRanges::mcols(introns_gr)$symbol = NA
GenomicRanges::mcols(introns_gr)$type = sprintf('%s_genes_introns', genome)
GenomicRanges::mcols(introns_gr)$id = paste0('intron:', seq_along(introns_gr))
GenomicRanges::mcols(introns_gr) = GenomicRanges::mcols(introns_gr)[, c('id','tx_name','gene_id','symbol','type')]
colnames(GenomicRanges::mcols(introns_gr)) = c('id','tx_id','gene_id','symbol','type')
########################################
# Create the intergenic
genic_gr = c(GenomicRanges::granges(tx_gr), GenomicRanges::granges(promoters_gr))
GenomicRanges::strand(genic_gr) = '*'
intergenic_gr = GenomicRanges::reduce(genic_gr)
intergenic_gr = GenomicRanges::gaps(intergenic_gr)
# A quirk in gaps gives the entire + and - strand of a chromosome, ignore those
intergenic_gr = intergenic_gr[GenomicRanges::strand(intergenic_gr) == '*']
GenomicRanges::mcols(intergenic_gr)$id = paste0('intergenic:', seq_along(intergenic_gr))
GenomicRanges::mcols(intergenic_gr)$tx_id = NA
GenomicRanges::mcols(intergenic_gr)$gene_id = NA
GenomicRanges::mcols(intergenic_gr)$symbol = NA
GenomicRanges::mcols(intergenic_gr)$type = sprintf('%s_genes_intergenic', genome)
Hope this helps, Raymond
Thank you so much for your answer!! I will try that! Best wishes,
Magali
You're welcome. And, one more note, depending on the complexity of the GTF/GFF, you may not be able to pull out introns and exons. You're basically stuck with the descriptive resolution of the file, which very well might be promoters and gene bodies.
Note you're free to define promoters in any way you wish with the upstream and downstream parameters of GenomicFeatures::promoters().