annotatr icon indicating copy to clipboard operation
annotatr copied to clipboard

Annotate from GTF

Open hennion opened this issue 2 years ago • 3 comments

I use your annotation package, which is very practical and complete. We're developing a workflow supposed to work with many different species, so I would like to be able to start from any GTF file for gene annotation. I've seen that it is possible to use custom annotations, but I'm a bit lost in the different functions to do so. I though to make my own TxDb using makeTxDbFromGFF, but I don't know how to create the custom annotations then.

I would also need to add annotations of CGIs which I have as a bed file. How do I combine those pieces of information to use build_annotations ?

Thank you in advance for your help.

Magali

hennion avatar Feb 17 '23 11:02 hennion

Hi,

Thank you for the kind words. You're on the right track to use makeTxDbFromGFF, in fact, I've done that before for custom annotations on novel genomes. Perhaps the code I provide here will help you work that out in your case(s).

The reality is that build_annotations() isn't the only way to get a custom annotation into annotatr. Really any GenomicRanges object with the following extra columns will work:

  • id: some sort of internal unique identifier for the feature, you can pick this arbitrarily
  • tx_id: some notion of transcript ID associated with the feature, it can be NA (and is for the built-in enhancers, for example)
  • gene_id: some notion of gene ID (we often use Entrez Gene IDs, but you could use EMSEMBL gene IDs or FlyBase IDs, etc)
  • symbol: some notion of gene symbol
  • type: something of the form {genome}_{annotation_category}_{annotation_subcategory} for downstream use in summarization and plotting

The code I used, and which may work as a template for you:

library(Biostrings)
library(GenomicFeatures)
library(GenomicRanges)

genome = 'ARS-UI-Ramb-v2.0'

# Read in the FASTA to determine chromosome sizes
fasta = readDNAStringSet(filepath = '/nfs/turbo/epicore-active/genomes/Ovis_aries/ARS-UI-Ramb-v2.0/genome.fa')

# Split the scaffold names by space and take the first result, this was peculiar to this FASTA, your mileage may vary
fixed_scaffold_names = sapply(names(fasta), function(s){strsplit(s, ' ', fixed = TRUE)[[1]][1]}, USE.NAMES = FALSE)

# Check lengths are equal
length(fasta) == length(fixed_scaffold_names)

# Replace the names
names(fasta) = fixed_scaffold_names

ovisaries_seqinfo = Seqinfo(
    seqnames = names(fasta),
    seqlengths = width(fasta),
    isCircular = c(rep(FALSE, times = length(fasta) -1), TRUE), # the last is mitochondrial
    genome = 'ARS-UI-Ramb-v2.0')

################################################################################

# Create the chrominfo data.frame for use with makeTxDbFromGFF
ovisaries_chrominfo = data.frame(
    chrom = seqlevels(ovisaries_seqinfo),
    length = seqlengths(ovisaries_seqinfo),
    is_circular = isCircular(ovisaries_seqinfo),
    stringsAsFactors = F)

# Create txdb object from the GFF file
txdb = makeTxDbFromGFF(
    file = '/nfs/turbo/epicore-active/genomes/Ovis_aries/ARS-UI-Ramb-v2.0/GCF_016772045.1_ARS-UI_Ramb_v2.0_genomic.gff',
    dataSource = 'https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Ovis_aries/104/-annotation/',
    organism = 'Ovis aries',
    chrominfo = ovisaries_chrominfo)

########################################
# Create base transcripts
    tx_gr = transcripts(txdb, columns = c('TXID','GENEID','TXNAME'))
    genome(tx_gr) = genome

########################################
# Create the promoters
    promoters_gr = GenomicFeatures::promoters(txdb, upstream = 1000, downstream = 0)
    genome(promoters_gr) = genome
    promoters_gr = GenomicRanges::trim(promoters_gr)

    # Add Entrez ID, symbol, and type
    GenomicRanges::mcols(promoters_gr)$gene_id = promoters_gr$tx_name
    GenomicRanges::mcols(promoters_gr)$symbol = NA
    GenomicRanges::mcols(promoters_gr)$type = sprintf('%s_genes_promoters', genome)
    GenomicRanges::mcols(promoters_gr)$id = paste0('promoter:', seq_along(promoters_gr))

    GenomicRanges::mcols(promoters_gr) = GenomicRanges::mcols(promoters_gr)[, c('id','tx_name','gene_id','symbol','type')]
    colnames(GenomicRanges::mcols(promoters_gr)) = c('id','tx_id','gene_id','symbol','type')

########################################
# Create the exons
    exons_grl = GenomicFeatures::exonsBy(txdb, by = 'tx', use.names = TRUE)
    genome(exons_grl) = genome

    # Create Rle of the tx_names
    exons_txname_rle = S4Vectors::Rle(names(exons_grl), S4Vectors::elementNROWS(exons_grl))
    exons_txname_vec = as.character(exons_txname_rle)

    # Unlist and add the tx_names
    exons_gr = unlist(exons_grl, use.names = FALSE)
    GenomicRanges::mcols(exons_gr)$tx_name = exons_txname_vec

    # Add Entrez ID, symbol, and type
    GenomicRanges::mcols(exons_gr)$gene_id = exons_gr$tx_name
    GenomicRanges::mcols(exons_gr)$symbol = NA
    GenomicRanges::mcols(exons_gr)$type = sprintf('%s_genes_exons', genome)
    GenomicRanges::mcols(exons_gr)$id = paste0('exon:', seq_along(exons_gr))

    GenomicRanges::mcols(exons_gr) = GenomicRanges::mcols(exons_gr)[, c('id','tx_name','gene_id','symbol','type')]
    colnames(GenomicRanges::mcols(exons_gr)) = c('id','tx_id','gene_id','symbol','type')

########################################
# Create the introns
    introns_grl = GenomicFeatures::intronsByTranscript(txdb, use.names = TRUE)
    genome(introns_grl) = genome

    # Create Rle of the tx_names
    introns_txname_rle = S4Vectors::Rle(names(introns_grl), S4Vectors::elementNROWS(introns_grl))
    introns_txname_vec = as.character(introns_txname_rle)

    # Unlist and add the tx_names
    introns_gr = unlist(introns_grl, use.names = FALSE)
    GenomicRanges::mcols(introns_gr)$tx_name = introns_txname_vec

    # NOTE: here we match on the tx_name because the tx_id is not given
    GenomicRanges::mcols(introns_gr)$gene_id = introns_gr$tx_name
    GenomicRanges::mcols(introns_gr)$symbol = NA
    GenomicRanges::mcols(introns_gr)$type = sprintf('%s_genes_introns', genome)
    GenomicRanges::mcols(introns_gr)$id = paste0('intron:', seq_along(introns_gr))

    GenomicRanges::mcols(introns_gr) = GenomicRanges::mcols(introns_gr)[, c('id','tx_name','gene_id','symbol','type')]
    colnames(GenomicRanges::mcols(introns_gr)) = c('id','tx_id','gene_id','symbol','type')

########################################
# Create the intergenic
    genic_gr = c(GenomicRanges::granges(tx_gr), GenomicRanges::granges(promoters_gr))
    GenomicRanges::strand(genic_gr) = '*'
    intergenic_gr = GenomicRanges::reduce(genic_gr)
    intergenic_gr = GenomicRanges::gaps(intergenic_gr)

    # A quirk in gaps gives the entire + and - strand of a chromosome, ignore those
    intergenic_gr = intergenic_gr[GenomicRanges::strand(intergenic_gr) == '*']

    GenomicRanges::mcols(intergenic_gr)$id = paste0('intergenic:', seq_along(intergenic_gr))
    GenomicRanges::mcols(intergenic_gr)$tx_id = NA
    GenomicRanges::mcols(intergenic_gr)$gene_id = NA
    GenomicRanges::mcols(intergenic_gr)$symbol = NA
    GenomicRanges::mcols(intergenic_gr)$type = sprintf('%s_genes_intergenic', genome)

Hope this helps, Raymond

rcavalcante avatar Feb 21 '23 17:02 rcavalcante

Thank you so much for your answer!! I will try that! Best wishes,

Magali

hennion avatar Feb 21 '23 17:02 hennion

You're welcome. And, one more note, depending on the complexity of the GTF/GFF, you may not be able to pull out introns and exons. You're basically stuck with the descriptive resolution of the file, which very well might be promoters and gene bodies.

Note you're free to define promoters in any way you wish with the upstream and downstream parameters of GenomicFeatures::promoters().

rcavalcante avatar Feb 21 '23 17:02 rcavalcante