IsoformSwitchAnalyzeR Get annotation from TxDb?

I work for a biotech center where we get a lot of projects in honeybee, so we have our own custom GTF and transcriptome for Apis mellifera that includes several common viral RNAs. (Very frequently we find that more than half the RNA from a bee sample is just virus! Somehow the bee is ok with that.) I kept getting errors when trying to import our GTF with importRdata. I tried debugging it by going into your package code, and I think part of the problem is from CDS and exons not always having a value in the transcript_id column, but instead just pointing back to the transcript using the Parent column. That still didn't fix it though, and eventually I gave up and used the GFF from NCBI. (There aren't annotated alternative isoforms for the viral RNAs anyway.)
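To illustrate the transcript_id versus Parent issue described above, here is a minimal sketch. It is written in Python purely for illustration and is not code from IsoformSwitchAnalyzeR or rtracklayer. The function names and the example attribute strings are hypothetical. It shows one way a CDS or exon record could be mapped to its transcript when transcript_id is absent and only Parent is set.

```python
def parse_gff3_attributes(column9):
    """Parse a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def transcript_for(attrs):
    """Return the transcript identifier a CDS/exon record points to.

    Prefers an explicit transcript_id and falls back to Parent.
    Assumption: each feature has a single parent; GFF3 also allows
    comma-separated parents, which this sketch does not split.
    """
    return attrs.get("transcript_id") or attrs.get("Parent")


# Hypothetical attribute strings: one exon with transcript_id, one without.
exon_with_id = parse_gff3_attributes(
    "ID=exon-1;Parent=rna-XM_0001;transcript_id=XM_0001.1"
)
exon_without_id = parse_gff3_attributes("ID=exon-2;Parent=rna-virus1")

resolved = [transcript_for(exon_with_id), transcript_for(exon_without_id)]
print(resolved)
```

Note that the two records resolve to identifiers from different namespaces (an accession versus a Parent ID), so a real importer would still need to reconcile them with the IDs used in the transcript rows.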

I'm wondering if, instead of using rtracklayer::import to import a GTF to GRanges, importRdata or importGTF could use the TxDb class from the GenomicFeatures package. Functions from that package like transcriptsBy, exonsBy, and cdsBy could simplify a lot of the complex code that you have written. Import from GTF to TxDb is pretty straightforward with makeTxDbFromGFF. Moreover, those of us who use Bioconductor a lot might already have our annotations stored as TxDb, and it would be convenient to pass that object directly to importRdata rather than having to read the file again.

Jul 09 '20 15:07 lvclark

Thanks for reaching out.

It is an excellent idea to also support TxDb. I will put it on the enhancement list :)

With regard to your GTF problems, it sounds like you have a GFF file and not a GTF file (only GFF files have the Parent column). You could try converting the GFF file to a GTF file with a tool such as gffread, run as follows:

gffread a_gff_file.gff3 -T -o a_gtf_file.gtf

Cheers Kristoffer

Jul 09 '20 15:07 kvittingseerup

I was just looking into this, but TxDb seems not to import gene_names? Do you know if there is a generalisable trick to import them as well?

Sep 10 '20 09:09 kvittingseerup

Well, if the user has a TxDb then they probably already know something about Bioconductor and R, so maybe in that case they should also provide (via an additional argument) a named vector that translates from gene IDs to gene names or symbols. They could obtain that from their OrgDb package, or manually by working with the annotation file in rtracklayer, or by some other method depending on where they're starting from.
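As a rough sketch of the "manually from the annotation file" route, the snippet below builds a gene ID to gene name lookup from GTF-style lines. It is written in Python for illustration only; in R the result would be the named vector described above. The function name `gene_name_lookup` and the example records are hypothetical, and the sketch assumes GTF attributes of the form `key "value";`.

```python
import re

# Matches GTF attribute pairs such as: gene_id "GB1";
GTF_ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def gene_name_lookup(gtf_lines):
    """Map gene_id -> gene_name, falling back to the gene_id itself
    for genes that never carry a gene_name attribute."""
    lookup = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a complete 9-column record
        attrs = dict(GTF_ATTRIBUTE.findall(columns[8]))
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        if "gene_name" in attrs:
            lookup[gene_id] = attrs["gene_name"]
        else:
            lookup.setdefault(gene_id, gene_id)
    return lookup


# Hypothetical records: a bee gene with a name and a viral "gene" without one.
example_lines = [
    '#!made-up header',
    'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "GB1"; gene_name "Vg";',
    'chr1\tsrc\texon\t1\t50\t.\t+\t.\tgene_id "GB1"; transcript_id "GB1-RA";',
    'virus\tsrc\tgene\t1\t90\t.\t+\t.\tgene_id "DWV";',
]
id_to_name = gene_name_lookup(example_lines)
print(id_to_name)
```

Falling back to the gene ID keeps every gene in the lookup, which may matter for custom annotations where added sequences (such as viral RNAs) have no gene name.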

I guess my overall philosophy is, make sure there are options both for beginner and advanced users.

I had been thinking of taking this issue on for Hacktoberfest if you weren't in a hurry to do it yourself. But if you are taking care of it then I'll leave it to you! (Hacktoberfest = make four pull requests in October and get a free t-shirt, but I am not totally sure I'll have time for it.)

Sep 10 '20 13:09 lvclark

Thanks for the ideas.

Sounds like a great idea to add gene_names to TxDb (the ensembldb package's TxDb does have them, so maybe one could borrow something from there?). Unfortunately I will not have the time to do it myself, but if you end up finding time, don't hesitate to let me know :-)

Sep 11 '20 07:09 kvittingseerup

I'm working now on an importTxDb function. Is it critical to fill in the gene_biotype column of the isoformFeatures table?

Sep 19 '20 17:09 lvclark

That sounds very interesting. Gene biotype is a nice to have but not a need to have 😊

Cheers Kristoffer

Sep 21 '20 06:09 kvittingseerup

How did Hacktoberfest go? Did you find some time to do it? And if so, did you end up working on a function for IsoformSwitchAnalyzeR or an update for the TxDb class?

Oct 30 '20 10:10 kvittingseerup

I'm not going to finish Hacktoberfest and get my t-shirt, but I did make a lot of progress on an importTxDb function for IsoformSwitchAnalyzeR, and hopefully I can finish it up in the next month or so. I'm not going to try to update the TxDb class, given that according to that forum link you shared, it seems like the GenomicFeatures developers had a reason for not including gene names. Instead my function has an extra argument to let the user add a dictionary of gene IDs to gene names. I'll provide a tutorial on ways to create such a dictionary.

You can see what I've done so far at the bottom of the file here: https://github.com/lvclark/IsoformSwitchAnalyzeR/blob/master/R/import_data.R

At this point it's building the isoformFeatures and orfAnalysis tables, as well as the exons GRanges object. The nice thing is that the CDS annotated in TxDb objects always include the stop codon, whether the TxDb was imported from GTF or GFF, and so I was able to have my function always trim the stop codon off.
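For readers wondering what trimming the stop codon involves, here is a minimal sketch in Python. It is not the code from the importTxDb function linked above. The function name `trim_stop_codon` and the coordinates are hypothetical, and it assumes CDS ranges are 1-based, inclusive, and all belong to one transcript. The main subtlety is that the stop codon can be split across two CDS ranges, and that the 3' end is the highest coordinate on the plus strand but the lowest on the minus strand.

```python
def trim_stop_codon(cds_ranges, strand, n_bases=3):
    """Remove the last n_bases (the stop codon) from the 3' end of a CDS.

    cds_ranges: list of (start, end) tuples, 1-based inclusive, any order.
    strand: "+" or "-".
    Returns the trimmed ranges sorted by genomic start.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    ranges = sorted(cds_ranges)
    # Walk from the 3' end: highest coordinates on "+", lowest on "-".
    if strand == "+":
        ranges.reverse()
    trimmed = []
    remaining = n_bases
    for start, end in ranges:
        width = end - start + 1
        if remaining >= width:
            # This whole range is part of the stop codon: drop it.
            remaining -= width
            continue
        if remaining > 0:
            if strand == "+":
                end -= remaining
            else:
                start += remaining
            remaining = 0
        trimmed.append((start, end))
    return sorted(trimmed)


# Hypothetical two-exon CDS on each strand, plus a split stop codon.
plus_trimmed = trim_stop_codon([(100, 200), (300, 350)], "+")
minus_trimmed = trim_stop_codon([(100, 200), (300, 350)], "-")
split_trimmed = trim_stop_codon([(100, 200), (300, 301)], "+")
print(plus_trimmed, minus_trimmed, split_trimmed)
```

This only works uniformly if the CDS always includes the stop codon, which is the property of TxDb objects noted above.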

I still need to provide functionality for a lot of the arguments that I copied over from importGTF, like extractAaSeq, ignoreAfterBar, removeFusionTranscripts, etc. I also need to thoroughly compare the output to that of importGTF and see if any debugging is necessary where they differ. Lastly I need to document the function, and check that the whole pipeline works if I start from TxDb. I wasn't going to edit importRdata because I assume you would want control over exactly if and how that was done.

Oct 30 '20 14:10 lvclark

Thanks for the effort and update - sounds really useful :-)

Nov 06 '20 08:11 kvittingseerup
