IsoformSwitchAnalyzeR
Get annotation from TxDb?
I work for a biotech center where we get a lot of projects in honeybee, so we have our own custom GTF and transcriptome for Apis mellifera that includes several common viral RNAs. (Very frequently we find that more than half the RNA from a bee sample is just virus! Somehow the bee is ok with that.) I kept getting errors when trying to import our GTF with importRdata. I tried debugging it by going into your package code, and I think part of the problem is from CDS and exons not always having a value in the transcript_id column, but instead just pointing back to the transcript using the Parent column. That still didn't fix it though, and eventually I gave up and used the GFF from NCBI. (There aren't annotated alternative isoforms for the viral RNAs anyway.)
I'm wondering if, instead of using rtracklayer::import to import a GTF to GRanges, importRdata or importGTF could use the TxDb class from the GenomicFeatures package. Functions from that package like transcriptsBy, exonsBy, and cdsBy could simplify a lot of the complex code that you have written. Import from GTF to TxDb is pretty straightforward with makeTxDbFromGFF. Moreover, those of us who use Bioconductor a lot might already have our annotations stored as TxDb, and it would be convenient to pass that object directly to importRdata rather than having to read the file again.
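A minimal sketch of the TxDb route described above, assuming a hypothetical file name (custom_annotation.gtf). This is not IsoformSwitchAnalyzeR code, only an illustration of the GenomicFeatures calls mentioned. In recent Bioconductor releases makeTxDbFromGFF is provided by the txdbmaker package rather than GenomicFeatures, so the library call may need adjusting.

```r
library(GenomicFeatures)

# Hypothetical path; substitute your own GTF or GFF3 file.
annotation_file <- "custom_annotation.gtf"

# Build the TxDb once; format = "auto" tries to detect GTF vs GFF3.
txdb <- makeTxDbFromGFF(annotation_file, format = "auto")

# Transcripts grouped by gene (a GRangesList named by gene ID).
tx_by_gene <- transcriptsBy(txdb, by = "gene")

# Exons and CDS grouped by transcript; use.names = TRUE labels each
# list element with the transcript name instead of the internal ID.
exons_by_tx <- exonsBy(txdb, by = "tx", use.names = TRUE)
cds_by_tx   <- cdsBy(txdb, by = "tx", use.names = TRUE)
```

Because the grouping is done by the accessor functions, the caller does not need to resolve transcript_id versus Parent links by hand.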
Thanks for reaching out.
It is an excellent idea to also support TxDb. I will put it on the enhancement list :)
With regards to your GTF problems, it sounds like you have a GFF file and not a GTF file (only GFF files have the Parent column). You could try converting the GFF file to a GTF file with a tool such as gffread, by running it as follows:
gffread a_gff_file.gff3 -T -o a_gtf_file.gtf
Cheers Kristoffer
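One quick way to check which flavour of file you have is to look at the attribute columns after import. This is a sketch only, assuming a hypothetical file name (annotation_file.gff3) and that rtracklayer can parse the file:

```r
library(rtracklayer)

# Hypothetical path; substitute your own annotation file.
gr <- import("annotation_file.gff3")

# Attribute fields end up as metadata columns of the GRanges object.
attribute_columns <- colnames(mcols(gr))

# GFF3 files typically link features through ID/Parent, while GTF files
# carry gene_id/transcript_id on every feature line.
c(has_parent        = "Parent" %in% attribute_columns,
  has_transcript_id = "transcript_id" %in% attribute_columns)
```

If has_parent is TRUE, the file is most likely GFF3-style and the gffread conversion above may help.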
I was just looking into this, but TxDb does not seem to import gene_names. Do you know if there is a generalisable trick to import them as well?
Well, if the user has a TxDb then they probably already know something about Bioconductor and R, so maybe in that case they should also provide (via an additional argument) a named vector that translates from gene IDs to gene names or symbols. They could obtain that from their OrgDb package, or manually by working with the annotation file in rtracklayer, or some other method depending on where they're starting from.
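As a rough sketch of the rtracklayer route, the named vector could be built as follows. The file name and the object name gene_id_to_name are hypothetical, and the code assumes the annotation file carries both gene_id and gene_name attributes, which is not true of every GTF:

```r
library(rtracklayer)

# Hypothetical path; substitute your own annotation file.
gr <- import("custom_annotation.gtf")

# Assumption: the file has gene_id and gene_name attribute columns.
anno <- unique(as.data.frame(mcols(gr)[, c("gene_id", "gene_name")]))

# Drop incomplete rows and keep one name per gene ID.
anno <- anno[!is.na(anno$gene_id) & !is.na(anno$gene_name), ]
anno <- anno[!duplicated(anno$gene_id), ]

# Named character vector: names are gene IDs, values are gene names.
gene_id_to_name <- setNames(anno$gene_name, anno$gene_id)
```

A lookup is then just gene_id_to_name[some_gene_ids], which returns NA for IDs without a known name.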
I guess my overall philosophy is, make sure there are options both for beginner and advanced users.
I had been thinking of taking this issue on for Hacktoberfest if you weren't in a hurry to do it yourself. But if you are taking care of it then I'll leave it to you! (Hacktoberfest = make four pull requests in October and get a free t-shirt, but I am not totally sure I'll have time for it.)
Thanks for the ideas.
Sounds like a great idea to add gene_names to TxDb (the TxDb in the ensembldb package does have them, so maybe one could borrow something from there?). Unfortunately I will not have the time to do it myself, but if you end up finding time, don't hesitate to let me know :-)
I'm working now on an importTxDb function. Is it critical to fill in the gene_biotype column of the isoformFeatures table?
That sounds very interesting. Gene biotype is a nice to have but not a need to have 😊
Cheers Kristoffer
On Sat, 19 Sep 2020 at 19:37, Lindsay Clark wrote:
I'm working now on an importTxDb function. Is it critical to fill in the gene_biotype column of the isoformFeatures table?
How did Hacktoberfest go? Did you find some time to do it? And if so, did you end up working on a function for IsoformSwitchAnalyzeR or an update for the TxDb class?
I'm not going to finish Hacktoberfest and get my t-shirt, but I did make a lot of progress on an importTxDb function for IsoformSwitchAnalyzeR, and hopefully I can finish it up in the next month or so. I'm not going to try to update the TxDb class, given that according to that forum link you shared, it seems like the GenomicFeatures developers had a reason for not including gene names. Instead my function has an extra argument to let the user add a dictionary of gene IDs to gene names. I'll provide a tutorial on ways to create such a dictionary.
You can see what I've done so far at the bottom of the file here: https://github.com/lvclark/IsoformSwitchAnalyzeR/blob/master/R/import_data.R
At this point it's building the isoformFeatures and orfAnalysis tables, as well as the exons GRanges object. The nice thing is that the CDS annotated in TxDb objects always include the stop codon, whether the TxDb was imported from GTF or GFF, and so I was able to have my function always trim the stop codon off.
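To illustrate the stop codon point, here is a toy sketch, not the actual importTxDb code. It assumes, as described above, that the CDS ranges include the stop codon; the transcript names and coordinates are made up, and cds_by_tx stands in for the output of cdsBy(txdb, by = "tx", use.names = TRUE):

```r
library(GenomicRanges)

# Made-up CDS ranges for two hypothetical transcripts.
cds_by_tx <- GRangesList(
  tx1 = GRanges("chr1", IRanges(start = c(101, 301), end = c(200, 350)),
                strand = "+"),
  tx2 = GRanges("chr1", IRanges(start = 501, end = 560), strand = "-")
)

# Total CDS length per transcript, stop codon included.
cds_length <- sum(width(cds_by_tx))

# ORF length without the stop codon, in nucleotides and amino acids.
orf_length_nt <- cds_length - 3L
orf_length_aa <- orf_length_nt / 3
```

This only adjusts lengths. Trimming the genomic coordinates themselves needs strand-aware handling, since the last three nucleotides of the CDS can span an exon boundary.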
I still need to provide functionality for a lot of the arguments that I copied over from importGTF, like extractAaSeq, ignoreAfterBar, removeFusionTranscripts, etc. I also need to thoroughly compare the output to that of importGTF and see if any debugging is necessary where they differ. Lastly I need to document the function, and check that the whole pipeline works if I start from TxDb. I wasn't going to edit importRdata because I assume you would want control over exactly if and how that was done.
Thanks for the effort and update - sounds really useful :-)