kma
kma copied to clipboard
Pre-processing with generate_introns.py failing
When generate_introns.py begins grouping transcripts by gene I receive the following:
Traceback (most recent call last):
File "/home/kasowitz/R/x86_64-redhat-linux-gnu-library/3.2/kma/pre-process/generate_introns.py", line 154, in <module>
main()
File "/home/kasowitz/R/x86_64-redhat-linux-gnu-library/3.2/kma/pre-process/generate_introns.py", line 104, in main
g2t = intron_ops.reduce_to_gene(gtf_list)
File "/home/kasowitz/R/x86_64-redhat-linux-gnu-library/3.2/kma/pre-process/intron_ops.py", line 45, in reduce_to_gene
cur_gene = trans.gene_id
AttributeError: 'NoneType' object has no attribute 'gene_id'
I am using genome and annotations from Ensembl. Looking back at the previous part of this processing I see an occassional line reading
'NoneType' object has no attribute 'add_exon'
which I assume are the same objects causing the intron_ops exit.
I have the same issue using danRer10 Ensembl annotation and genome. The working example provided in the vignette worked fine with my settings.
Similar issue here. Is it possible to provide us (or describe how to get) the annotation that you used in this publication http://nar.oxfordjournals.org/content/44/2/838.long ? Thank you in advance.
Basically, same exact problem trying to use Arabidopsis TAIR10/Araport11 data
In my case at the end I used the gencode annotation where I renamed the genome file to contain only the chromosome names without any suffixes.
For example in this case: ">chr1 1" i just kept ">chr1"
In my case at the end I used the gencode annotation where I renamed the genome file to contain only the chromosome names without any suffixes.
Thanks for the suggestion! I got rid of extra stuff on each chromosome > line as @fgypas suggests, i.e.
>Chr1 CHROMOSOME dumped from ADB: Jun/20/09 14:53; last updated: 2009-02-02
to
>Chr1
But it still didn't work...
Then I tried switching around the transcript_id and gene_id fields in column 9 of the GTF, i.e.
Chr1 Araport11 5UTR 3631 3759 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010";
to
Chr1 Araport11 5UTR 3631 3759 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1";
And then it worked! Either having the extra annotations on the FASTA description lines or having transcript_id followed by gene_id in the GTF caused failure, but gave different errors. I finally got it to work by both cleaning up the FASTA description lines (as @fgypas suggests) AND swapping the GTF column 9 fields to gene_id followed by transcript_id.
This is caused by a failure of gtf_parser.py to correctly process Ensembl-formatted GTFs (or GTFs in general). The attribute field of GTFs has no inherent order, but gtf_parser.py is assuming an order. This can be fixed by replacing this code block around line 220:
transcript_id = (
gtf_line[8].split(";")[1].split(" ")[2].replace(
"\"",
""))
with this:
for attr in gtf_line[8].split(";"):
attr_name = attr.strip().split(" ")[0]
if attr_name == "transcript_id":
transcript_id = attr.strip().split(" ")[1].replace(
"\"",
"")
break
A similar fix can be applied later in gtf_parser.py to the "gene_id" attribute using the same code block but replacing the attr_name test with "gene_id" instead of "transcript_id".
thanks, @stephenfloor ! I'm going to be making a bunch of changes in the coming days and hopefully fixing most of the reported bugs.
really sorry to those who have been waiting for a while. I've been quite busy with other projects.