kma icon indicating copy to clipboard operation
kma copied to clipboard

Pre-processing with generate_introns.py failing

Open skasowitz opened this issue 8 years ago • 7 comments

When generate_introns.py begins grouping transcripts by gene I receive the following:

Traceback (most recent call last):
  File "/home/kasowitz/R/x86_64-redhat-linux-gnu-library/3.2/kma/pre-process/generate_introns.py", line 154, in <module>
    main()
  File "/home/kasowitz/R/x86_64-redhat-linux-gnu-library/3.2/kma/pre-process/generate_introns.py", line 104, in main
    g2t = intron_ops.reduce_to_gene(gtf_list)
  File "/home/kasowitz/R/x86_64-redhat-linux-gnu-library/3.2/kma/pre-process/intron_ops.py", line 45, in reduce_to_gene
    cur_gene = trans.gene_id
AttributeError: 'NoneType' object has no attribute 'gene_id'

I am using genome and annotations from Ensembl. Looking back at the previous part of this processing I see an occassional line reading 'NoneType' object has no attribute 'add_exon' which I assume are the same objects causing the intron_ops exit.

skasowitz avatar Apr 05 '16 14:04 skasowitz

I have the same issue using danRer10 Ensembl annotation and genome. The working example provided in the vignette worked fine with my settings.

slebedeva avatar Apr 13 '16 13:04 slebedeva

Similar issue here. Is it possible to provide us (or describe how to get) the annotation that you used in this publication http://nar.oxfordjournals.org/content/44/2/838.long ? Thank you in advance.

fgypas avatar Apr 20 '16 09:04 fgypas

Basically, same exact problem trying to use Arabidopsis TAIR10/Araport11 data

jthuff avatar Apr 27 '16 22:04 jthuff

In my case at the end I used the gencode annotation where I renamed the genome file to contain only the chromosome names without any suffixes.

For example in this case: ">chr1 1" i just kept ">chr1"

fgypas avatar Apr 28 '16 17:04 fgypas

In my case at the end I used the gencode annotation where I renamed the genome file to contain only the chromosome names without any suffixes.

Thanks for the suggestion! I got rid of extra stuff on each chromosome > line as @fgypas suggests, i.e. >Chr1 CHROMOSOME dumped from ADB: Jun/20/09 14:53; last updated: 2009-02-02 to >Chr1

But it still didn't work...

Then I tried switching around the transcript_id and gene_id fields in column 9 of the GTF, i.e. Chr1 Araport11 5UTR 3631 3759 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010"; to Chr1 Araport11 5UTR 3631 3759 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1";

And then it worked! Either having the extra annotations on the FASTA description lines or having transcript_id followed by gene_id in the GTF caused failure, but gave different errors. I finally got it to work by both cleaning up the FASTA description lines (as @fgypas suggests) AND swapping the GTF column 9 fields to gene_id followed by transcript_id.

jthuff avatar Apr 28 '16 20:04 jthuff

This is caused by a failure of gtf_parser.py to correctly process Ensembl-formatted GTFs (or GTFs in general). The attribute field of GTFs has no inherent order, but gtf_parser.py is assuming an order. This can be fixed by replacing this code block around line 220:

            transcript_id = (                                                                                                                
                gtf_line[8].split(";")[1].split(" ")[2].replace(                                                                             
                    "\"",                                                                                                                    
                    ""))  

with this:

            for attr in gtf_line[8].split(";"):                                                                                               
                attr_name = attr.strip().split(" ")[0]                                                                                        
                if attr_name == "transcript_id":                                                                                              
                    transcript_id = attr.strip().split(" ")[1].replace(                                                                       
                        "\"",                                                                                                                 
                        "") 
                    break

A similar fix can be applied later in gtf_parser.py to the "gene_id" attribute using the same code block but replacing the attr_name test with "gene_id" instead of "transcript_id".

stephenfloor avatar Nov 17 '16 20:11 stephenfloor

thanks, @stephenfloor ! I'm going to be making a bunch of changes in the coming days and hopefully fixing most of the reported bugs.

really sorry to those who have been waiting for a while. I've been quite busy with other projects.

pimentel avatar Dec 05 '16 19:12 pimentel