spaln
spaln copied to clipboard
Extremely long introns.
Hi,
I found some extremly long introns in spaln results (when mapping protein sequences to genome), some are ~1Mb, is there a way to limit intron length when running splan?
Best, Kun
Dear Kun,
According to Refseq, the longest intron of human is longer than 1.2Mbp, so that an intron longer than 1Mb might not be illegal. However, very long introns can be an artefact, as in general, it is difficult to identify the first or last few introns when the corresponding coding exon is not long enough or highly divergent. Alternatively, your genome might have structural variation at the relevant region. Even in such cases, spaln try to find the most plausible ones up or down to IntronPrm.maxl bp apart from the end of the currently aligned position, which might cause the artefacts. You may have an idea about what happens by individually running spaln -O1 to see the alignment of a problematic case.
IntronPrm.maxl is defined as the 99% quantile of the intron-length distribution specified by the -T option, but currently it cannot be directly modulated by a command line option. Instead, -XQq can control the block level extension (q also specifies the q quantile of the intron-length distribution) from the core blocks (having block scores greater than a threshold). For example, you may try -XQ90 to reduce the extension of the genomic area outside the core region at the risk that true long introns may be missed.
Osamu,
Dear Osamu,
Thank you for your reply, and I will try -XQ option to see what happen.
Best regards, Kun
Hi Osamu,
How to find IntronPrm.maxl file? I want to edit 99% to a smaller value, some intron length is still abnormal.
Here are some statistics about mRNA and intron length generated by spaln.
Zja is a monocot plant I want to annotate. zja.evigene.okay.aa is amino acid sequences file (okay set) generated by evigene pipeline (http://arthropods.eugenes.org/EvidentialGene/evigene/, the input est set is RNA-bloom assembly), this aa set is good enough according to BUSCO results.
gffread -V means delete mRNAs which with in-frame stop codon.
In the following table, Zja.p1 means: spaln -t10 -M4 -Q7 -O0 -LS -ya012 -yX1 -d spaln zja.evigene.okay.aa > zja.evigene.okay.aa1.gff3 + gffread -V Zja.p2 means: spaln -t10 -XQ90 -M4 -Q7 -O0 -LS -ya012 -yX1 -d spaln zja.evigene.okay.aa > zja.evigene.okay.aa2.gff3 + gffread -V Zja.p3 means: spaln -t10 -XQ80 -M4 -Q7 -O0 -LS -ya012 -yX1 -d spaln zja.evigene.okay.aa > zja.evigene.okay.aa3.gff3 + gffread -V
As we can see, XQ90 worked, but the average intron length is still unacceptable comparing with other monocot plant.
The phenomenon is not only occured in monocot plants, but also in dicot plants and animals.
I want to use spaln results to train augustus, so I need to get more accurately results. Do you have some more suggestions?
Best, Kun
Dear Kun,
Sorry for the long period of not responding to your comment.
I have just updated a new version Ver2.4.12 (by mistakes I have already used Ver2.4.10 and Ver2.4.11, so that the version number is discontinuous). By carefully reexamining the codes, I found several potential causes of extremely long introns you reported. The most problematic was that the previous version tried to find remote terminal (first or last) exons without seeking for closer candidates. In the new version, this problem has been fixed. I have also remedied several problematic changes introduced in the course of version ups. As a whole, the new version successfully reduces the error rates upon several test runs. I hope these modification help you.
Although not explicitly documented, this version allows to read binary form of parameter files, such as Splice5.dat, CodePotTab.dat. In the next release, I will upload C++ programs and scripts to generate these files and to convert between text and binary files, when the genomic sequence and a set of transcript sequences are given.
Osamu,
Dear Osamu,
Thank you very much, I will try Ver2.4.12 and feadback results.
Best, Kun
Hi Osamu,
In fact, there is another problem according to above table, the single CDS gene number of spaln results is larger than expected even when the input protein sequences are not fragemented.
Best, Kun
Any updates or input on that issue? I tried your suggestions, like using -XQ paramter which does not seem to change the output. I also provided a parameter table for Dicotyledons, yet some proteins that are quite divergent align with introns greater than 4000bp, which is unusually long for the species I work with.
thank you, Gil
Dear Gil,
These days I have been involved in speeding up spaln. I haven’t made any change concerning this matter. Spaln does not set strict limitation on the maximum intron length. Although you can control it indirectly through limiting the distance between the core block(s) and furthest block, long intron can be predicted, for example, when homologous genes or pseudogenes are tandemly repeated on the genome. I am very happy if you send me a few examples of pair of genomic fragment and query sequences for which you saw long introns.
Osamu,
Thank you for the reply, we're indeed working with a tandemly repeated gene and possibly pseudogenes. I'm not sure how to limit block numbers in a constructive way. I'm happy to share the sequences, but not publicly. Perhaps by e-mail?
Gil
Dear Gil,
Please send me the genomic sequence that contains ‘predicted’ gene area (around -1kb upstream from the initiation codon to around +1kb downstream from the termination codon) together with the query sequence. If possible, more than one examples are helpful. Please send me an e-mail to [email protected] attached with an encrypted zip file.
As I am not yet fully recovered, it may take a time to find a solution.
Osamu,