spaln icon indicating copy to clipboard operation
spaln copied to clipboard

Extremely long introns.

Open xiekunwhy opened this issue 2 years ago • 6 comments

Hi,

I found some extremly long introns in spaln results (when mapping protein sequences to genome), some are ~1Mb, is there a way to limit intron length when running splan?

image

Best, Kun

xiekunwhy avatar May 31 '22 08:05 xiekunwhy

Dear Kun,

According to Refseq, the longest intron of human is longer than 1.2Mbp, so that an intron longer than 1Mb might not be illegal. However, very long introns can be an artefact, as in general, it is difficult to identify the first or last few introns when the corresponding coding exon is not long enough or highly divergent. Alternatively, your genome might have structural variation at the relevant region. Even in such cases, spaln try to find the most plausible ones up or down to IntronPrm.maxl bp apart from the end of the currently aligned position, which might cause the artefacts. You may have an idea about what happens by individually running spaln -O1 to see the alignment of a problematic case.

IntronPrm.maxl is defined as the 99% quantile of the intron-length distribution specified by the -T option, but currently it cannot be directly modulated by a command line option. Instead, -XQq can control the block level extension (q also specifies the q quantile of the intron-length distribution) from the core blocks (having block scores greater than a threshold). For example, you may try -XQ90 to reduce the extension of the genomic area outside the core region at the risk that true long introns may be missed.

Osamu,

ogotoh avatar Jun 03 '22 05:06 ogotoh

Dear Osamu,

Thank you for your reply, and I will try -XQ option to see what happen.

Best regards, Kun

xiekunwhy avatar Jun 09 '22 03:06 xiekunwhy

Hi Osamu,

How to find IntronPrm.maxl file? I want to edit 99% to a smaller value, some intron length is still abnormal.

Here are some statistics about mRNA and intron length generated by spaln.

Zja is a monocot plant I want to annotate. zja.evigene.okay.aa is amino acid sequences file (okay set) generated by evigene pipeline (http://arthropods.eugenes.org/EvidentialGene/evigene/, the input est set is RNA-bloom assembly), this aa set is good enough according to BUSCO results.

gffread -V means delete mRNAs which with in-frame stop codon.

In the following table, Zja.p1 means: spaln -t10 -M4 -Q7 -O0 -LS -ya012 -yX1 -d spaln zja.evigene.okay.aa > zja.evigene.okay.aa1.gff3 + gffread -V Zja.p2 means: spaln -t10 -XQ90 -M4 -Q7 -O0 -LS -ya012 -yX1 -d spaln zja.evigene.okay.aa > zja.evigene.okay.aa2.gff3 + gffread -V Zja.p3 means: spaln -t10 -XQ80 -M4 -Q7 -O0 -LS -ya012 -yX1 -d spaln zja.evigene.okay.aa > zja.evigene.okay.aa3.gff3 + gffread -V

image

As we can see, XQ90 worked, but the average intron length is still unacceptable comparing with other monocot plant.

The phenomenon is not only occured in monocot plants, but also in dicot plants and animals.

I want to use spaln results to train augustus, so I need to get more accurately results. Do you have some more suggestions?

Best, Kun

xiekunwhy avatar Jun 22 '22 13:06 xiekunwhy

Dear Kun,

Sorry for the long period of not responding to your comment.

I have just updated a new version Ver2.4.12 (by mistakes I have already used Ver2.4.10 and Ver2.4.11, so that the version number is discontinuous). By carefully reexamining the codes, I found several potential causes of extremely long introns you reported. The most problematic was that the previous version tried to find remote terminal (first or last) exons without seeking for closer candidates. In the new version, this problem has been fixed. I have also remedied several problematic changes introduced in the course of version ups. As a whole, the new version successfully reduces the error rates upon several test runs. I hope these modification help you.

Although not explicitly documented, this version allows to read binary form of parameter files, such as Splice5.dat, CodePotTab.dat. In the next release, I will upload C++ programs and scripts to generate these files and to convert between text and binary files, when the genomic sequence and a set of transcript sequences are given.

Osamu,

ogotoh avatar Jul 20 '22 01:07 ogotoh

Dear Osamu,

Thank you very much, I will try Ver2.4.12 and feadback results.

Best, Kun

xiekunwhy avatar Jul 20 '22 02:07 xiekunwhy

Hi Osamu,

In fact, there is another problem according to above table, the single CDS gene number of spaln results is larger than expected even when the input protein sequences are not fragemented.

Best, Kun

xiekunwhy avatar Jul 22 '22 08:07 xiekunwhy