spaln icon indicating copy to clipboard operation
spaln copied to clipboard

negative coordinates in spaln output gff

Open vicenciooostra opened this issue 3 years ago • 14 comments

Hi, many thanks for making and maintaining this great program!

I've run spaln to annotate a fragmented de novo genome assembly using a reference protein set from a related species. It's a great tool! It runs successfully without error messages. However, when I parsed the output gff file, I noticed a problem with one entry having negative coordinates: ##sequence-region scaffold44455 1 2094 scaffold44455 ALN gene -7780 -10 640 - . ID=gene22400;Name=scaffold44455_-3 scaffold44455 ALN mRNA -7780 -10 640 - . ID=mRNA22400;Parent=gene22400;Name=scaffold44455_-3 scaffold44455 ALN cds -40 -10 82 - 0 ID=cds168197;Parent=mRNA22400;Name=scaffold44455_-3;Target=BANY.1.2.t24108 1 12 + scaffold44455 ALN cds -135 -116 17 - 2 ID=cds168198;Parent=mRNA22400;Name=scaffold44455_-3;Target=BANY.1.2.t24108 13 19 + scaffold44455 ALN cds -285 -209 32 - 0 ID=cds168199;Parent=mRNA22400;Name=scaffold44455_-3;Target=BANY.1.2.t24108 20 43 + ... etc The same negative coordinates appear in the other output formats I used (O3, O4, O7).

I checked all other genes, and this is the only one with that problem. All other >24k genes give valid output. EDIT: actually it's 55 genes (made a mistake when checking)

For this particular gene, the scaffold is 2095 bp long, and the protein is 879 AA. Could this problem occur because the scaffold is too short, i.e. shorter than the protein it matches to?

This is the command I used: de_novo_assembly_file=de_novo_assembly_fna_single_lines.mfa protein_file=reference_species_whole_genome_proteins_fa_single_lines.faa spaln -Q7 -O0,4,7,3 -S3 -M1 -t7 -d $de_novo_assembly_file $protein_file -o $outfolder > $outfile

Any help would be much appreciated!

Kind regards, Vicencio

vicenciooostra avatar Dec 11 '20 09:12 vicenciooostra

Could this problem occur because the scaffold is too short, i.e. shorter than the protein it matches to?

Thank you for your report. Yes, that can happen when you don’t set –LS (local similarity) option. By default (without -LS), spaln attempts semi-global alignment between the genomic and the query sequences, where semi-global means unaligned parts at the ends of the genomic sequence are ignored. I haven’t so far much examined what happen when genomic sequence < 3 * protein. Probably some filtering or automatic option changes might be necessary. I want to address this problem in future releases.

At this moment, please retry spaln with -LS option for the genes with negative coordinates.

Osamu,

ogotoh avatar Dec 14 '20 02:12 ogotoh

Hi Osamu,

many thanks for your prompt reply, and the explanation.

I've tried as you suggested, but again I get genes with negative coordinates. This was my command: spaln -Q7 -LS -O0,4,7,3 -S3 -M1 -t7 -d $de_novo_assembly_file $protein_file -o $outfolder > $outfile Instead of 55 genes with negative coordinates it gave 53. I've also tried changing -M1 to -M4, this actually produced more genes with negative coordinates (123), which makes sense as it produces more loci overall.

Next I will try to pre-filter my de novo genome assembly, which is highly fragmented. I was thinking of removing any contigs <1000 bp, even though there are many of them and I'd rather keep them in. Will post back with results.

Kind regards, Vicencio

vicenciooostra avatar Dec 14 '20 17:12 vicenciooostra

Dear Vicencio,

Thanks, I am looking forward to your continued report. If possible, could you kindly send me a few examples for which spaln produced negative coordinates?

Osamu,

ogotoh avatar Dec 15 '20 05:12 ogotoh

Hi Osamu,

I repeated the same spaln command I ran above (with -Q7 -LS -M4) and again I get alignments with negative coordinates. I am sending you genomic scaffold, protein sequence, and gff output for 2 of these examples via email (hope that's OK). The first example is from a long scaffold, the second one from a short one.

To be honest I am not sure how to proceed. It only happens in less than 200 cases (out of >47k alignments) so I think I will just remove these alignments from my gff. Do you think that is sensible?

Kind regards, Vicencio

vicenciooostra avatar Dec 16 '20 16:12 vicenciooostra

Dear Vicencio,

Many thanks. I got your examples that I just wanted. I will see what is happening to improve spaln's performance and also to satisfy my curiosity.

Do you think that is sensible?

Of course! I think future modifications might automate such procedures.

Osamu,

ogotoh avatar Dec 17 '20 00:12 ogotoh

Dear Osamu,

many thanks for your kind help.

After removing the alignments with negative coordinates from the gff / .O0 file, I discovered another set of invalid alignments: the coordinate of the alignment in the scaffold extended beyond the length of the scaffold. For one alignment, there was a genomic coordinate 423622 for a gene on a scaffold, but this scaffold was only 418k in length. I encountered this problem in several 3 alignments but haven't made a systematic search, so I don't know how large the problem is. This was my code: spaln -Q7 -LS -O0,4,7,3 -S3 -M4 -t4 -d $de_novo_assembly_file $protein_file -o $outfolder > $outfile I am using spaln 2.4.03.

If you want I can again send you an example (or the whole data). For now, I will simply try to filter these invalid gene models out of the gff file. I hope there are not too many.

But in general it would be useful to know what is causing this behaviour. My goal is to annotate a fragmented, Illumina-only de novo assembly based on reference gene models from another (related) species, so I can fish out the protein coding sequences from the fragments. Previous runs using Spaln 2.1.4 indicated that I could recover >80% of my reference proteins at >90% coverage, so that was very encouraging. I updated to 2.4.03 because for some reason the program crashed for one of my assemblies.

Anyway, thanks again for your help! Kind regards, Vicencio

vicenciooostra avatar Dec 18 '20 14:12 vicenciooostra

I have gotten the possible reason why spaln reports negative or out-of-range coordinates. The hint comes from your example .O0 files. In example1_gff3.O0, for example, the scaffold name of each ‘exon line’ is scaffold00392 but the bottom line indicates that the sequence-region comes from another scaffold scaffold00432. Most likely the coordinate number is counted on the bases of scaffold00432 rather than the correct scaffold00392. Unfortunately, I cannot confirm this hypothesis, as I haven’t the genome assembly encompassing these two scaffolds.

Spaln version 2.4.0x somewhat differs from the current version 2.4.1x in extracting the relevant genome segment from the whole assembly. You may try the newer version. Additionally, I strongly suggest that you should set proper -T option for good performance of Spaln. As your species is likely to be Lepidoptera, I suggest you to use -T bombmori, a known parameter set of a representative Lepidoptera, Bombyx mori.

I hope these suggestions help your work.

Have a happy holiday season!!

Osamu,

ogotoh avatar Dec 19 '20 08:12 ogotoh

Hi Osamu,

thanks very much for this! I hadn't explored that -T parameter. Indeed I work with butterflies but since it's a non-model I didn't bother. Will use Bombyx mori as suggested - great that you have it included.

I actually think I made a mistake when I sent you those .O0 files - instead of including the first line which starts with ## sequence-region I included the last one, which I think refers to the next scaffold. In my complete .O0 file there's first a line with ## sequence-region and then the scaffold name, and then the alignments on that scaffold.

In any case I will use the newer version, in conjunction with the -T parameter.

Happy & restive holidays to you!

All the best, Vicencio

vicenciooostra avatar Dec 23 '20 09:12 vicenciooostra

Dear Vicencio,

Please wait one or two days before you install a new version.

Using a WGS genome and Swiss-Prot aa sequences, I found a few cases in which negative coordinates appear with version 2.4.04. This problem appears to have been fixed in version 2.4.11. However, I found another problem that causes segmentation faults in rare cases. I have already figured out the causes of this issue, and am going to release a modified version after further examinations.

Osamu,

ogotoh avatar Dec 24 '20 00:12 ogotoh

Hi Osamu,

many thanks for this, that's great news.

Version 2.4.11 gave similar negative or invalid coordinates as earlier versions.

I've just downloaded 2.4.2, but unfortunately it failed to compile. I've copied the error messages here: https://pastebin.com/etCUpuup

My commands were as follows, they worked fine for previous version of spaln (2.4.11). sudo wget https://github.com/ogotoh/spaln/archive/ver.2.4.2.tar.gz sudo tar xfz ver.2.4.2.tar.gz cd spaln-ver.2.4.2/src sudo ./configure --table_dir=/media/disk4/data/projects/databases/spaln/table/ --alndbs_dir=/media/disk4/data/projects/databases/spaln/seqdb/ sudo make

Kind regards, Vicencio

vicenciooostra avatar Dec 30 '20 09:12 vicenciooostra

Dear Vicencio,

Thank you very much for your report. Upon uploading, I missed to update several files. I’ve just uploaded revised version of Spaln2.4.2.

Osamu,

ogotoh avatar Dec 31 '20 00:12 ogotoh

Dear Osamu,

Apologies for the lapse in reports. I've tried again with 2.4.2, and the problem persists: negative coordinates, and coordinate that fall outside the range of the gene. This was the command: spaln -Q7 -LS -T helimelp -O0,4,7,3 -S3 -M4 -t8 -d $de_novo_assembly $reference_proteome >> $logfile 2>&1

This problem occurs ca. 200 times (out of ca. 45k reference proteins), so it's a small problem. Right now I've moved on by just filtering these bad lines out of the gff and proceeding with the rest. So I'm just writing to update you, perhaps you can have another look for a future version update.

Kind regards, Vicencio

vicenciooostra avatar Feb 05 '21 10:02 vicenciooostra

Dear Vicencio,

I've tried again with 2.4.2, and the problem persists:

I am somewhat sad to hear this, but am glad that spaln seems to be useful for your study. I will continue to make efforts to improve performance and reduce errors of spaln. When I get a solution, I will report you as soon as possible. I thank you again for your series of reports on spaln.

Best regards,

Osamu

ogotoh avatar Feb 08 '21 23:02 ogotoh

Well, thank you for making and maintaining and providing support for this great software! It is very useful, and the number of errors is small compared to the total number of alignments!

vicenciooostra avatar Feb 10 '21 11:02 vicenciooostra