jvarkit

biostar94573 and multiple sequence alignments

Open ntchuang opened this issue 10 years ago • 4 comments

Maybe the MAFFT output isn't in the format your tool expects, but I am not getting correct-looking results. Can you look at what MAFFT outputs here: http://mafft.cbrc.jp/alignment/server/spool/_out151218093135893D4OpNAX8jGoYH7Tx2bF0C.html

It looks similar to your Clustal sample output, but without the conservation notation at the end of each segment. I even tried their FASTA format with hyphens for gaps, but it gave the same-looking output.

ntchuang, Dec 18 '15 16:12

I don't think there is a problem: some of your sequences have a very large deletion.

>4:98103819
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-------------------------------------agctttgaagagagcagtggttc
tcccaggacgcagctggagatctgagaacggg----cagactgcctcctcaagtgggtcc
ctgactcctgacccccgagcagcctaactgggaggca-cccccagcaggggcaca-----
--ctgacacctcacacggcagggtattccaacagacctgcagctgagggtcctgtctgtt

The program tries to merge the indels found at the same position: the more large indels there are, the larger the deletions you'll get in the VCF.

lindenb, Dec 18 '15 17:12

Hi Pierre,

Thanks for the quick reply. Maybe I don't understand how it decides what should be an entry in the VCF. Since all I provided was a multiple sequence alignment, it probably does not know what REF is? I was hoping it would call variants for anything that was not 100% conserved. My output for that Clustal file only has variants from position 2139 to 3836. Just looking quickly, there should be tons of deletions called from the beginning?

My command was: java -jar biostar94573.jar mafft.aln

Thanks!

ntchuang, Dec 18 '15 17:12
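
For reference, the expectation in the comment above ("call variants for anything that was not 100% conserved") can be sketched as a per-column check. This is only an illustration of that expectation, not what biostar94573 does; the function name `variable_columns` and the toy alignment are made up for the example, and gaps are simply treated as one more character.

```python
def variable_columns(rows):
    """Return 0-based column indexes where the aligned rows do not all agree.

    Hypothetical helper: rows are equal-length aligned sequences, and '-'
    is compared like any other character.
    """
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned rows must have the same length")
    return [
        col
        for col in range(width)
        if len({row[col].upper() for row in rows}) > 1
    ]


# Toy alignment (not taken from the MAFFT output in this thread).
rows = ["ACGT-A", "ACGTTA", "ACCTTA"]
print(variable_columns(rows))  # [2, 4]
```

With long terminal gaps like the ones in this alignment, such a check would flag nearly every column, which is one way to see why a per-column view and a merged-indel view give very different variant counts.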

The program scans from 5' to 3' and, for deletions, searches for '-'. For a given deletion, as long as there are still some '-' at the same position, the program extends the size of the current variation. Your file has a deletion at almost every position: that is why you get only a few variants...

lindenb, Dec 18 '15 17:12
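
The extension behaviour described above can be sketched roughly as follows. This is a minimal illustration written from the description in this thread, not jvarkit's actual code: the function name `merged_gap_blocks` and the toy alignment are assumptions. It scans columns left to right and keeps extending the current block for as long as at least one row has a '-' in the column.

```python
def merged_gap_blocks(rows):
    """Return half-open (start, end) column ranges where any row has '-'.

    Hypothetical sketch: adjacent gap-containing columns are merged into a
    single block, mimicking "extend the current variation while there is
    still a '-' at this position".
    """
    width = len(rows[0])
    blocks = []
    start = None  # start column of the block currently being extended
    for col in range(width):
        has_gap = any(row[col] == "-" for row in rows)
        if has_gap and start is None:
            start = col                   # open a new block
        elif not has_gap and start is not None:
            blocks.append((start, col))   # close the current block
            start = None
    if start is not None:
        blocks.append((start, width))     # block runs to the end
    return blocks


# Toy alignment (not taken from the MAFFT output in this thread).
rows = [
    "----ACGTAC--GT",
    "AC--ACGTACGTGT",
    "ACGTAC-TACGT--",
]
print(merged_gap_blocks(rows))  # [(0, 4), (6, 7), (10, 14)]
```

Nine gap-containing columns collapse into three blocks here. When gaps overlap across almost the whole alignment, the blocks become very long and very few, which matches the small number of variants reported.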

An idea: '-' is interpreted as a deletion. Try replacing the leading and trailing '-' with spaces.

lindenb, Dec 18 '15 23:12
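
One possible way to try that suggestion on a FASTA-formatted alignment is sketched below. The helper names `mask_terminal_gaps` and `mask_fasta` are made up for this example, and whether the tool handles the space-padded result as hoped has not been verified here. Because the sequences are wrapped over several lines, each record is joined before the terminal gaps are replaced, then re-wrapped.

```python
def mask_terminal_gaps(aligned_seq, fill=" "):
    """Replace leading and trailing '-' with `fill`; internal gaps are kept."""
    body = aligned_seq.lstrip("-")
    lead = len(aligned_seq) - len(body)   # number of leading gaps
    core = body.rstrip("-")
    trail = len(body) - len(core)         # number of trailing gaps
    return fill * lead + core + fill * trail


def mask_fasta(lines, width=60):
    """Mask terminal gaps in every record of a wrapped FASTA alignment.

    `lines` is any iterable of text lines; returns a list of output lines
    without newline characters.
    """
    records = []  # each entry is [header, [sequence chunks]]
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            records.append([line, []])
        elif line and records:
            records[-1][1].append(line)
    out = []
    for header, chunks in records:
        seq = mask_terminal_gaps("".join(chunks))
        out.append(header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return out


print(repr(mask_terminal_gaps("--AC-GT---")))  # '  AC-GT   '
```

Note that the masked lines end in significant spaces, so the output should be written with a tool that does not strip trailing whitespace.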