diamond icon indicating copy to clipboard operation
diamond copied to clipboard

Incorrect self-alignment hit

Open wongkarenhy-hex opened this issue 3 years ago • 3 comments

Code to reproduce (tried with v2.0.6 and v2.0.14):

echo -e ">test_gene\nMAAADGDDSLYPIAVLIDELRNEDVQLRLNSIKKLSTIALALGVERTRSELLPFLTDTIYDEDEVLLALAEQLGTFTTLVGGPEYVHCLLPPLESLATVEETVVRDKAVESLRAISHEHSPSDLEAHFVPLVKRLAGGDWFTSRTSACGLFSVCYPRVSSAVKAELRQYFRNLCSDDTPMVRRAAASKLGEFAKVLELDNVKSEIIPMFSNLASDEQDSVRLLAVEACVNIAQLLPQEDLEALVMPTLRQAAEDKSWRVRYMVADKFTELQKAVGPEITKTDLVPAFQNLMKDCEAEVRAAASHKVKEFCENLSADCRENVIMSQILPCIKELVSDANQHVKSALASVIMGLSPILGKDNTIEHLLPLFLAQLKDECPEVRLNIISNLDCVNEVIGIRQLSQSLLPAIVELAEDAKWRVRLAIIEYMPLLAGQLGVEFFDEKLNSLCMAWLVDHVYAIREAATSNLKKLVEKFGKEWAHATIIPKVLAMSGDPNYLHRMTTLFCINVLSEVCGQDITTKHMLPTVLRMAGDPVANVRFNVAKSLQKIGPILDNSTLQSEVKPILEKLTQDQDVDVKYFAQEALTVLSLA" > test.fasta

diamond makedb --in test.fasta --db test.dmnd
diamond blastp -q test.fasta -d test.dmnd

Output of blastp (only the first 200a.a. got aligned even though this is a self-alignment):

test_gene	test_gene	100.0	201	0	0	1	201	1	201	3.2e-134	392.9

Using --masking 0 does the trick but I think there is a potential bug with soft-masking the diamond db sequences

diamond blastp -q test.fasta -d test.dmnd --masking 0
test_gene	test_gene	100.0	589	0	0	1	589	1	589	0.0e+00	1148.3

As you can see, the sequence alignment at pos 201 ends with ...ELDNV, which corresponds to the beginning of soft-masked bases in the dmnd file:

diamond getseq --db test.dmnd                
>test_gene
MAAADGDDSLYPIAVLIDELRNEDVQLRLNSIKKLSTIALALGVERTRSELLPFLTDTIYDEDEVLLALAEQLGTFTTLVGGPEYVHCLLPPLESLATVEETVVRDKAVESLRAISHEHSPSDLEAHFVPLVKRLAGGDWFTSRTSACGLFSVCYPRVSSAVKAELRQYFRNLCSDDTPMVRRAAASKLGEFAKVLELDNVkseiipmfsnlasdeqdsvrllaveacvniaqllpqedlealvmptlrqaaedkswrvrymvadkftelqkavgpeitktdlvpafqnlmkdceaevraaashkvkefcenlsadcrenvimsqilpcikelvsdanqhvksalasvimglspilgkdntiehllplflaqlkdecpevrlniisnldcvnevigirqlsqsllpaivelaedakwrvrlaiieympllagqlgveffdeklnslcmawlvdhvyaireaatsnlkklvekfgkewahatiipkvlamsgdpnylhrmttlfcinvlsevcgqdittkhmlptvlrmagdpVANVRFNVAKSLQKIGPILDNSTLQSEVKPILEKLTQDQDVDVKYFAQEALTVLSLA

Thanks!

wongkarenhy-hex avatar Jun 16 '22 21:06 wongkarenhy-hex

I'm not sure where you see a bug here. Masking is hard-masking by default, support for soft-masking will be included in the next version.

bbuchfink avatar Jun 21 '22 10:06 bbuchfink

It's unclear to me if that is a potential bug or the expected output of diamond's default behavior. Without using --masking 0, only a third of the sequence (201bp) got aligned even though the query and db sequences are exactly the same (589bp long). I was surprised to see truncated results for self-alignment. When I trimmed the end of the original sequence by ~60bp, the entire sequence got aligned (see below) and perhaps that can be attributed to the differing masking behaviors:

(hxenv) karenwong@karens-mbp hx % echo -e ">test_gene\nMAAADGDDSLYPIAVLIDELRNEDVQLRLNSIKKLSTIALALGVERTRSELLPFLTDTIYDEDEVLLALAEQLGTFTTLVGGPEYVHCLLPPLESLATVEETVVRDKAVESLRAISHEHSPSDLEAHFVPLVKRLAGGDWFTSRTSACGLFSVCYPRVSSAVKAELRQYFRNLCSDDTPMVRRAAASKLGEFAKVLELDNVKSEIIPMFSNLASDEQDSVRLLAVEACVNIAQLLPQEDLEALVMPTLRQAAEDKSWRVRYMVADKFTELQKAVGPEITKTDLVPAFQNLMKDCEAEVRAAASHKVKEFCENLSADCRENVIMSQILPCIKELVSDANQHVKSALASVIMGLSPILGKDNTIEHLLPLFLAQLKDECPEVRLNIISNLDCVNEVIGIRQLSQSLLPAIVELAEDAKWRVRLAIIEYMPLLAGQLGVEFFDEKLNSLCMAWLVDHVYAIREAATSNLKKLVEKFGKEWAHATIIPKVLAMSGDPNYLHRMTTLFCINVLSEVCGQDITTKHMLPTVL" > test.fasta 
(hxenv) karenwong@karens-mbp hx % diamond makedb --in test.fasta --db test.dmnd >&/dev/null
(hxenv) karenwong@karens-mbp hx % ./bin/diamond/mac/diamond blastp -q test.fasta -d test.dmnd    
test_gene	test_gene	100.0	526	0	0	1	526	1	526	2.0e-155	442.6

Thanks for looking into this and really appreciate your help!

wongkarenhy-hex avatar Jun 21 '22 16:06 wongkarenhy-hex

Hard masking also applies to identical self alignments. Whether that makes sense or not is debatable, but it is not a bug.

bbuchfink avatar Jun 23 '22 10:06 bbuchfink