Segmentation fault when using --allow-deletion with result2msa
Expected Behavior
I expected result2msa to write the MSA files. Instead, I get a segmentation fault.
Current Behavior
I get a segmentation fault when I add the --allow-deletion flag. The same command works when I omit the flag.
Steps to Reproduce (for bugs)
Please make sure to execute the reproduction steps with newly recreated and empty tmp folders.
Here is the command I was running using a database of 5 pdb files with 8 total chains:
foldseek result2msa foldseek_DBs/db foldseek_DBs/db interm/aln msa/msa --msa-format-mode 6 --allow-deletion
Foldseek Output (for bugs)
result2msa foldseek_DBs/db foldseek_DBs/db interm/aln msa/msa --msa-format-mode 6 --allow-deletion
MMseqs Version: bb090174ab59557ff9ffc874598f4c3904f55bc6
Substitution matrix aa:3di.out,nucl:3di.out
Gap open cost aa:10,nucl:10
Gap extension cost aa:1,nucl:1
Allow deletions true
Compositional bias 1
Compositional bias 1
MSA format mode 6
Summary prefix cl
Skip query false
Filter MSA 0
Use filter only at N seqs 0
Maximum seq. id. threshold 0.9
Minimum seq. id. 0.0
Minimum score per column -20
Minimum coverage 0
Select N most diverse seqs 1000
Preload mode 0
Threads 128
Compressed 0
Verbosity 3
Query database size: 8 type: Aminoacid
Target database size: 8 type: Aminoacid
./run_hada_msa.sh: line 3: 2311279 Segmentation fault (core dumped) foldseek result2msa foldseek_DBs/db foldseek_DBs/db interm/aln msa/msa --msa-format-mode 6 --allow-deletion
Context
Providing context helps us come up with a solution and improve our documentation for the future.
Your Environment
Include as many relevant details about the environment you experienced the bug in.
- Installed via conda: conda install bioconda::foldseek bioconda::mmseqs2
- Ubuntu OS x86_64
- 2 AMD EPYC 7763 64-Core Processor
- 512 GB RAM
Also, if you could provide an in-depth explanation of what --allow-deletion does, I would very much appreciate it! In the past I have gotten mixed results when using it with MMseqs2, and I am not sure exactly when and how to use it there.
From my understanding, it allows deletions in the query sequence, i.e. it adds gaps ("-") to the query sequence. I am trying to use it to get better column/residue alignments while preserving insertions in the other sequences in the MSA. However, this does not work. Currently, my solution/hack is to post-process the output and delete the insertions, so that I have a consistent alignment down each column/residue.
It would be great if I could allow deletions in the query sequence instead, so that I do not have to post-process the a3m file. Ideally, I would not have to delete any insertions and would still have every column/residue aligned.
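One way to picture the output described above (every column aligned, insertions kept, gaps in the query where other sequences have insertions) is to expand the A3M rows after the fact instead of relying on --allow-deletion. The following is only a minimal sketch under the usual A3M convention that lowercase letters are insertions relative to the query; `expand_a3m` is a hypothetical helper name, not part of Foldseek or MMseqs2, and it takes sequence rows only (no header lines).

```python
def expand_a3m(rows):
    """Expand A3M sequence rows into equal-length aligned rows.

    Assumes lowercase letters are insertions relative to the query and
    that uppercase letters and '-' occupy match columns.
    """
    parsed = []
    for row in rows:
        inserts, matches, pending = [], [], ""
        for ch in row:
            if ch.islower():
                pending += ch           # insertion before the next match column
            else:
                inserts.append(pending)
                matches.append(ch)      # residue or '-' in a match column
                pending = ""
        inserts.append(pending)         # insertion after the last match column
        parsed.append((inserts, matches))

    n_cols = len(parsed[0][1])
    if any(len(matches) != n_cols for _, matches in parsed):
        raise ValueError("rows have different numbers of match columns")

    # Widest insertion observed in each of the n_cols + 1 insertion slots.
    widths = [max(len(ins[i]) for ins, _ in parsed) for i in range(n_cols + 1)]

    expanded = []
    for inserts, matches in parsed:
        parts = []
        for i in range(n_cols):
            parts.append(inserts[i].ljust(widths[i], "-"))  # pad shorter insertions
            parts.append(matches[i])
        parts.append(inserts[n_cols].ljust(widths[n_cols], "-"))
        expanded.append("".join(parts))
    return expanded


if __name__ == "__main__":
    # The query is the first row; the third row has one inserted residue.
    for row in expand_a3m(["ACDE", "AC-E", "ACgDE"]):
        print(row)
```

In this toy input the query row gains a gap at the position where the third row carries an insertion, so all rows end up with the same length and no residues are deleted.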
Let me know if I could help in any way!
Danny
Essentially, it would be amazing if you could help me figure out how to output an a3m file where the query sequence is in the qaln format and the aligned sequences are in the taln format (referencing foldseek easy-search --format-output).
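As a rough illustration of that idea, a pairwise alignment can be turned into one A3M row by walking qaln and taln together. This is only a sketch: `pairwise_to_a3m_row` is a hypothetical helper, and it assumes that qaln and taln are gapped strings of equal length and that a 1-based query start and the query length (for example qstart and qlen) are exported alongside them. It does not reproduce what result2msa does internally.

```python
def pairwise_to_a3m_row(qaln, taln, qstart, qlen):
    """Build one A3M row for a target from a pairwise alignment.

    qaln, taln: gapped alignment strings of equal length.
    qstart: 1-based position in the query where the alignment starts.
    qlen: full length of the query sequence.
    """
    if len(qaln) != len(taln):
        raise ValueError("qaln and taln must have the same length")

    row = ["-"] * (qstart - 1)          # query residues before the alignment
    for q, t in zip(qaln, taln):
        if q == "-":
            row.append(t.lower())       # no query column here: insertion
        else:
            row.append(t.upper())       # match column ('-' stays '-')

    aligned_query_residues = sum(1 for q in qaln if q != "-")
    trailing = qlen - (qstart - 1) - aligned_query_residues
    row.extend("-" * trailing)          # query residues after the alignment
    return "".join(row)


if __name__ == "__main__":
    # Query of length 6, alignment covering query positions 2 to 5.
    print(pairwise_to_a3m_row("CD-EF", "CDWE-", qstart=2, qlen=6))
```

Every row built this way has exactly qlen match columns, so the rows stack into an A3M under the ungapped query sequence.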
I would recommend against using --allow-deletion. It was never fully implemented and can easily overflow memory and crash. I think we allocate 2x the memory if allow-deletion is activated, but the MSA can grow much beyond 2x length. However, figuring this out correctly is a bit finicky and we never really needed this internally.
I don't think we have a good solution, except to continue post-processing.
However, isn't your post-processing step essentially just removing all lowercase letters? They indicate gaps in all other sequences in the A3M format.
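A minimal sketch of that post-processing step, assuming a plain-text A3M in which header lines start with ">" and lowercase letters in sequence lines are insertions. `strip_insertions` is a hypothetical helper name, not a Foldseek or MMseqs2 function.

```python
def strip_insertions(a3m_text):
    """Remove lowercase (insertion) characters from every sequence line."""
    cleaned = []
    for line in a3m_text.splitlines():
        if line.startswith(">"):
            cleaned.append(line)        # keep header lines untouched
        else:
            cleaned.append("".join(ch for ch in line if not ch.islower()))
    return "\n".join(cleaned)


if __name__ == "__main__":
    example = ">query\nACDE\n>target\nACgD-\n"
    print(strip_insertions(example))
```

After this step every sequence line has the same length as the query, at the cost of dropping the inserted residues.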
Is this also true for --allow-deletion in MMseqs2, or just Foldseek?
Yeah, we just remove the lowercase letters, or do 1-1 alignments to have indels in both the query and target sequences.
(In reply to Milot Mirdita's comment above, posted Fri, Jun 7, 2024 at 11:20 PM.)
The same code is run for both.