Feature Request: Add --save_msa flag to export MSA files
Problem Description
When using Boltz-2 with the --use_msa_server flag, the tool automatically generates MSAs via the ColabFold MSA server. However, these MSAs are only used internally for structure prediction and are not saved to disk. This creates several challenges:
- Lost computational investment: MSA generation can take significant time, but the results are discarded after prediction
- No reproducibility: Users cannot inspect, verify, or reuse the MSAs that were actually used in their structure predictions
- Inefficient workflows: Running multiple predictions with the same sequences requires regenerating MSAs each time
- Limited debugging: When predictions fail or produce unexpected results, users cannot examine the MSA quality or coverage
Proposed Solution
Add a --save_msa flag that saves all MSA files used during prediction (whether server-generated or user-provided) to the output directory in standard a3m format.
Example Usage
# Save MSAs when using the MSA server
boltz predict input.yaml --use_msa_server --save_msa
# Save MSAs when using local files (for consistency/archival)
boltz predict input_with_msas.yaml --save_msa
Actually, the MSAs do get dumped, though that's not really documented. If you look in the output folder, there is a folder named msa.
A somewhat related question: I figured out how to compute a custom MSA, but how do you deal with templates? Usually the ColabFold server handles that part, but Boltz asks for a specific .pdb file as the template. I'm just not sure how to deal with that. Thank you!
I saw the folder, but it's empty after my runs. I'm using boltz 2.0.3.
Were you running with the MSA server or with a custom MSA when you ended up with an empty folder? I have verified that the folder is non-empty. Please share a way for me to replicate!
@bio-rat we do not support automated templating. The goal of our templating logic is to allow people to use the specific templates they want. Automated templating is something we considered, but in our opinion it doesn't have as strong a use case (no major performance gains are expected from adding templates that the model has seen during training).
Never mind, I was using the YAML setting msa: empty, which is why I didn't get the MSA in the output. Without it, I now get:
$ ls -R boltz_results_test/msa
boltz_results_test/msa:
test_0.csv test_unpaired_tmp_env
boltz_results_test/msa/test_unpaired_tmp_env:
bfd.mgnify30.metaeuk30.smag30.a3m msa.sh out.tar.gz pdb70.m8 uniref.a3m
Which of those a3m files can I now use to resubmit the job for the same protein and a different ligand without using the MSA server again? Or do I have to submit to the MSA server for each different ligand? Here is my current YAML file:
$ cat test.yaml
version: 1
sequences:
  - protein:
      id: A
      sequence: MSPECARAAGDAPLRSLEQANRTRFPFFSDVKGDHRLVLAAVETTVLVLIFAVSLLGNVCALVLVARRRRRGATACLVLNLFCADLLFISAIPLVLAVRWTEAWLLGPVACHLLFYVMTLSGSVTILTLAAVSLERMVCIVHLQRGVRGPGRRARAVLLALIWGYSAVAALPLCVFFRVVPQRLPGADQEISICTLIWPTIPGEISWDVSFVTLNFLVPGLVIVISYSKILQITKASRKRLTVSLAYSESHQIRVSQQDFRLFRTLFLLMVSFFIMWSPIIITILLILIQNFKQDLVIWPSLFFWVVAFTFANSALNPILYNMTLCRNEWKKIFCCFWFPEKGAILTDTSVKRNDLSIISG
  - ligand:
      id: B
      smiles: 'O=S1(=O)C2=CC=CC=C2CN1C1=CC=CC(OC2=CC=CC=C2)=C1'
properties:
  - affinity:
      binder: B
I suspect that I just need to concatenate uniref.a3m and bfd.mgnify30.metaeuk30.smag30.a3m and use the resulting file as the local MSA. Could you please confirm if that's the case?
I would love to know the answer to this. I'm sitting in my own directory looking at both of these .a3m files, wondering if I can make use of them somehow.
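One possible way to approach the "same protein, different ligand" case, sketched under assumptions: the msa folder listed above already contains a combined test_0.csv next to the raw .a3m files, so it may be simpler to point the protein's msa field at that CSV than to concatenate the .a3m files by hand. This assumes the msa field accepts a path to such a CSV, which is not confirmed in this thread and should be checked against the Boltz documentation. The helper name write_ligand_inputs and the file layout are hypothetical.

```python
from pathlib import Path

# Template mirroring the YAML posted above, plus an `msa:` entry.
# Assumption: Boltz accepts a path to the cached CSV in the `msa` field.
YAML_TEMPLATE = """\
version: 1
sequences:
  - protein:
      id: A
      sequence: {sequence}
      msa: {msa_path}
  - ligand:
      id: B
      smiles: '{smiles}'
properties:
  - affinity:
      binder: B
"""


def write_ligand_inputs(sequence, msa_path, ligands, out_dir):
    """Write one input YAML per ligand, all reusing the same cached MSA file.

    `ligands` maps a file stem (hypothetical naming) to a SMILES string.
    Returns the list of written paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, smiles in ligands.items():
        path = out_dir / f"{name}.yaml"
        path.write_text(
            YAML_TEMPLATE.format(sequence=sequence, msa_path=msa_path, smiles=smiles)
        )
        written.append(path)
    return written
```

Each generated file could then be passed to boltz predict without --use_msa_server, provided the assumption about the msa field holds.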
Earlier I opened PR #461, which allows you to generate the MSA output without having to download or run the full Boltz model. Behind the scenes, it still relies on the msa_server, so it's not suitable for sensitive data. But it can still be useful for generating MSA files in a more efficient and lightweight way, and it produces the exact same output as boltz does during its prediction calculations.
As for how the concatenation works: it's handled by the compute_msa function in main.py. You can find the relevant snippet here:
def compute_msa(
    data: dict[str, str],
    target_id: str,
    msa_dir: Path,
    msa_server_url: str,
    msa_pairing_strategy: str,
) -> None:
    """Compute the MSA for the input data.

    Parameters
    ----------
    data : dict[str, str]
        The input protein sequences.
    target_id : str
        The target id.
    msa_dir : Path
        The msa directory.
    msa_server_url : str
        The MSA server URL.
    msa_pairing_strategy : str
        The MSA pairing strategy.

    """
    if len(data) > 1:
        paired_msas = run_mmseqs2(
            list(data.values()),
            msa_dir / f"{target_id}_paired_tmp",
            use_env=True,
            use_pairing=True,
            host_url=msa_server_url,
            pairing_strategy=msa_pairing_strategy,
        )
    else:
        paired_msas = [""] * len(data)

    unpaired_msa = run_mmseqs2(
        list(data.values()),
        msa_dir / f"{target_id}_unpaired_tmp",
        use_env=True,
        use_pairing=False,
        host_url=msa_server_url,
        pairing_strategy=msa_pairing_strategy,
    )

    for idx, name in enumerate(data):
        # Get paired sequences
        paired = paired_msas[idx].strip().splitlines()
        paired = paired[1::2]  # ignore headers
        paired = paired[: const.max_paired_seqs]

        # Set key per row and remove empty sequences
        keys = [idx for idx, s in enumerate(paired) if s != "-" * len(s)]
        paired = [s for s in paired if s != "-" * len(s)]

        # Combine paired-unpaired sequences
        unpaired = unpaired_msa[idx].strip().splitlines()
        unpaired = unpaired[1::2]
        unpaired = unpaired[: (const.max_msa_seqs - len(paired))]
        if paired:
            unpaired = unpaired[1:]  # ignore query is already present

        # Combine
        seqs = paired + unpaired
        keys = keys + [-1] * len(unpaired)

        # Dump MSA
        csv_str = ["key,sequence"] + [f"{key},{seq}" for key, seq in zip(keys, seqs)]
        msa_path = msa_dir / f"{name}.csv"
        with msa_path.open("w") as f:
            f.write("\n".join(csv_str))
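To make the combining step easier to follow, here is a minimal standalone sketch of the same logic applied to plain a3m strings instead of server output. The function name combine_msa_rows is hypothetical, and the two default limits are placeholders standing in for const.max_paired_seqs and const.max_msa_seqs, whose actual values are not shown in this thread.

```python
def combine_msa_rows(paired_a3m, unpaired_a3m, max_paired_seqs=8192, max_msa_seqs=16384):
    """Return (key, sequence) rows the way the loop body above builds them.

    Default limits are placeholder assumptions, not values confirmed here.
    """
    # Every second line of an a3m text is a sequence; the others are headers.
    paired = paired_a3m.strip().splitlines()[1::2][:max_paired_seqs]

    # The key is the row's position in the paired MSA; all-gap rows are dropped.
    keys = [i for i, s in enumerate(paired) if s != "-" * len(s)]
    paired = [s for s in paired if s != "-" * len(s)]

    unpaired = unpaired_a3m.strip().splitlines()[1::2]
    unpaired = unpaired[: max_msa_seqs - len(paired)]
    if paired:
        # The query already leads the paired block, so skip its second copy.
        unpaired = unpaired[1:]

    # Unpaired rows all get key -1.
    return list(zip(keys + [-1] * len(unpaired), paired + unpaired))


# Tiny example: one all-gap paired row is removed, the query appears once.
rows = combine_msa_rows(
    ">q\nAAA\n>h1\n---\n>h2\nA-A\n",
    ">q\nAAA\n>u1\nAA-\n",
)
print(rows)  # [(0, 'AAA'), (2, 'A-A'), (-1, 'AA-')]
```

For a single-chain input such as the YAML earlier in the thread, the paired text is empty, so every row ends up with key -1 and the query is kept as the first unpaired row.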