Feature Request: Add --save_msa flag to export MSA files

Open cbologa opened this issue 6 months ago • 7 comments

Problem Description

When using Boltz-2 with the --use_msa_server flag, the tool automatically generates MSAs via the ColabFold MSA server. However, these MSAs are only used internally for structure prediction and are not saved to disk. This creates several challenges:

  • Lost computational investment: MSA generation can take significant time, but the results are discarded after prediction

  • No reproducibility: Users cannot inspect, verify, or reuse the MSAs that were actually used in their structure predictions

  • Inefficient workflows: Running multiple predictions with the same sequences requires regenerating MSAs each time

  • Limited debugging: When predictions fail or produce unexpected results, users cannot examine the MSA quality or coverage

Proposed Solution

Add a --save_msa flag that saves all MSA files used during prediction (whether server-generated or user-provided) to the output directory in standard a3m format.

Example Usage

# Save MSAs when using the MSA server
boltz predict input.yaml --use_msa_server --save_msa

# Save MSAs when using local files (for consistency/archival)
boltz predict input_with_msas.yaml --save_msa

cbologa avatar Jun 08 '25 22:06 cbologa

Actually, the MSAs do get dumped, though that's not really documented. If you look at the output folder, there is a subfolder named msa.

jwohlwend avatar Jun 09 '25 07:06 jwohlwend

A somewhat related question: I figured out how to compute a custom MSA, but how do you deal with templates? The ColabFold server usually handles that part, but Boltz asks for a specific .pdb file as the template, and I'm not sure how to deal with that. Thank you!

bio-rat avatar Jun 09 '25 08:06 bio-rat

I saw the folder, but it's empty after my runs. I'm using boltz 2.0.3.

cbologa avatar Jun 09 '25 15:06 cbologa

Were you running with the MSA server or with a custom MSA when you ended up with an empty folder? I have verified that the folder is non-empty. Please share a way for me to replicate this!

jwohlwend avatar Jun 10 '25 02:06 jwohlwend

@bio-rat we do not support automated templating. The goal of our templating logic is to let people use the specific templates they want. Automated templating is something we considered, but in our opinion it doesn't have as strong a use case (no major performance gains are expected from adding templates that the model has already seen during training).

jwohlwend avatar Jun 10 '25 02:06 jwohlwend

Never mind, I was using the YAML setting msa: empty, which is why I didn't get the MSA in the output. Without it, I now get:

$ ls -R boltz_results_test/msa
boltz_results_test/msa:
test_0.csv  test_unpaired_tmp_env

boltz_results_test/msa/test_unpaired_tmp_env:
bfd.mgnify30.metaeuk30.smag30.a3m  msa.sh  out.tar.gz  pdb70.m8  uniref.a3m

Which of those a3m files can I use to resubmit the job for the same protein with a different ligand, without calling the MSA server again? Or do I have to query the MSA server for each different ligand? Here is my current YAML file:

$ cat test.yaml
version: 1
sequences:
  - protein:
      id: A
      sequence: MSPECARAAGDAPLRSLEQANRTRFPFFSDVKGDHRLVLAAVETTVLVLIFAVSLLGNVCALVLVARRRRRGATACLVLNLFCADLLFISAIPLVLAVRWTEAWLLGPVACHLLFYVMTLSGSVTILTLAAVSLERMVCIVHLQRGVRGPGRRARAVLLALIWGYSAVAALPLCVFFRVVPQRLPGADQEISICTLIWPTIPGEISWDVSFVTLNFLVPGLVIVISYSKILQITKASRKRLTVSLAYSESHQIRVSQQDFRLFRTLFLLMVSFFIMWSPIIITILLILIQNFKQDLVIWPSLFFWVVAFTFANSALNPILYNMTLCRNEWKKIFCCFWFPEKGAILTDTSVKRNDLSIISG
  - ligand:
      id: B
      smiles: 'O=S1(=O)C2=CC=CC=C2CN1C1=CC=CC(OC2=CC=CC=C2)=C1'
properties:
  - affinity:
      binder: B

cbologa avatar Jun 10 '25 03:06 cbologa

I suspect that I just need to concatenate uniref.a3m and bfd.mgnify30.metaeuk30.smag30.a3m and use the resulting file as the local MSA. Could you please confirm if that's the case?
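
An untested sketch of what reuse could look like, in case it helps anyone landing here. It assumes that the protein msa: field (the same one that takes empty above) also accepts a path to a saved MSA file, and that the per-chain CSV written to the msa folder (test_0.csv in the listing above) is the file to point at. Neither assumption is confirmed in this thread. write_inputs and the ligand names are hypothetical helpers, not part of Boltz.

```python
from pathlib import Path

# Template copied from the test.yaml above, with one added line: `msa:`.
# ASSUMPTION: Boltz accepts a file path in the protein `msa:` field.
TEMPLATE = """version: 1
sequences:
  - protein:
      id: A
      sequence: {sequence}
      msa: {msa_path}
  - ligand:
      id: B
      smiles: '{smiles}'
properties:
  - affinity:
      binder: B
"""


def write_inputs(sequence, msa_path, ligands, out_dir):
    """Write one YAML input per ligand, all pointing at the same saved MSA.

    `ligands` maps a hypothetical ligand name to its SMILES string.
    Returns the list of written paths, in the order of `ligands`.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, smiles in ligands.items():
        path = out_dir / f"{name}.yaml"
        path.write_text(
            TEMPLATE.format(sequence=sequence, msa_path=msa_path, smiles=smiles)
        )
        paths.append(path)
    return paths
```

Each generated file could then be passed to boltz predict without --use_msa_server, provided the msa: path assumption holds.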

cbologa avatar Jun 10 '25 20:06 cbologa

I would love to know the answer to this. I'm sitting in my own directory looking at both of these .a3m files, wondering if I can make use of them somehow.

jenchem-ot avatar Jun 27 '25 16:06 jenchem-ot

Earlier I opened PR #461, which allows you to generate the MSA output without having to download or run the full Boltz model. Behind the scenes, it still relies on the msa_server, so it's not suitable for sensitive data. But it can still be useful for generating MSA files in a more efficient and lightweight way, and it produces the exact same output as boltz does during its prediction calculations.

As for how the concatenation works: it's handled by the compute_msa function in main.py. You can find the relevant snippet here:

def compute_msa(
    data: dict[str, str],
    target_id: str,
    msa_dir: Path,
    msa_server_url: str,
    msa_pairing_strategy: str,
) -> None:
    """Compute the MSA for the input data.

    Parameters
    ----------
    data : dict[str, str]
        The input protein sequences.
    target_id : str
        The target id.
    msa_dir : Path
        The msa directory.
    msa_server_url : str
        The MSA server URL.
    msa_pairing_strategy : str
        The MSA pairing strategy.

    """
    if len(data) > 1:
        paired_msas = run_mmseqs2(
            list(data.values()),
            msa_dir / f"{target_id}_paired_tmp",
            use_env=True,
            use_pairing=True,
            host_url=msa_server_url,
            pairing_strategy=msa_pairing_strategy,
        )
    else:
        paired_msas = [""] * len(data)

    unpaired_msa = run_mmseqs2(
        list(data.values()),
        msa_dir / f"{target_id}_unpaired_tmp",
        use_env=True,
        use_pairing=False,
        host_url=msa_server_url,
        pairing_strategy=msa_pairing_strategy,
    )

    for idx, name in enumerate(data):
        # Get paired sequences
        paired = paired_msas[idx].strip().splitlines()
        paired = paired[1::2]  # ignore headers
        paired = paired[: const.max_paired_seqs]

        # Set key per row and remove empty sequences
        keys = [idx for idx, s in enumerate(paired) if s != "-" * len(s)]
        paired = [s for s in paired if s != "-" * len(s)]

        # Combine paired-unpaired sequences
        unpaired = unpaired_msa[idx].strip().splitlines()
        unpaired = unpaired[1::2]
        unpaired = unpaired[: (const.max_msa_seqs - len(paired))]
        if paired:
            unpaired = unpaired[1:]  # drop the query row, it is already present in the paired rows

        # Combine
        seqs = paired + unpaired
        keys = keys + [-1] * len(unpaired)

        # Dump MSA
        csv_str = ["key,sequence"] + [f"{key},{seq}" for key, seq in zip(keys, seqs)]

        msa_path = msa_dir / f"{name}.csv"
        with msa_path.open("w") as f:
            f.write("\n".join(csv_str))
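
For a single protein chain there are no paired rows, so the loop above reduces to a few lines. Here is a minimal sketch of that unpaired branch, assuming the a3m text has exactly one header line followed by one sequence line per entry (the same assumption the [1::2] slice makes). a3m_to_boltz_csv and its max_seqs parameter are hypothetical names; max_seqs stands in for const.max_msa_seqs, whose value is not shown here. Stripping null bytes and blank lines is an extra precaution for raw server files, not something the snippet above does.

```python
def a3m_to_boltz_csv(a3m_text: str, max_seqs: int) -> str:
    """Turn single-chain a3m text into the key,sequence CSV layout.

    Mirrors the unpaired branch of compute_msa: every row gets key -1,
    which is the value the snippet assigns to unpaired sequences.
    """
    # Precaution (assumption): drop null bytes and blank lines that raw
    # server output may contain, so headers and sequences still alternate.
    cleaned = a3m_text.replace("\x00", "")
    lines = [line.strip() for line in cleaned.splitlines() if line.strip()]

    seqs = lines[1::2]       # every second line is a sequence, headers skipped
    seqs = seqs[:max_seqs]   # cap the MSA depth

    rows = ["key,sequence"] + [f"-1,{seq}" for seq in seqs]
    return "\n".join(rows)
```

Calling it on three entries with max_seqs=2 keeps the query and the first hit only. The sketch does not deduplicate sequences, so concatenating two a3m files that both start with the query would list the query twice.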

Jnelen avatar Jul 10 '25 16:07 Jnelen