Keeping included sequences constant in subsequent MSAs

Open bschilder opened this issue 6 months ago • 0 comments

Hello,

Thanks for the really helpful software!

In my case, I have a FASTA file containing personalized protein sequences derived from a biobank population. When I run colabfold_batch on this FASTA, it queries the MMseqs2 server separately for each sequence. This means that the non-input sequences included in each resulting MSA will differ somewhat from one input to the next, which complicates direct comparisons between the AF2 outputs.

colabfold_batch --save-single-representations --save-pair-representations ENST00000357654.fasta af2

Is there a way I can create separate MSAs for each personalized input sequence, but still keep the non-input sequences the same?

Here is a cartoon example of what I'm trying to achieve:

Input FASTA

>ref_seq
XXXXXX
>personalized_seq1
XYXXXX
>personalized_seq2
XXX-XX
...

Output MSA for ref

>ref_seq
XXXXXX
>msa_seqA
XXXXZX
>msa_seqB
XXQXXX
...

Output MSA for personalized_seq1

>personalized_seq1
XYXXXX
>msa_seqA
XXXXZX
>msa_seqB
XXQXXX
...

Output MSA for personalized_seq2

>personalized_seq2
XXX-XX
>msa_seqA
XXXXZX
>msa_seqB
XXQXXX
...

I've tried manually editing the a3m files, but I can't work out which sequence corresponds to my input query sequence, or whether simply replacing it with a different sequence would produce a valid new file (the new sequence can contain indels, which could break the alignment).
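
For the substitution-only case, the manual edit described above could be sketched roughly as follows. This is only a sketch under stated assumptions, not ColabFold's own method: it assumes the query is the first record in the a3m file, that any leading lines starting with `#` are header lines to be kept unchanged, and that the replacement sequence has exactly the same length as the original query and contains no gap characters. `read_a3m` and `swap_query` are hypothetical helper names. Variants with indels (such as personalized_seq2) are rejected rather than handled, since the other rows would presumably need to be realigned to the new query.

```python
def read_a3m(text):
    """Split a3m text into leading '#' lines and (header, sequence) records."""
    comments, records = [], []
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith("#") and header is None and not records:
            # Assumption: '#' lines before the first record are headers to keep.
            comments.append(line)
        elif line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line.strip():
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return comments, records


def swap_query(text, new_name, new_seq):
    """Replace the first record (assumed to be the query) with a same-length variant."""
    comments, records = read_a3m(text)
    if not records:
        raise ValueError("no sequence records found")
    _, old_query = records[0]
    if not (new_seq.isalpha() and new_seq.isupper()):
        raise ValueError("query must be ungapped uppercase residues")
    if len(new_seq) != len(old_query):
        # An indel shifts the match columns, so the other rows would no longer
        # line up with the query; this sketch does not attempt realignment.
        raise ValueError("length differs from original query (indel variant)")
    records[0] = (new_name, new_seq)
    lines = comments + [f">{h}\n{s}" for h, s in records]
    return "\n".join(lines) + "\n"
```

With the cartoon example, `swap_query(ref_a3m_text, "personalized_seq1", "XYXXXX")` would return the reference MSA with only the first record changed, while `"XXX-XX"` would raise a `ValueError`.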

Many thanks in advance!

Brian M. Schilder, PhD
Postdoctoral Research Scientist
Simons Center for Quantitative Biology
CV | bschilder.github.io/CV/CV
LinkedIn | linkedin.com/in/brian-schilder
Lab | koo-lab.github.io

bschilder, Jul 03 '25 14:07