multimer using mmseqs generated sequence alignments?

Open emzodls opened this issue 1 year ago • 2 comments

Hello, I'm trying to run OpenFold multimer inference on some FASTA files I have. I've been using the ColabFold databases to generate the sequence alignments, as these are smaller than the AF2 databases. This has worked for single sequences; however, I'm having issues getting it to work for multimer inference. I'm getting the following error: ValueError: Missing 'uniprot_hits.sto' This is required for Multimer MSA pairing. Is there a way to use mmseqs alignments for multimer inference, or do I have to use the AF2 alignment pipeline? Thanks.

emzodls avatar Dec 05 '23 15:12 emzodls

Hi, yes, currently you do need to use the AF2 alignment pipeline, but only for the uniprot alignments. So if you already have the mmseqs alignments, you can precompute just the uniprot files like so, skipping all the other AF2 alignments:

python scripts/precompute_alignments.py \
<input_dir> \
<output_dir> \
--uniprot_database_path <path_to_dbs>/uniprot/uniprot.fasta \
--jackhmmer_binary_path <path_to_jackhmmer_binary>

I'll look into adding functionality to avoid/replace this step.

christinaflo avatar Dec 06 '23 16:12 christinaflo
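
Before rerunning multimer inference, it may help to check which chains still lack the uniprot alignment named in the error. The sketch below is only an illustration: it assumes the alignment directory holds one subdirectory per chain, and the function name find_missing_uniprot_hits is hypothetical, not part of OpenFold. Only the file name uniprot_hits.sto comes from the error message above.

```python
from pathlib import Path

# File name taken from the ValueError quoted in this thread.
UNIPROT_HITS = "uniprot_hits.sto"


def find_missing_uniprot_hits(alignment_root):
    """Return the names of subdirectories that do not contain uniprot_hits.sto.

    Assumption (not confirmed in this thread): alignment_root holds one
    subdirectory per chain, each with that chain's alignment files.
    """
    root = Path(alignment_root)
    missing = []
    # Only directories are inspected; stray files at the top level are ignored.
    for chain_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if not (chain_dir / UNIPROT_HITS).is_file():
            missing.append(chain_dir.name)
    return missing
```

Calling find_missing_uniprot_hits("<output_dir>") on the directory passed to precompute_alignments.py would list the chains that still need the jackhmmer uniprot search, if the layout assumption holds.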

Hi, thanks for the workaround. Similar to the question above, I've been thinking about generating a custom genetic database and using mmseqs to generate the MSAs - a lot of my query sequences are poorly represented in existing databases. If I'm also adding the uniprot alignments alongside my custom MSAs, how will these two different data sources affect the outcome?

Ultimately I'm going to want to also retrain (or more realistically fine-tune) the model, but this is a different conversation.

dthorburn avatar Jan 05 '24 11:01 dthorburn

Hi,

when I run this command:

python ${OF_SCRIPTS}/precompute_alignments.py \
${IN} \
${OUT} \
--uniprot_database_path /tank/jflucier/mmseqs_dbs/uniprot/uniprot.fasta \
--jackhmmer_binary_path jackhmmer

I get this warning: WARNING:root:More than one input_sequence found in DTX1_DTX2.fa

Is this normal behavior? My fasta has 2 entries and looks like this:

>prot1
MSRPGHGGL.....
>prot2
MAMAPSPSLVQ...

Do I need to split the FASTA file before running this command?

Thanks for your help,
JF

jflucier avatar Apr 22 '24 17:04 jflucier
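
The thread does not say whether splitting is actually required, or what the warning implies for the output. If one sequence per file does turn out to be needed, a minimal sketch of splitting a multi-record FASTA could look like the following. The function names split_fasta and write_single_record_files are hypothetical, as are the .fasta extension and the choice to name each file after the first token of its header; none of this is taken from OpenFold.

```python
import re
from pathlib import Path


def split_fasta(fasta_text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated; text before the first header is ignored.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def write_single_record_files(fasta_path, out_dir):
    """Write each record of fasta_path to its own file in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i, (header, seq) in enumerate(split_fasta(Path(fasta_path).read_text())):
        tokens = header.split()
        # Hypothetical naming scheme: first header token, made filesystem-safe.
        name = re.sub(r"[^A-Za-z0-9._-]", "_", tokens[0]) if tokens else f"record_{i}"
        target = out / f"{name}.fasta"
        target.write_text(f">{header}\n{seq}\n")
        written.append(target)
    return written
```

For a file like the one above, write_single_record_files("DTX1_DTX2.fa", "split") would produce split/prot1.fasta and split/prot2.fasta. Records whose headers share a first token would overwrite each other under this naming scheme.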