
Final sequence is UNKNOWN amino acids

Open lvtuan98 opened this issue 10 months ago • 3 comments

Hi team,

I ran your inference script:

./scripts/run_inference.py 'contigmap.contigs=[150-150]' inference.output_prefix=test_outputs/test inference.num_designs=10

However, when I check seq_t after each timestep and final_seq, every position in both is 21, which corresponds to the unknown amino acid. The code then maps 21 to 7 (G, i.e. GLY), so the generated PDB file contains a sequence of all glycines. Did I run it correctly? Could you explain why the output is still unknown after denoising? Thank you so much!

Image

Image

lvtuan98 avatar Feb 27 '25 10:02 lvtuan98

RFdiffusion is a tool for generating protein backbone structures. It does not attempt to determine the amino acid identity of the newly created residues, so it's expected that the output structure is poly-glycine in the newly diffused regions.
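
To confirm that an output file is poly-glycine as described, one option is to read the residue names straight from the ATOM records. This is a minimal sketch that relies only on the fixed-column PDB format; the helper names `residue_names` and `glycine_fraction` are made up for illustration and are not part of RFdiffusion.

```python
from collections import OrderedDict


def residue_names(pdb_text):
    """Return one three-letter residue name per residue, in file order."""
    residues = OrderedDict()
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        res_name = line[17:20].strip()   # PDB columns 18-20: residue name
        chain_id = line[21:22]           # PDB column 22: chain identifier
        res_seq = line[22:27]            # PDB columns 23-27: residue number + insertion code
        # Keep the first name seen for each (chain, residue number) pair.
        residues.setdefault((chain_id, res_seq), res_name)
    return list(residues.values())


def glycine_fraction(pdb_text):
    """Fraction of residues named GLY (0.0 if the text has no ATOM records)."""
    names = residue_names(pdb_text)
    if not names:
        return 0.0
    return sum(name == "GLY" for name in names) / len(names)
```

For an unconditional run like the one in the question, calling `glycine_fraction(open("test_outputs/test_0.pdb").read())` would be expected to return 1.0, assuming the output files are named after the `inference.output_prefix` with a design index appended.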

The standard protocol is to run a tool like ProteinMPNN or LigandMPNN on the structures output by RFdiffusion to determine what sequence will fold to that structure.
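
As a rough sketch of that second stage, the snippet below builds one ProteinMPNN command line per RFdiffusion backbone. It only assembles argument lists and does not run anything. The flags used (`--pdb_path`, `--out_folder`, `--num_seq_per_target`, `--sampling_temp`) are the ones from ProteinMPNN's `protein_mpnn_run.py` examples; the script location, the output folder layout, the default values, and the helper name `build_mpnn_commands` are assumptions to adapt to your own setup.

```python
from pathlib import Path


def build_mpnn_commands(pdb_paths, out_folder,
                        script="protein_mpnn_run.py",  # assumed path to your ProteinMPNN checkout
                        num_seq=8, temperature=0.1):    # illustrative defaults, not recommendations
    """Return one argv list per backbone PDB, ready for subprocess.run."""
    commands = []
    for pdb_path in sorted(map(str, pdb_paths)):
        # One output sub-folder per design, named after the PDB file (an assumed layout).
        design_out = Path(out_folder) / Path(pdb_path).stem
        commands.append([
            "python", script,
            "--pdb_path", pdb_path,
            "--out_folder", str(design_out),
            "--num_seq_per_target", str(num_seq),
            "--sampling_temp", str(temperature),
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`, for example over `Path("test_outputs").glob("*.pdb")` for the run in the question.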

Alternatively, there are tools like ProteinGenerator which attempt to do simultaneous backbone and sequence generation. However, most of the recent successes for protein design in the community have been with the RFdiffusion/ProteinMPNN pipeline, rather than with a combined tool. Even with backbone generation approaches that nominally do predict sequences (e.g. AlphaFold hallucination), it's sometimes found that better results can be obtained by discarding the generated sequence and using a second step of a dedicated structure->sequence protocol (e.g. BindCraft). So depending on your system, a combined prediction isn't necessarily going to give better results than a two-stage protocol.

roccomoretti avatar Feb 27 '25 15:02 roccomoretti

Thanks for your explanation! One more question: is the backbone structure output the set of 3D coordinates shown in the example above? And from that structure output, another tool like ProteinMPNN determines which kinds of atoms are placed at those coordinates. Is my understanding correct? Thanks!

lvtuan98 avatar Feb 28 '25 04:02 lvtuan98

The output PDB files from RFdiffusion include the generated backbone 3D coordinates. ProteinMPNN and other such sequence design tools (typically) do not change the backbone coordinates, but they will change the residue identities, and (for those which output PDBs) will create the atom coordinates for the sidechain atoms.
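
To make the "backbone 3D coordinates" concrete, here is a minimal sketch that pulls the backbone atoms out of PDB text by fixed column position. It assumes the backbone is the usual N, CA, C and O atoms; the function name `backbone_coordinates` is hypothetical and not part of RFdiffusion or ProteinMPNN.

```python
BACKBONE_ATOMS = ("N", "CA", "C", "O")  # assumed backbone atom set


def backbone_coordinates(pdb_text):
    """Map (chain, residue number) -> {atom name: (x, y, z)} for backbone atoms only."""
    backbone = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()          # PDB columns 13-16: atom name
        if atom_name not in BACKBONE_ATOMS:
            continue                              # skip sidechain atoms such as CB
        key = (line[21:22], int(line[22:26]))     # chain ID, residue number
        xyz = (float(line[30:38]),                # PDB columns 31-38: x
               float(line[38:46]),                # PDB columns 39-46: y
               float(line[46:54]))                # PDB columns 47-54: z
        backbone.setdefault(key, {})[atom_name] = xyz
    return backbone
```

Running this on a backbone PDB and on a sequence-designed PDB of the same structure should give matching coordinates if the design tool left the backbone untouched; the designed file would simply carry additional sidechain atoms that this function ignores.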

roccomoretti avatar Apr 25 '25 18:04 roccomoretti