Final sequence is all UNKNOWN amino acids
Hi team,
I ran your inference script:
./scripts/run_inference.py 'contigmap.contigs=[150-150]' inference.output_prefix=test_outputs/test inference.num_designs=10
However, when I check seq_t after each timestep and final_seq, every position in both is 21, which means the unknown amino acid. You then map 21 to 7 (G, i.e. GLY), so the generated PDB file contains a sequence of GLYs. Did I run it correctly? Could you explain why the output after denoising is still unknown? Thank you so much!
RFdiffusion is a tool for generating backbone structures. It does not attempt to determine the amino acid identity of the newly created residues. As such, it's expected that the output structure will be poly-glycine in the newly diffused regions.
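To illustrate why an all-21 sequence ends up as glycine in the PDB, here is a minimal sketch of that kind of token-to-residue mapping. It is not the actual RFdiffusion code: the function name tokens_to_resnames and the alphabet ordering (three-letter codes in alphabetical order, which puts GLY at index 7, matching the mapping you describe) are assumptions for illustration only.

```python
# Hypothetical residue alphabet: the 20 standard amino acids, ordered
# alphabetically by three-letter code. This ordering is an assumption.
AA3 = [
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
]
GLY_INDEX = AA3.index("GLY")  # 7 in this ordering


def tokens_to_resnames(seq_tokens):
    """Map integer sequence tokens to three-letter residue names.

    Any token outside the 20 standard amino acids (for example an
    unknown/mask token such as 21) is written out as glycine, because a
    backbone-only design has no residue identity to report.
    """
    names = []
    for token in seq_tokens:
        if 0 <= token < len(AA3):
            names.append(AA3[token])
        else:
            names.append(AA3[GLY_INDEX])
    return names


# A fully diffused 5-residue design: every position carries token 21.
designed = [21] * 5
print(tokens_to_resnames(designed))
```

So a sequence that is 21 everywhere is the expected state for unconditionally diffused residues; glycine is only the placeholder written to the file.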
The standard protocol is to run a tool like ProteinMPNN or LigandMPNN on the structures output by RFdiffusion to determine what sequence will fold to that structure.
Alternatively, there are tools like ProteinGenerator which attempt to do simultaneous backbone and sequence generation. However, most of the recent successes for protein design in the community have been with the RFdiffusion/ProteinMPNN pipeline, rather than with a combined tool. Even when approaches for backbone generation (e.g. AlphaFold hallucination) which do nominally predict sequences have been used, it's sometimes found that better results can be obtained by discarding the generated sequence and using a second step of a dedicated structure->sequence protocol (e.g. BindCraft). So depending on your system, a combined prediction isn't necessarily going to give better results versus a two-stage protocol.
Thanks for your explanation! One more question: is the backbone structure output the 3D coordinates shown in the example above? And from that structure output, will another tool like ProteinMPNN determine which kinds of atoms are placed at these coordinates? Is my understanding correct? Thanks!
The output PDB files from RFdiffusion include the generated backbone 3D coordinates. ProteinMPNN and other such sequence design tools (typically) do not change the backbone coordinates, but they will change the residue identities, and (for those which output PDBs) will create the atom coordinates for the sidechain atoms.
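If you want to inspect those backbone coordinates yourself, a minimal sketch using only the fixed-column layout of PDB ATOM records could look like this. The helper names (format_atom, read_backbone) and the sample coordinates are made up for illustration, and the sketch assumes the usual backbone atom names N, CA, C and O.

```python
BACKBONE = {"N", "CA", "C", "O"}


def format_atom(serial, name, resname, chain, resseq, x, y, z):
    """Write one ATOM record using the fixed PDB column layout."""
    return (
        f"ATOM  {serial:5d}  {name:<3s} {resname:3s} {chain}{resseq:4d}"
        f"    {x:8.3f}{y:8.3f}{z:8.3f}"
    )


def read_backbone(pdb_text):
    """Collect backbone atom coordinates per residue from PDB text.

    Returns a dict keyed by (chain, residue number); each value holds the
    residue name and a dict of atom name -> (x, y, z).
    """
    residues = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        atom = line[12:16].strip()      # columns 13-16: atom name
        if atom not in BACKBONE:
            continue
        key = (line[21], int(line[22:26]))  # chain ID, residue number
        entry = residues.setdefault(
            key, {"resname": line[17:20].strip(), "coords": {}}
        )
        entry["coords"][atom] = (
            float(line[30:38]),         # columns 31-38: x
            float(line[38:46]),         # columns 39-46: y
            float(line[46:54]),         # columns 47-54: z
        )
    return residues


# Made-up two-residue poly-glycine fragment, only to exercise the parser.
sample = "\n".join([
    format_atom(1, "N", "GLY", "A", 1, 0.000, 0.000, 0.000),
    format_atom(2, "CA", "GLY", "A", 1, 1.458, 0.000, 0.000),
    format_atom(3, "C", "GLY", "A", 1, 2.009, 1.420, 0.000),
    format_atom(4, "O", "GLY", "A", 1, 1.251, 2.390, 0.000),
    format_atom(5, "N", "GLY", "A", 2, 3.332, 1.536, 0.000),
    format_atom(6, "CA", "GLY", "A", 2, 3.970, 2.846, 0.000),
])

backbone = read_backbone(sample)
for (chain, resseq), info in sorted(backbone.items()):
    print(chain, resseq, info["resname"], sorted(info["coords"]))
```

Running the same parser on a file before and after sequence design should show the residue names changing while the backbone coordinates stay (typically) the same.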