
Sequence Design in RFpeptides paper

Open hualinhVN opened this issue 2 months ago • 3 comments

Hello everyone, I'm currently implementing a workflow based on the recent RFpeptides paper (Rettie et al., 2025, Nature Chemical Biology) and had a question about the sequence design step. I'd appreciate any insights from the community.

The Paper's Workflow: The authors describe an iterative, 4-round process for each diffused backbone, which (as I understand it) looks like this:

  1. Run ProteinMPNN on the RFdiffusion backbone (using a temperature of 0.0001) to get the single best sequence.
  2. Run Rosetta FastRelax on the new sequence/backbone complex.
  3. Use the relaxed backbone from the previous step as the new input for ProteinMPNN (again at T=0.0001).
  4. Repeat this MPNN-Relax loop for a total of 4 cycles.
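
For concreteness, the four-round loop above can be sketched as a small driver. Note that `run_mpnn` and `fast_relax` here are placeholder callables standing in for the actual ProteinMPNN and Rosetta FastRelax invocations, not real APIs:

```python
def mpnn_relax_cycle(backbone, run_mpnn, fast_relax, n_cycles=4, temperature=0.0001):
    """Iterate sequence design and relaxation, as described in the paper's workflow.

    run_mpnn(backbone, temperature) -> sequence   # placeholder for a ProteinMPNN call
    fast_relax(backbone, sequence) -> backbone    # placeholder for a Rosetta FastRelax call
    """
    sequence = None
    for _ in range(n_cycles):
        sequence = run_mpnn(backbone, temperature)  # near-greedy design at T=0.0001
        backbone = fast_relax(backbone, sequence)   # relaxed backbone feeds the next round
    return sequence, backbone
```

The point of the structure is that each round's design sees the backbone as relaxed under the previous round's sequence, so sequence and structure are pushed toward mutual consistency.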

My Alternative Workflow Idea: I was considering an alternative, and potentially computationally cheaper, approach to achieve sequence diversity:

  1. Take the original, single backbone from RFdiffusion and generate 4 sequences on this same fixed backbone using LigandMPNN, but use a higher temperature (e.g., T=0.1, 0.2...?).
  2. Take each of these 4 sequences and run Rosetta FastRelax on them once.
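To illustrate what the temperature knob does in this alternative: both ProteinMPNN and LigandMPNN divide per-residue logits by the temperature before applying a softmax, so low T concentrates probability on the top residue while higher T flattens the distribution and allows diversity. A minimal sketch (the logits here are made up for the example):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert per-residue logits to amino-acid probabilities at a sampling
    temperature. Lower T sharpens toward the argmax (near-greedy design);
    higher T flattens the distribution, giving more sequence diversity."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max before exp for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for three candidate residues at one position:
probs_greedy = softmax_with_temperature([2.0, 1.5, 0.5], 0.0001)  # essentially argmax
probs_sampled = softmax_with_temperature([2.0, 1.5, 0.5], 0.2)    # noticeably flatter
```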

My Questions:

  1. What are the perceived pros and cons of my proposed workflow versus the iterative one in the paper?
  2. The authors' method seems like a local "sequence-structure co-optimization," whereas my idea is more of a "fixed-backbone sampling" followed by refinement. Is one inherently superior for this task?
  3. For those who use ProteinMPNN or LigandMPNN for sampling (not just greedy optimization), what temperature values have you found offer a good balance between meaningful diversity and sequence quality (i.e., avoiding sequences that are too random)?

Any thoughts or experiences with these different design strategies would be extremely helpful.

hualinhVN avatar Nov 04 '25 15:11 hualinhVN

The authors' method seems like a local "sequence-structure co-optimization," whereas my idea is more of a "fixed-backbone sampling" followed by refinement.

I think this encapsulates the main difference well. There's a bit of a trend currently to assume that designs which the ML models do well on are the ones more likely to be successful. (And conversely, if the ML program doesn't do well on a design, it's likely out-of-distribution for "native-like" systems, and thus much less likely to be a good design.) As such, to increase the success rate, you want to find designs where the prediction programs have high confidence and produce results consistent with the input design. -- Hence the iterated convergence approach. You keep feeding the results of the design/repredict pipeline back on itself until you come up with a design that none of the programs being used have issues with. (They all agree the design will turn out how you expect it to.)

Your suggested approach is one way to get a diversity of structures, but you might be missing out on that consistency validation. You're not necessarily vetting that ProteinMPNN thinks the updated backbone is compatible with the current sequence. -- This could be fine: there's no guarantee that self-consistent structures are better than ones generated by other methods. (It's an assumption rather than an iron-clad fact.) But it does mean you may need additional stringent filters/selection on the results. You'll generate a diversity of structures, but are they good structures? Will they fold into the structure you want and be active the way you want?
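
One cheap way to add back some of that consistency vetting is to redesign once on each relaxed backbone and measure sequence recovery against the design you started from; low recovery would suggest the relaxed backbone is no longer compatible with the designed sequence. The helper below is a hypothetical sketch for that filter, not part of any of the tools mentioned:

```python
def sequence_recovery(designed, redesigned):
    """Fraction of positions where a sequence redesigned on the relaxed
    backbone agrees with the original designed sequence. Use as a crude
    self-consistency filter: keep designs above some recovery threshold."""
    if len(designed) != len(redesigned):
        raise ValueError("sequences must be the same length")
    matches = sum(a == b for a, b in zip(designed, redesigned))
    return matches / len(designed)
```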

Also keep in mind that the only information you're actually feeding into the experiment is the sequence of the design -- you can't specify the structure except through the sequence. As such, any relax step that happens after the final sequence design step is only valuable to the extent that it helps you select which sequences to take forward to experimental testing.

roccomoretti avatar Nov 04 '25 16:11 roccomoretti

@roccomoretti Thank you so much! I have one more question: is there any particular reason behind using exactly 4 loops, rather than 3, 5, or any other number?

hualinhVN avatar Nov 04 '25 17:11 hualinhVN

I don't know why they chose 4 cycles -- my guess is that it's what they found generally works well in practice, giving decently convergent results without spending too much computational time.

roccomoretti avatar Nov 04 '25 17:11 roccomoretti