
binder design

Open fanch1122 opened this issue 10 months ago • 8 comments

1.0 binder

1.1 about rosetta ddG

In binder design, Rosetta ddG < -40 is used as the screening condition. I want to know how this value is calculated; I can't find the relevant script. Is the ProteinMPNN output used as the mutant sequence, with the initial poly-Gly backbone as the reference?

1.2 about pae_interaction

In my design work I used only about 10 backbones and still got many sequences with pae_interaction < 10, which confuses me.

1.2.1 Example out.sc

SCORE: binder_aligned_rmsd pae_binder pae_interaction pae_target plddt_binder plddt_target plddt_total target_aligned_rmsd time description
SCORE: 0.592 3.476 8.182 9.139 85.556 74.229 78.750 1.542 83.551 design_ppi_0_dldesign_0_af2pred
SCORE: 0.930 5.717 13.545 9.519 70.353 68.488 69.232 1.716 7.883 design_ppi_0_dldesign_1_af2pred
SCORE: 0.885 2.977 7.414 8.460 87.404 76.294 80.728 1.496 7.833 design_ppi_0_dldesign_2_af2pred
SCORE: 2.054 7.001 13.005 9.961 66.544 65.600 65.977 2.215 7.828 design_ppi_0_dldesign_3_af2pred
SCORE: 0.780 3.225 7.507 8.800 87.345 76.268 80.564 1.516 82.688 design_ppi_1_dldesign_0_af2pred
SCORE: 1.126 3.107 6.819 7.912 88.812 79.364 83.028 1.721 7.771 design_ppi_1_dldesign_1_af2pred
SCORE: 0.687 3.506 7.869 9.009 85.927 76.209 79.978 1.124 7.590 design_ppi_1_dldesign_2_af2pred
SCORE: 0.715 3.452 7.688 9.015 87.045 76.115 80.354 1.181 7.384 design_ppi_1_dldesign_3_af2pred
SCORE: 0.753 2.880 12.781 10.139 82.801 66.895 72.926 2.832 83.061 design_ppi_2_dldesign_0_af2pred
SCORE: 4.713 6.647 17.966 10.022 67.370 64.381 65.514 7.578 7.448 design_ppi_2_dldesign_1_af2pred
SCORE: 0.585 2.578 8.057 9.152 87.885 74.091 79.321 1.455 7.893 design_ppi_2_dldesign_2_af2pred
SCORE: 5.456 8.717 21.638 9.816 66.661 65.740 66.089 10.151 7.169 design_ppi_2_dldesign_3_af2pred
SCORE: 1.223 3.388 8.150 9.291 88.463 74.946 80.523 2.229 83.386 design_ppi_3_dldesign_0_af2pred
SCORE: 1.084 3.811 8.760 9.382 86.099 73.772 78.858 2.097 8.280 design_ppi_3_dldesign_1_af2pred
SCORE: 1.239 3.116 7.614 8.654 88.546 76.457 81.444 2.109 8.154 design_ppi_3_dldesign_2_af2pred
SCORE: 1.114 4.012 8.458 9.047 85.521 74.985 79.332 1.853 8.049 design_ppi_3_dldesign_3_af2pred
SCORE: 30.849 13.027 17.585 9.168 74.318 72.932 73.446 45.073 75.428 design_ppi_4_dldesign_0_af2pred
SCORE: 32.573 9.324 14.422 8.683 87.638 76.735 80.771 47.214 7.665 design_ppi_4_dldesign_1_af2pred
SCORE: 1.036 3.308 8.186 9.418 86.905 74.156 78.876 1.598 6.808 design_ppi_4_dldesign_2_af2pred
SCORE: 1.055 3.597 8.303 9.421 85.697 73.982 78.319 1.433 7.470 design_ppi_4_dldesign_3_af2pred
SCORE: 7.624 10.777 24.481 9.155 52.892 63.817 59.142 12.690 84.331 design_ppi_5_dldesign_0_af2pred
SCORE: 9.163 11.825 26.501 8.199 51.977 69.317 61.897 19.410 8.484 design_ppi_5_dldesign_1_af2pred
SCORE: 7.530 11.392 24.215 9.933 53.076 61.796 58.064 13.101 8.735 design_ppi_5_dldesign_2_af2pred
SCORE: 2.680 7.861 13.244 9.600 68.653 68.895 68.791 3.466 9.189 design_ppi_5_dldesign_3_af2pred
SCORE: 1.138 4.400 9.203 9.374 81.033 72.488 75.802 1.893 7.801 design_ppi_6_dldesign_0_af2pred
SCORE: 1.178 4.334 9.951 9.740 80.572 70.593 74.463 1.821 7.961 design_ppi_6_dldesign_1_af2pred
SCORE: 1.221 4.154 9.192 9.283 82.260 72.068 76.021 1.859 7.870 design_ppi_6_dldesign_2_af2pred
SCORE: 9.541 13.511 20.715 10.050 63.012 65.187 64.343 14.675 8.523 design_ppi_6_dldesign_3_af2pred
...
SCORE: 1.000 4.023 8.681 9.366 83.065 72.978 77.085 1.877 7.684 design_ppi_9_dldesign_3_af2pred
...
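For screening a large out.sc like this, here is a minimal parsing sketch, assuming the whitespace-separated format above; the pae_interaction < 10 and plddt_binder > 80 cutoffs are the commonly quoted heuristics, not hard rules:

# minimal sketch: filter an AF2 initial-guess out.sc by pae_interaction
import pandas as pd

rows = []
header = None
with open("out.sc") as fh:
    for line in fh:
        if not line.startswith("SCORE:"):
            continue
        fields = line.split()[1:]  # drop the leading "SCORE:" tag
        if header is None:
            header = fields       # first SCORE: line holds the column names
        else:
            rows.append(fields)

df = pd.DataFrame(rows, columns=header)
numeric = [c for c in df.columns if c != "description"]
df[numeric] = df[numeric].astype(float)

hits = df[(df.pae_interaction < 10.0) & (df.plddt_binder > 80.0)]
print(hits[["description", "pae_interaction", "plddt_binder"]])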

fanch1122 avatar Feb 08 '25 10:02 fanch1122

@fanch1122 use this colab: https://colab.research.google.com/drive/1lu1dC0yRltKK_Wp-gF26oHcZSCiHaI8b?usp=sharing#scrollTo=UlKef9jsnFxl

ullahsamee avatar Feb 09 '25 01:02 ullahsamee

@fanch1122 use this colab: https://colab.research.google.com/drive/1lu1dC0yRltKK_Wp-gF26oHcZSCiHaI8b?usp=sharing#scrollTo=UlKef9jsnFxl

For the input to PBEE, should I choose the PDB file predicted by the af2_initial_guess/predict.py program or the PDB file generated by dl_interface_design.py?

In addition, when generating binders with RFdiffusion, should we give the complete PDB chain in RFdiffusion/scripts/run_inference.py ... 'contigmap.contigs=[A20-185/0 100-100]'? I found that if the target is truncated and only part of it is used, pae_interaction may be unreliable. Because the Colab script was not available to me, I set it up locally:

# @title Install all the dependencies
cd ~/soft/
git clone https://github.com/chavesejf/PBEE.git

PBEE_PATH=/public/home/design2/soft/PBEE
cd $PBEE_PATH
mamba env create -f environment.yml   # creates the pbee_env environment

conda activate pbee_env
pip install xgboost==2.0.1

pip install pyrosetta-installer
python -c 'import pyrosetta_installer; pyrosetta_installer.install_pyrosetta()'
# or install a pre-downloaded wheel instead:
# https://west.rosettacommons.org/pyrosetta/release/release/PyRosetta4.Release.python310.ubuntu.wheel/pyrosetta-2025.6+release.e5e4b278be-cp310-cp310-linux_x86_64.whl
pip install /public/home/design2/.cache/pip/wheels/pyrosetta-2025.6+release.e5e4b278be-cp310-cp310-linux_x86_64.whl

cd $PBEE_PATH
mkdir $PBEE_PATH/trainedmodels
unzip v1.1-20250209T163801Z-001.zip -d $PBEE_PATH/trainedmodels
mamba install -n pbee_env ipykernel
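A quick sanity check after the install (my own addition, assuming the steps above completed; run inside pbee_env):

# confirm PyRosetta and xgboost are importable before running pbee.py
import pyrosetta
import xgboost

pyrosetta.init("-mute all")
print("xgboost", xgboost.__version__)
print("PyRosetta initialized OK")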

fanch1122 avatar Feb 12 '25 08:02 fanch1122

test ΔG[bind].ipynb

import os
import csv
from IPython.display import display, HTML
import subprocess

def check_parameters(ion_dist_cutoff, mlengine, partner_1, partner_2):
    """
    Validate the inputs before building the pbee.py command.
    """
    if ion_dist_cutoff is None:
        display(HTML("<span style='color: red; font-size: 14px; font-weight: bold;'>Warning: There is no ion_dist_cutoff. Please input a value!</span>"))
        return False
    elif ion_dist_cutoff < 0:
        display(HTML("<span style='color: red; font-size: 14px; font-weight: bold;'>Warning: Invalid ion_dist_cutoff. Please input a positive value!</span>"))
        return False
    elif mlengine == "":
        # checked before the membership test below, so this branch is reachable
        display(HTML("<span style='color: red; font-size: 14px; font-weight: bold;'>Warning: There is no mlengine. Please input an available one!</span>"))
        return False
    elif mlengine not in ["sl", "lr", "en", "sv", "dt", "kn", "ad", "bg", "rf", "et", "xb"]:
        display(HTML("<span style='color: red; font-size: 14px; font-weight: bold;'>Warning: Unavailable mlengine. Please input one of the models described in the details!</span>"))
        return False
    elif partner_1 == "" or partner_2 == "":
        display(HTML("<span style='color: red; font-size: 14px; font-weight: bold;'>Warning: Empty partners. Please fill them in correctly!</span>"))
        return False
    return True

def generate_command(frcmod_struct, frcmod_scores, ion_dist_cutoff, mlengine, partner_1, partner_2, destination_path):
    """
    Build the pbee.py command line from the chosen options.
    """
    base_command = f'python3 /public/home/design2/soft/PBEE/pbee.py --ipdb {os.path.join(destination_path, "*.pdb")} --partner1 {partner_1} --partner2 {partner_2} --odir {destination_path}'
    # only append flags that differ from the script's defaults
    if frcmod_struct:
        base_command += " --frcmod_struct"
    if frcmod_scores:
        base_command += " --frcmod_scores"
    if ion_dist_cutoff != 2:
        base_command += f" --ion_dist_cutoff {ion_dist_cutoff}"
    if mlengine != "sl":
        base_command += f" --mlengine {mlengine}"
    return base_command

def run_command_and_capture_output(command):
    """
    Run a shell command and return its captured stdout (None on error).
    """
    try:
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"Command failed (exit {result.returncode}): {result.stderr.strip()}")
        return result.stdout.strip()
    except Exception as e:
        print(f"Error running command: {e}")
        return None

# directory with the input PDB files (also used for output)
destination_path = "/path/to/pdb/"
os.makedirs(destination_path, exist_ok=True)

# chain IDs of the two binding partners
partner_1 = "A"
partner_2 = "B"

# frcmod flags passed through to pbee.py (overridden in the loop below)
frcmod_struct = False
frcmod_scores = False

# ion distance cutoff
ion_dist_cutoff = 2

# machine-learning engine ("sl" is the default)
mlengine = "sl"

# CSV file collecting each command and its output
csv_file_path = os.path.join(destination_path, "output.csv")

# write one CSV row per command
with open(csv_file_path, 'w', newline='') as csvfile:
    fieldnames = ['Command', 'Output']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    print(f"Wrote header: {fieldnames}")

    # run pbee.py for every combination of the frcmod flags
    for fs, fs_scores in [(False, False), (True, False), (False, True), (True, True)]:
        frcmod_struct = fs
        frcmod_scores = fs_scores

        if not check_parameters(ion_dist_cutoff, mlengine, partner_1, partner_2):
            continue

        command = generate_command(frcmod_struct, frcmod_scores, ion_dist_cutoff, mlengine, partner_1, partner_2, destination_path)
        output = run_command_and_capture_output(command)
        writer.writerow({'Command': command, 'Output': output})
        print(f"Wrote row: Command={command}, Output={output}")

output
...
protocol: [3] total number of ions: 0
protocol: [3] calculating ΔG[bind]
protocol: [3] ΔG[bind] = -89.928 kcal/mol (KD = 1.198957e-66 M)
processing time: 00:04:56
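The KD that PBEE prints appears to follow from ΔG via KD = exp(ΔG/RT); a quick sanity check in Python (the R and T values are the usual standard-state assumptions, not taken from the PBEE source):

import math

R = 0.0019872  # gas constant, kcal/(mol*K)
T = 298.15     # assumed standard temperature, K

def kd_from_dg(dg_kcal_per_mol):
    """Dissociation constant from binding free energy: KD = exp(dG / RT)."""
    return math.exp(dg_kcal_per_mol / (R * T))

print(kd_from_dg(-89.928))  # ~1.2e-66 M, matching the log line above
print(kd_from_dg(-9.099))   # ~2.1e-07 M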

fanch1122 avatar Feb 12 '25 08:02 fanch1122

@fanch1122 use this colab: https://colab.research.google.com/drive/1lu1dC0yRltKK_Wp-gF26oHcZSCiHaI8b?usp=sharing#scrollTo=UlKef9jsnFxl

ΔG[bind] = -9.099 kcal/mol (KD = 2.136483e-07 M). Does ΔG[bind] directly correspond to the Rosetta ddG? See https://github.com/chavesejf/PBEE?tab=readme-ov-file#example-1

That example includes the structure of an antibody (HyHEL-63) bound to lysozyme C (PDB 1XGU) with a binding affinity of -11.28 kcal/mol. Why, then, do we still screen with Rosetta ddG < -40?

fanch1122 avatar Feb 14 '25 18:02 fanch1122

The numbers which come out of Rosetta protocols (including Rosetta ddG calculations) are typically in arbitrary units. It's been observed that these numbers correlate reasonably well with experimental measurements, but when you examine that correlation plot, the slope is typically not one (and the intercept may not be zero).

If you are interested in converting Rosetta-calculated scores to experimental energies, the recommendation is to do a calibration plot -- run several known examples through the protocol "as if" you were running a designed structure. Then plot calculated versus experimental and draw a line of best fit. Once you do that, you should have a reasonable conversion for the particular protocol settings you're using. (The exact conversion function can vary based on the details of the particular settings and approach you use.)

Alternatively, you can get thresholds from someone who has either done the conversion previously, or has heuristically worked out a value that generally works for the particular protocol you're working with. ... that's where the ddG<-40 number comes from. Previous users of the protocol have observed that's a good cutoff to use to separate binders from non-binders. (But keep in mind that's not a hard and fast cutoff - depending on your system and exact protocol you run, you may see successful designs with higher values, or you may need to make the threshold more stringent to narrow down designs.)
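To make the calibration plot concrete, a minimal sketch along the lines roccomoretti describes, assuming you have run the protocol on complexes with known experimental affinities (the arrays here are placeholders, not real data):

# fit calculated Rosetta ddG against experimental binding energies
import numpy as np

calc_ddg = np.array([-55.0, -42.0, -38.0, -61.0, -30.0])  # protocol output (arbitrary units)
exp_dg   = np.array([-12.1, -9.8, -8.5, -13.0, -7.2])     # experimental dG, kcal/mol

slope, intercept = np.polyfit(calc_ddg, exp_dg, 1)
r = np.corrcoef(calc_ddg, exp_dg)[0, 1]
print(f"exp_dG ~= {slope:.3f} * calc_ddG + {intercept:.2f}  (Pearson r = {r:.2f})")

# map a new design's score onto the (approximate) experimental scale
print(slope * -40.0 + intercept)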

roccomoretti avatar Feb 14 '25 20:02 roccomoretti

@roccomoretti @ullahsamee I'm still not sure I understand. In the PBEE pipeline, I get the following output:

pdb | dG_pred | total_score
design_ppi_135_dldesign_0_cycle1 | -7.264949197430281 | -182.9382275696768
design_ppi_817_dldesign_0_cycle1 | -432.69906862683206 | 113733.31551370105

Is such a ΔG[bind] value still credible? Or does total_score reflect a more reasonable value?

fanch1122 avatar Feb 18 '25 17:02 fanch1122

The numbers which come out of Rosetta protocols (including Rosetta ddG calculations) are typically in arbitrary units. [...] Previous users of the protocol have observed that's a good cutoff to use to separate binders from non-binders.

How can you say this when Baker lab papers describe experiments making 15,000 binders and testing them by yeast display to find the ones that bind: https://www.sciencedirect.com/science/article/pii/S0092867424006317?via%3Dihub

TimCraigCGPS avatar Apr 24 '25 15:04 TimCraigCGPS

pdb | dG_pred | total_score
design_ppi_817_dldesign_0_cycle1 | -432.69906862683206 | 113733.31551370105

Is such a ΔG[bind] value still credible? Or does total_score reflect a more reasonable value?

When you get ludicrously extreme ddG score values, it normally indicates there was an issue with the calculations. ddG in particular is calculated as a difference between two states, so if the reference state has an issue (e.g. is calculated as having abnormally bad energy), then the bound state is going to look abnormally good in comparison. -- In your case the total score being extremely high indicates that there was an issue with the structure, and the dG_pred being highly negative just means that the unbound-state model is insanely bad compared to the bound-state model's merely very bad one.
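(To make the two-state point concrete, here is a minimal PyRosetta-style sketch of an unrelaxed bound-minus-unbound ΔG. Real interface-ddG protocols also repack/minimize the separated state; the input filename, jump number, and scorefunction are illustrative assumptions.)

import pyrosetta
from pyrosetta.rosetta.protocols.rigid import RigidBodyTransMover

pyrosetta.init("-mute all")
scorefxn = pyrosetta.get_fa_scorefxn()

pose = pyrosetta.pose_from_pdb("design_complex.pdb")  # hypothetical two-chain complex
bound = scorefxn(pose)

# pull the binder away from the target across jump 1 to mimic the unbound state
unbound_pose = pose.clone()
trans = RigidBodyTransMover(unbound_pose, 1)
trans.step_size(500.0)  # large enough to break every interface contact
trans.apply(unbound_pose)
unbound = scorefxn(unbound_pose)

# dG = bound - unbound; if either state scores absurdly, the difference is meaningless
print(f"bound={bound:.1f}  unbound={unbound:.1f}  dG={bound - unbound:.1f}")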

How can you say this when you see in Baker lab papers they do experiments like making 15,000 binders and testing them by yeast display to find ones that bind

I'm not seeing the conflict between the two. If you're asking why 15,000 designs are needed to find just a few that bind, part of that is the compounding of errors. The observed correlation of the binding energy prediction to experimental results is best when you have experimentally determined structures of the bound state. If you don't, then you start to get issues with "is the bound state I'm modeling the actual bound state", "is the monomeric structure of the design the actual structure of the design", and "does my design actually even fold?" And then there's the issue that even though there's a correlation, it's not necessarily a good one, and there are a bunch of outliers.

I'm not trying to imply that computational binding energy predictions are perfect. I'm just saying there are reasonable grounds for incorporating a binding energy prediction into your pipeline. (It's not a perfect metric, but designs with predicted good binding energy are, on average, better performing than designs with predicted bad binding energies.)

roccomoretti avatar Apr 25 '25 16:04 roccomoretti