Enzyme Design Example
I'm consistently seeing issues with strange bonds and disconnected atoms in my RFdiffusion3 outputs for motif scaffolding, particularly around join points and side chains. I'm trying to understand if this is expected behavior that should be fixed with post-processing (relaxation), or if it indicates a problem with the coordinate generation.
Motif:
Generated design:
Configuration File:
{
  "test": {
    "input": "/home/ubuntu/input.pdb",
    "length": "180-200",
    "unindex": "A108,A139,A152,A156",
    "ligand": "NAI,ACT",
    "select_fixed_atoms": {
      "A108": "ND2,CG",
      "A139": "OG,CB,CA",
      "A152": "OH,CZ",
      "A156": "NZ,CE,CD",
      "ACT": "OXT",
      "NAI": ""
    }
  }
}
Inference command: rfd3 design out_dir=/home/ubuntu/out inputs=/home/ubuntu/config.json ckpt_path=/home/ubuntu/rfd3_latest.ckpt diffusion_batch_size=1
Logs:
/usr/local/lib/python3.12/dist-packages/hydra/_internal/config_loader_impl.py:216: UserWarning: provider=hydra.searchpath in main, path=configs is not available.
warnings.warn(
Environment variable CCD_MIRROR_PATH not set. Will not be able to use function requiring this variable. To set it you may:
(1) add the line 'export VAR_NAME=path/to/variable' to your .bashrc or .zshrc file
(2) set it in your current shell with 'export VAR_NAME=path/to/variable'
(3) write it to a .env file in the root of the atomworks.io repository
Environment variable PDB_MIRROR_PATH not set. Will not be able to use function requiring this variable. To set it you may:
(1) add the line 'export VAR_NAME=path/to/variable' to your .bashrc or .zshrc file
(2) set it in your current shell with 'export VAR_NAME=path/to/variable'
(3) write it to a .env file in the root of the atomworks.io repository
04:31:36 DEBUG transforms: Debug mode is on
04:31:38 INFO rfd3.engine: [rank: 0] Outputs will be written to /home/ubuntu/out.
04:31:38 INFO rfd3.engine: [rank: 0] Prevalidating design specification for example: config_4njej
04:31:38 WARNING atomworks.io: We can't fix formal charges without building from templates, as we need to know the true number of hydrogens bonded to a given atom, not the inferred number. This may lead to occasional inaccuracies after adding inter-residue bonds. To avoid this and fix formal charges, set add_missing_atoms = True.
04:31:38 WARNING atomworks.io: Chain A contains both polymer and non-polymer residues; separating them for processing, naming the non-polymer residues as B.
04:31:38 INFO rfd3.engine: [rank: 0] Found 0 existing example IDs in the output directory.
Using bfloat16 Automatic Mixed Precision (AMP)
04:31:48 WARNING atomworks.io: We can't fix formal charges without building from templates, as we need to know the true number of hydrogens bonded to a given atom, not the inferred number. This may lead to occasional inaccuracies after adding inter-residue bonds. To avoid this and fix formal charges, set add_missing_atoms = True.
04:31:48 WARNING atomworks.io: Chain A contains both polymer and non-polymer residues; separating them for processing, naming the non-polymer residues as B.
04:31:48 WARNING atomworks.io: The extra_fields argument will be ignored if there is no CIF file input.
04:32:58 INFO rfd3.engine: [rank: 0] Finished inference batch in 67.86 seconds.
04:32:58 INFO rfd3.engine: [rank: 0] Outputs for config_4njej_0_model_0 written to /home/ubuntu/out/config_4njej_0_model_0.
foundry commit: 395737750a15ec7caca7223197cf57bf68466147
Hi!
It's quite normal for complex active sites to take a few tries. For that reason, join_point_rmsd and insertion_rmsd are logged to the output JSON, which can help you pick out successful scaffolds.
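As a rough sketch of that filtering step, the Python below scans an output directory for JSON files and keeps the designs whose join_point_rmsd and insertion_rmsd both fall under a cutoff. Several things here are assumptions and not confirmed behavior: the exact layout of the logged JSON is not shown in this thread, so the helper simply searches each file for those two key names at any nesting depth; the function names `find_metric` and `rank_designs` are hypothetical; and the cutoff values of 1.0 are placeholders to tune for your case.

```python
import json
from pathlib import Path


def find_metric(obj, key):
    """Return the first numeric value stored under `key` anywhere in nested JSON data, else None."""
    if isinstance(obj, dict):
        value = obj.get(key)
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return float(value)
        children = obj.values()
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_metric(child, key)
        if found is not None:
            return found
    return None


def rank_designs(out_dir, max_join_point_rmsd=1.0, max_insertion_rmsd=1.0):
    """Return (path, join_point_rmsd, insertion_rmsd) tuples under both cutoffs, lowest sum first."""
    kept = []
    for path in sorted(Path(out_dir).rglob("*.json")):
        try:
            data = json.loads(path.read_text())
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # skip files that are not readable JSON
        join = find_metric(data, "join_point_rmsd")
        ins = find_metric(data, "insertion_rmsd")
        if join is None or ins is None:
            continue  # skip JSON files that do not carry both metrics
        if join <= max_join_point_rmsd and ins <= max_insertion_rmsd:
            kept.append((path, join, ins))
    return sorted(kept, key=lambda row: (row[1] + row[2], str(row[0])))


if __name__ == "__main__":
    # Placeholder cutoffs; the output path is the out_dir from the command above.
    for path, join, ins in rank_designs("/home/ubuntu/out"):
        print(f"{path}: join_point_rmsd={join:.2f} insertion_rmsd={ins:.2f}")
```

Running it against the out_dir after generating a batch of designs gives a shortlist to inspect visually before any relaxation.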
I'd include some extra atoms in your config, which can sometimes reduce the number of mangled outputs. For example:
"select_fixed_atoms": {
  "A108": "TIP",
  "A139": "OG,CB,CA",
  "A152": "OH,CZ,CG,CB", // Same geometric constraints in practice, but provides more explicit information to the model about the ring
  "A156": "NZ,CE,CD",
  "ACT": "ALL", // Solvents get filtered out during training, so it can be good to fix them if you can
  "NAI": ""
}
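One practical note: the `//` comments in the snippet above are annotations for the reader, and standard JSON has no comment syntax, so it is safest to leave them out of the actual file. A minimal sketch of applying the suggested values to the config from the original post follows. The helper name `apply_fixed_atoms` is hypothetical, and the config is written inline so the example is self-contained.

```python
import json

# Values copied from the suggestion above, without the // comments.
SUGGESTED_FIXED_ATOMS = {
    "A108": "TIP",
    "A139": "OG,CB,CA",
    "A152": "OH,CZ,CG,CB",
    "A156": "NZ,CE,CD",
    "ACT": "ALL",
    "NAI": "",
}

# Config as posted in the question.
ORIGINAL_CONFIG = {
    "test": {
        "input": "/home/ubuntu/input.pdb",
        "length": "180-200",
        "unindex": "A108,A139,A152,A156",
        "ligand": "NAI,ACT",
        "select_fixed_atoms": {
            "A108": "ND2,CG",
            "A139": "OG,CB,CA",
            "A152": "OH,CZ",
            "A156": "NZ,CE,CD",
            "ACT": "OXT",
            "NAI": "",
        },
    }
}


def apply_fixed_atoms(config, example="test", fixed_atoms=SUGGESTED_FIXED_ATOMS):
    """Return a copy of `config` with select_fixed_atoms replaced for one example."""
    updated = json.loads(json.dumps(config))  # deep copy of JSON-compatible data
    updated[example]["select_fixed_atoms"] = dict(fixed_atoms)
    return updated


if __name__ == "__main__":
    # Print comment-free JSON that can be saved as the new config file.
    print(json.dumps(apply_fixed_atoms(ORIGINAL_CONFIG), indent=2))
```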
The config otherwise looks good. A second option that we've found can help is classifier-free guidance on the unindexed components, which can be enabled with inference_sampler.use_classifier_free_guidance=True. It takes twice as many forward passes, and whether it helps is generally case dependent.
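For completeness, here is a small sketch of how that override could be combined with the command from the original post. It assumes the flag is passed as one more key=value argument, in the same style as the other overrides on that command line. The helper name `with_cfg` is hypothetical.

```python
import shlex

# Arguments copied from the inference command in the original post.
BASE_COMMAND = [
    "rfd3",
    "design",
    "out_dir=/home/ubuntu/out",
    "inputs=/home/ubuntu/config.json",
    "ckpt_path=/home/ubuntu/rfd3_latest.ckpt",
    "diffusion_batch_size=1",
]

CFG_KEY = "inference_sampler.use_classifier_free_guidance"


def with_cfg(command, enabled=True):
    """Return a new argument list with exactly one classifier-free guidance override."""
    kept = [arg for arg in command if not arg.startswith(CFG_KEY + "=")]
    return kept + [f"{CFG_KEY}={enabled}"]


if __name__ == "__main__":
    # Print the full command line so it can be pasted into a shell.
    print(shlex.join(with_cfg(BASE_COMMAND)))
```

Since each sample then costs twice as many forward passes, it is worth comparing the join_point_rmsd and insertion_rmsd values with and without the flag on a small batch first.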