BioSimSpace
BioSimSpace copied to clipboard
Specify disulfide bridges during parameter file generation
Is your feature request related to a problem? Please describe. I have tried to prepare a system containing 4 disulfide bridges, however SS-bridge parameters were not in the output parameter file. When generating the parameter file using tleap, it is necessary to specify the SS-bridges explicitly in the following format after the pdb has been loaded.
# Load the PDBs
4lyt = loadPDB 4LYT_Fixed.pdb
# Make the disulfide bonds
bond 4lyt.6.SG 4lyt.127.SG
bond 4lyt.30.SG 4lyt.115.SG
bond 4lyt.64.SG 4lyt.80.SG
bond 4lyt.76.SG 4lyt.94.SG
With BSS, it is not possible to insert these lines in the leap configuration file because although there is a leap_commands option, extra lines are written before the system is loaded.
Describe the solution you'd like Would it be possible to have a leap_insert_flag, to decide where to insert the additional "leap_commands"? Something like "head", "pre-load", "post-load", "tail"?
Describe alternatives you've considered Or alternatively, would it be possible to have the option to specify your own tleap configuration file (as it is possible for MD configuration files)? ...Maybe this alternative would be even more useful because will allow for more flexibility.
Additional context Add any other context or screenshots about the feature request here.
hi @cespos - thanks for the post. We will discuss your suggestion and post an update (sometime next month.)
Great, thanks!
Hi @cespos. Thanks for your query, this is really helpful. The leap_commands
feature was added to solve a particular problem, i.e. the addition of custom parameters, and we obviously didn't think about all use cases, such as where you would need to inject extra commands in a different place within the LEaP
script.
As you say, ideally it would be possible to pass a fully custom config file, or modify it as you can do for a Process
. The issue is that some parameterisation functions have multi-step protocols, which call out to several backends. This means that there isn't a single configuration for the whole thing. In other cases, we might support both LEaP
and pdb2gmx
for the same FF, so the configuration options aren't universal.
I'll have a think at how to best implement this, then run some suggestions by you.
Cheers.
hi @cespos, you can also use the leap_commands
parameter to almost have a custom config file. Here is an example for your case:
import BioSimSpace as BSS
pdb = BSS.IO.readMolecules('4LYT_Fixed.pdb')
# define custom leap script except for ff14SB sourcing
custom_commands = ['4lyt = loadPDB leap.pdb', # note the use of 'leap.pdb'
'bond 4lyt.6.SG 4lyt.127.SG',
'bond 4lyt.30.SG 4lyt.115.SG',
'bond 4lyt.64.SG 4lyt.80.SG',
'bond 4lyt.76.SG 4lyt.94.SG',
'saveamberparm 4lyt custom.prm7 custom.rst7']
parm = BSS.Parameters.ff14SB(pdb.getMolecule(0), leap_commands=custom_commands)
# load the custom molecule directly from working dir
parm_mol = BSS.IO.readMolecules(f'{parm.workDir()}/custom.*')
BSS.IO.saveMolecules('4LYT_parm', parm_mol, ['rst7', 'prm7'])
I run the following to check that the S-S bonds are present in the topology:
$ cpptraj
> parm 4LYT_parm.prm7
> bondinfo :6@SG
# Mask [:6@SG] corresponds to 1 atoms.
#Bnd RK REQ Atom1 Atom2 A1 A2 T1 T2
1006 227.00 1.810 :6@CB :6@SG 96 99 2C S
1007 166.00 2.038 :6@SG :127@SG 99 1914 S S
you can see that SG in residue 6 is bonded to SG in residue 127
We should easily be able to detect the present of disulphide bonds using Sire, then generate the correct LEaP
bond command strings. For example, we could do something like following (using Sire 2023.0.0):
import BioSimSpace as BSS
from sire.legacy.Mol import Connectivity
from sire.legacy.Mol import Element
S = Element(S)
# Load the molecule.
mol = BSS.IO.readMolecules("4LYT_Fixed.pdb")[0]
# Generate the connectivity (bonding) object for the Sire molecule, then
# add this as a property.
connectivity = Connectivity(mol._sire_object)
mol._sire_object = mol._sire_object.edit().setProperty("connectivity", connectivity).molecule().commit()
# Get the bonds associated with the Sire molecule.
bonds = mol._sire_object.bonds()
# Create a list to store our disulphide bonds.
disulphides = []
# Loop over the bonds and store the disulphides.
for bond in bonds:
if bond.atom0().element() == S and bond.atom1().element() == S:
disulphides.append(bond)
We then just need to loop over the disulphides
list to create appropriately formatted bond records for LEaP
. I'll see if I can do this when I get a chance.
Okay, I've got this working on the feature-disulphide branch. The actual function that generates the records is here, which is basically the same as above with the addition of generating the record string for each bond.
The one thing that I'm concerned about is that determining the bonds via the Sire.Mol.Connectivity
object relies on sane coordinates, i.e. it uses a bond hunter to determine bonds bases on equilibrium values for given elements. If you could share the 4LYT_Fixed.pdb
file, then I could test that it generates the correct records in this case. (Any other appropriate file salong with the required bond records would be great too.)
I had a quick Google and found a file with the same name in an AMBER tutorial here. Using this, I only generate two bond records, i.e.:
import BioSimSpace as BSS
m = BSS.IO.readMolecules("4LYT_Fixed.pdb")[0]
BSS.Parameters.Protocol._protocol._get_disulphide_bonds(m._sire_object)
['bond mol.6.SG mol.127.SG', 'bond mol.64.SG mol.80.SG']
Perhaps this file is different, or maybe the Connectivity
is not a reliable way to determine the bonds. (I'll see if there's a way of tuning the bond tolerance.)
I can get it to work by tuning the tolerance, but since the bond hunter is based on covalent bonds, then I also need to tweak the maximum bond limit. I'll see what settings are most reliable. In practice, it might be best to add a new bond hunter specifically for disulphide bridges.
Could Sire parse CONECT
or SSBOND
records in a PDB when reading in a molecule? I think relying on the coordinates matching would make things very difficult if not impossible in some cases (e.g. mutating a residue to Cys and then not having the sane coordinates yet). LeAP itself uses CONECT
(so if correct CONECT
records are set, the ss bond setting is unnecessary), as an example CHARMM-GUI uses SSBOND
. Here is my fixed file for 4LYT with those records included.
Thanks, that is also an option. To date we haven't parsed those records since they are often missing or incomplete, so it's hard to detect when things are wrong. Since we rely on a well formatted PDB for paramterisation in general, I think it might be best to assume well formatted CONECT
records too. This also gives the user an easy way to specify things precisely outside of BioSImSpace too. Note that we do have code that autogenerates a CONECT
record based on the Connectivity
, but obviously this is also reliant on sane coordinates.
Ah, didn't realise that Sire also has a ChemicalBondHunter
. This appears to work perfectly (for this PDB) with the default settings. Perhaps we could autodetect things if CONECT
records aren't present, but use them when they are.
I've made some edits to Sire to make the bond searching more flexible. The user can now specify the maximum search distance and a tolerance to use when comparing equilibrium bond radii for determining bonds. I've then tuned the parameters so that it works well for a few test systems.
I'll let some other collaborators test this against some of their input in order to work out where the edge cases are. It should be easy enough to fall back on the CONECT
records, if present, or allow the user to specify things directly.
Cheers.
Hi @lohedges, Sorry I saw this only now. I will test this functionality asap and let you know how it works for me. Still feature-disulphide branch, right? Many thanks!
Yes, that's the correct branch. The way in which I search for disulphides is quite flexible to allow for non-ideal configurations. It would be good to know if it's too flexible, e.g. picking up things that it shouldn't.
I'm also working on some improved CONECT
writing ability. Currently we can generate these for all bonds, but I am tweaking to prune this to atoms non-standard residues and disulphides only. This means we might just be able to generate the CONECT
records, rather than needing to add to the tLEaP
input file. (Which approach is more robust, I'm not sure, although they use roughly the same underlying code.)
Related to other discussions here and here, I've made some improvements to the Sire.IO.PDB2
parser to improve the handling of TER
and HETATM
records,which should hopefully make parameterisation more robust.
In doing so, I have switched to writing PDBv3 compliant CONECT
records, which uses a similar approach to the one used to generate disulphide records noted above. We now write CONECT
for all mandatory fields, i.e. disulphides and inter/intra residue bonds including any atom in a non-standard residue. (Obviously this approach requires sane PDB formatting, but this is a requirement for parameterisation anyway.) I've tested this and it works correctly for the input you provide, as well as some other examples that I sourced online.
One issue that I have discovered, which would also be a problem for the previous approach, is that the atom numbers used to generate the CONECT
(or disulphide entries in the leap script) might not correspond the those in the PDB that is written, i.e. if a TER
has been injected, or if the atom has been moved to a different location and renumbered. I am working on ways to resolve this, although I don't think it will be a common problem. I am guessing that we will need to write the PDB, load it back in, then generate the CONECT
using the atom numbers from the file that was written (rather than from the original molecule).
Cheers.
Oh, and I have now added a bonds
option to appropriate parameterisation functions. This takes a list/tuple of atom pairs, which are then used to generate the LEaP
bond directives. This suffers the same issue as above, i.e. the numbers in the molecule might not correspond to those in the PDB, so will need the same workaround, whatever that is.
Okay, I've realised that I can match the atom coordinates from the Sire.System
to those in the PDB file that is written in order to obtain the correct atom numbers to use for the CONECT
records. This seems to work well and means that the CONECT
will be valid regardless of what atom re-ordering occurs on write. Will test a little further then push an update.
Hi @lohedges, I have seen that the feature-disulphide branch was merged into devel. Today, I have installed the development branch and wanted to test the newly added feature. Could you please send me some instructions? Thanks!
Assuming that you have a correctly formatted PDB file, the code should now autodetect disulphide bonds and add them to the LEaP
script for you. This has been tested on a range of inputs and has been found to work well. If you find that it's not locating the bonds correctly, e.g. if the structure is a bit weird, then there a a few options:
- Change the bond search tolerance or the maximum search distance.
- Explicitly specify the atoms to bond by passing a list (or tuple) of atom pairs. (If any of these are found by the bond search, then they will be removed to avoid duplicates.)
See here for the documentation for ff14SB
.
There's also a unit test here that checks that the parameterisation works and that the resulting AMBER topology does contain the required disulphide bonds.
I tried but it did not work. Not sure whether I have the correct BioSimSpace version
The current devel
is version 2023.0.0+137.g63b14996
so you seem to be using an older version. (114 commits behind.)
Weird, I installed it today following the instructions. I'll figure it out and get back to you.
It's possible that it's pulling in an older version because it doesn't want to change the version of Sire in your environment. To force it to use the latest you can do:
conda install -c conda-forge -c michellab/label/dev biosimspace sire=2023.0.2
Or create an environment directly (which is recommended):
conda create -n bss -c conda-forge -c michellab/label/dev biosimspace sire=2023.0.2
This worked:
conda install -c conda-forge -c michellab/label/dev biosimspace=2023.0.0=py39_137
The commands are now working... at least partially.
tleap however failed. The problem now is that tleap residues/atoms numbering should start from 1 and should be continuous. When I checked the leap.pdb in the temporary work_dir, I found that residues numbers were not continuous (no residue 16):
As a consequence, the bond specifications are not working. A similar issue will occur when proteins have multiple chains and there are disulfide bridges involving the different chains. The solution for those cases is to first run tleap to generate a processed pdb file (with continuous residues/atoms numbers). Then, check whether there are any disulfide bridges and generate the tleap bond commands. Finally, run tleap again to parametrize the protein with disulfide bonds. See also this.
This is because your original input PDB file does not have continuous residue numbers. With BioSimSpace we make the assumption that the user input file is the source of truth and has been formatted appropriately. As such, we write back things like residue numbers as is, since the user might want to cross-reference things later down a pipeline. (We don't change things behind the users back.)
If this is a common issue with PDB input files, then could you provide some examples (particularly the multiple chain issue you mention). It might be possible to fix the files, i.e. make things continuous, then remap to the original numbering after the parameterisation is complete. (We already make use of a compatibility function that maps the output topology of tLEaP
back into the one that was passed in, i.e. preserving naming etc.)
Hi @lohedges, Sorry again for taking a long time to follow up. This is the example of a protein that has disulphide bridges in two different chains, 3G8K. While this is the example of a protein having a bridge between two different chains, 5Y42.
I was thinking... Assuming that the user has not preprocessed the PDB file to have continuous residues numbers, could it be a solution to write the tleap bond lines extracting from the molecule sire_object the residue indices instead of the residue names?
Thanks for this, I'll look into it next week.
Assuming that the user has not preprocessed the PDB file to have continuous residues numbers, could it be a solution to write the tleap bond lines extracting from the molecule sire_object the residue indices instead of the residue names?
Sorry, perhaps I'm misunderstanding something here, the bonds do you mean use the residue indices instead of the number used in the original PDB, rather than the name? If LEaP
just cares about the index, rather than cross-referencing it with the PDB, then this should be a very easy fix.