posecheck
posecheck copied to clipboard
unexpected behavior when calculating key interactions, and asking for help when using your code
Hi:
Recently I've been using PoseCheck to evaluate clash, strain energy and key interactions for SBDD models. And I found the following unexpected behaviors:
- When loading CrossDocked test set proteins, a ValueError occurs:
vdw radii for types: CO. These can be defined manually using the keyword 'vdwradii'
, which results in protein loading failed. I did a quick search to resolve this bug but cannot find relavant solutions, except that copilot suggests inserting the following code:
vdw_radii = {'CO': 1.7} # Define the vdw radius for 'CO'. The value 1.7 is just an example, replace it with the correct value.
ag.guess_bonds(vdwradii=vdw_radii)
I'm not sure whether 1.7 is correct for CO, and what is ag
, how to call guess_bonds
?
- Since doing PoseCheck on >10K molecules is quite slow, multi-processing is necessary. An error would occur if multiple processes load the same protein at the same time. After some debugging, I found the following code (starting at line 60 in posecheck/utils/loading.py):
tmp_path = pdb_path.split(".pdb")[0] + "_tmp.pdb"
# Call reduce to make tmp PDB with waters
reduce_command = f"{reduce_path} -NOFLIP {pdb_path} -Quiet > {tmp_path}"
subprocess.run(reduce_command, shell=True)
# Load the protein from the temporary PDB file
prot = load_protein_prolif(tmp_path)
os.remove(tmp_path)
I guess when multiple processes accessing the same tmp_path
, a process may unexpectedly remove the file before another process load it. Thus I made the following quick fix:
tmp_path = pdb_path.split(".pdb")[0] + "_tmp.pdb"
while os.path.exists(tmp_path):
hash_code = str(hash(tmp_path))[:4]
tmp_path = tmp_path[:-8] + '_' + hash_code + "_tmp.pdb"
print(tmp_path)
This quick fix works for me. Please check if this is correct and consider update your source code for better multi-process support.
Except for those bugs, I will appreciate your help in the following cases:
- In your README tips, you said
Reading and processing all the PDB files using reduce can take a while for a large test set. If you are running PoseCheck frequently, it might be worth pre-processing all proteins yourself using prot = posecheck.utils.loading.load_protein_from_pdb(pdb_path) and setting this directly within PoseCheck using pc.protein = prot.
I guess you are mentioning the following code which could be quite time-consuming:
# Call reduce to make tmp PDB with waters
reduce_command = f"{reduce_path} -NOFLIP {pdb_path} -Quiet > {tmp_path}"
subprocess.run(reduce_command, shell=True)
How about adding an interface to preprocess all protein files? And I'm curious about how much time can be saved if proteins are preprocessed?
- Can you provide some formal code for calculating the exact numbers for key interactions? Since your interface
pc.calculate_interactions()
returns a complicated dataframe, I'm not sure whether my parsing is correct:
df = pc.calculate_interactions()
columns = np.array([column[2] for column in df.columns])
flags = np.array([df[column][0] for column in df.columns])
def count_inter(inter_type):
if len(columns) == 0:
return 0
count = sum((columns == inter_type) & flags)
return count
# ['Hydrophobic', 'HBDonor', 'VdWContact', 'HBAcceptor']
hb_donor = count_inter('HBDonor')
hb_acceptor = count_inter('HBAcceptor')
vdw = count_inter('VdWContact')
hydrophobic = count_inter('Hydrophobic')
Please verify my code and provide some formal guidelines.
- When using PoseCheck, many warning messages will be printed like this:
/opt/conda/lib/python3.9/site-packages/MDAnalysis/topology/guessers.py:146: UserWarning: Failed to guess the mass for the following atom types: CO
warnings.warn("Failed to guess the mass for the following atom types: {}".format(atom_type))
/opt/conda/lib/python3.9/site-packages/MDAnalysis/converters/RDKit.py:473: UserWarning: No `bonds` attribute in this AtomGroup. Guessing bonds based on atoms coordinates
warnings.warn(
WARNING: atom ZN from ZN will be treated as zinc
*WARNING*: Residues GLN 84 and HIS 90 in chain B appear unbonded
and will be treated as a chain break
*WARNING*: Residues GLN 84 and HIS 90 in chain B appear unbonded
and will be treated as a chain break
*WARNING*: Residues LEU 98 and ASP 101 in chain B appear unbonded
and will be treated as a chain break
*WARNING*: Residues LEU 98 and ASP 101 in chain B appear unbonded
and will be treated as a chain break
*WARNING*: Res "ZN" not in HETATM Connection Database. Hydrogens not added.
*WARNING*: Res "K" not in HETATM Connection Database. Hydrogens not added.
*WARNING*: Res "K" not in HETATM Connection Database. Hydrogens not added.
I think this information doesn't change the results, right? Do you know how to suppress these warning message? A verbose
flag would be quite useful if this message can be turned off.
- I think many of your followers are using CrossDocked dataset. A formal and efficient evaluation script for CrossDocked (many proteins, each protein associated with many molecules) would be beneficial. I would like to discuss with you and contribute to this script.
If you want discussion, drop me an email at kevinqu16 [at] gmail [dot] com. That's all. Thank you!