mmpdb
mmpdb transform behaves unexpectedly
The transform rules in mmpdblib appear to miss some obvious cases.
A test case with the following structures:
OC(c(cccc1)c1O)=O mol1
CCCCCCCC(c(cc1)cc(C(O)=O)c1O)=O mol2
CCCCCC(c(cc1)cc(C(O)=O)c1O)=O mol3
with some properties:
ID prop
mol1 0.0
mol2 1.0
mol3 1.5
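For anyone reproducing this, here is a minimal sketch that writes the two input files. The file names test_struct.tsv and test_prop.tsv come from the commands in this report; the tab-separated layout (SMILES then ID, no header, for structures; an "ID prop" header for properties) and the helper name write_inputs are my assumptions for illustration.

```python
from pathlib import Path

# Structures and properties exactly as listed in this report.
STRUCTURES = [
    ("OC(c(cccc1)c1O)=O", "mol1"),
    ("CCCCCCCC(c(cc1)cc(C(O)=O)c1O)=O", "mol2"),
    ("CCCCCC(c(cc1)cc(C(O)=O)c1O)=O", "mol3"),
]
PROPERTIES = [("mol1", "0.0"), ("mol2", "1.0"), ("mol3", "1.5")]


def write_inputs(directory="."):
    """Write test_struct.tsv and test_prop.tsv (hypothetical helper).

    Assumes tab-separated columns: SMILES then ID for the structure
    file, and a header row "ID prop" for the property file.
    """
    directory = Path(directory)
    struct_path = directory / "test_struct.tsv"
    prop_path = directory / "test_prop.tsv"

    struct_lines = ["\t".join(row) for row in STRUCTURES]
    struct_path.write_text("\n".join(struct_lines) + "\n")

    prop_lines = ["ID\tprop"] + ["\t".join(row) for row in PROPERTIES]
    prop_path.write_text("\n".join(prop_lines) + "\n")

    return struct_path, prop_path


if __name__ == "__main__":
    for path in write_inputs():
        print(f"wrote {path}")
```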
I performed the fragmentation, index and property loading as instructed.
python -m mmpdblib fragment test_struct.tsv --max-rotatable-bonds 20 --num-cuts 3 -o test.fragments
python -m mmpdblib index test.fragments -o test.mmpdb
python -m mmpdblib loadprops --properties test_prop.tsv test.mmpdb
The indexed pairs make sense.
However, when I run:
python -m mmpdblib transform --smiles 'OC(c(cccc1)c1O)=O' test.mmpdb --explain
I cannot get mol2 or mol3, even though the rules mol1->mol2 and mol1->mol3 are included by the index step. Did I miss something here? Thank you for your help.
Here's the explanation output:
WARNING: APSW not installed. Falling back to Python's sqlite3 module.
Processing fragment Fragmentation(1, 'N', 7, '1', '*c1ccccc1O', '0', 3, '1', '*C(=O)O', 'O=CO')
variable '*c1ccccc1O' not found as SMILES '[*:1]c1ccccc1O'
No matching rule SMILES found. Skipping fragment.
Processing fragment Fragmentation(1, 'N', 3, '1', '*C(=O)O', '0', 7, '1', '*c1ccccc1O', 'Oc1ccccc1')
variable '*C(=O)O' not found as SMILES '[*:1]C(=O)O'
No matching rule SMILES found. Skipping fragment.
Processing fragment Fragmentation(2, 'N', 6, '11', '*c1ccccc1*', '01', 4, '12', '*C(=O)O.*O', None)
variable '*c1ccccc1*' not found as SMILES '[*:1]c1ccccc1[*:2]'
variable '*c1ccccc1*' not found as SMILES '[*:2]c1ccccc1[*:1]'
No matching rule SMILES found. Skipping fragment.
Processing fragment Fragmentation(1, 'N', 1, '1', '*O', '0', 9, '1', '*c1ccccc1C(=O)O', 'O=C(O)c1ccccc1')
variable '*O' not found as SMILES '[*:1]O'
No matching rule SMILES found. Skipping fragment.
Processing fragment Fragmentation(1, 'N', 9, '1', '*c1ccccc1C(=O)O', '0', 1, '1', '*O', 'O')
variable '*c1ccccc1C(=O)O' not found as SMILES '[*:1]c1ccccc1C(=O)O'
No matching rule SMILES found. Skipping fragment.
Processing fragment Fragmentation(2, 'N', 6, '11', '*c1ccccc1*', '01', 4, '12', '*O.*C(=O)O', None)
variable '*c1ccccc1*' not found as SMILES '[*:1]c1ccccc1[*:2]'
variable '*c1ccccc1*' not found as SMILES '[*:2]c1ccccc1[*:1]'
No matching rule SMILES found. Skipping fragment.
== Product SMILES in database: 0 ==
ID SMILES prop_from_smiles prop_to_smiles prop_radius prop_fingerprint prop_rule_environment_id prop_count prop_avg prop_std prop_kurtosis prop_skewness prop_min prop_q1 prop_median prop_q3 prop_max prop_paired_t prop_p_value
I believe what's happening is that transform works on the variable part, but hydrogens aren't treated as the variable *[H]; they are handled as a special case instead.
If so, I don't remember whether transformation from a hydrogen was deliberately left out of the "transform" operation or whether it was an oversight.
As Jérôme and Christian point out, hydrogen transformations were explicitly not included as there would be too many.
The transform option lets you specify a specific hydrogen to consider, by denoting it with an explicit [H] in the SMILES string.
However, that code path has not been used for years and it does not work in the main mmpdb release. (RDKit changed its wildcard representation from [*] to * about five years ago, and mmpdb used a hard-coded [*][H] to recognize the cut hydrogen SMILES fragment.)
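To illustrate the failure mode described above, here is a minimal sketch of why an exact comparison against [*][H] stops matching once the fragment is written as *[H]. This is not mmpdb's actual code; the function and constant names are hypothetical, and the tolerant variant is only one possible way to handle both notations.

```python
# Hypothetical illustration only; these names do not come from mmpdb.
OLD_HYDROGEN_FRAGMENT = "[*][H]"  # bracketed wildcard, the older notation
NEW_HYDROGEN_FRAGMENT = "*[H]"    # bare wildcard, the newer notation


def normalize_wildcards(fragment_smiles):
    """Rewrite the bracketed wildcard '[*]' as the bare '*' form.

    Labeled wildcards such as '[*:1]' are left untouched because they
    do not contain the exact substring '[*]'.
    """
    return fragment_smiles.replace("[*]", "*")


def is_hydrogen_fragment_strict(fragment_smiles):
    """Hard-coded check: only recognizes the older notation."""
    return fragment_smiles == OLD_HYDROGEN_FRAGMENT


def is_hydrogen_fragment_tolerant(fragment_smiles):
    """Check that accepts either notation by normalizing first."""
    return normalize_wildcards(fragment_smiles) == NEW_HYDROGEN_FRAGMENT


if __name__ == "__main__":
    for smiles in ("[*][H]", "*[H]", "*O"):
        print(
            smiles,
            is_hydrogen_fragment_strict(smiles),
            is_hydrogen_fragment_tolerant(smiles),
        )
```

With the strict check, a hydrogen fragment written in the newer notation is never recognized, which would be consistent with the explicit-[H] path silently finding nothing.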
The fixed code is in the v3 development version, available from https://github.com/adalke/mmpdb/tree/v3-dev.
Hi @adalke, thank you for your help. I will try the v3-dev version of mmpdb.