
mmpdb transform behaves unexpectedly

Open mu-wang opened this issue 3 years ago • 5 comments

The transform rules in mmpdblib appear to miss some obvious cases.

A test case with the following structures:

OC(c(cccc1)c1O)=O	 mol1
CCCCCCCC(c(cc1)cc(C(O)=O)c1O)=O	mol2
CCCCCC(c(cc1)cc(C(O)=O)c1O)=O	mol3

with some properties:

ID	prop
mol1	0.0
mol2	1.0
mol3	1.5

I performed the fragmentation, index and property loading as instructed.

python -m mmpdblib fragment test_struct.tsv --max-rotatable-bonds 20 --num-cuts 3 -o test.fragments
python -m mmpdblib index test.fragments -o test.mmpdb
python -m mmpdblib loadprops --properties test_prop.tsv test.mmpdb

The indexed pairs make sense.

However, when I run:

python -m mmpdblib transform --smiles 'OC(c(cccc1)c1O)=O' test.mmpdb --explain

I noticed that I cannot get mol2 or mol3, even though the rules mol1->mol2 and mol1->mol3 are included in the index step. Did I miss something here? Thank you for your help.

Here's the explanation output:

WARNING: APSW not installed. Falling back to Python's sqlite3 module.
Processing fragment Fragmentation(1, 'N', 7, '1', '*c1ccccc1O', '0', 3, '1', '*C(=O)O', 'O=CO')
  variable '*c1ccccc1O' not found as SMILES '[*:1]c1ccccc1O'
  No matching rule SMILES found. Skipping fragment.
Processing fragment Fragmentation(1, 'N', 3, '1', '*C(=O)O', '0', 7, '1', '*c1ccccc1O', 'Oc1ccccc1')
  variable '*C(=O)O' not found as SMILES '[*:1]C(=O)O'
  No matching rule SMILES found. Skipping fragment.
Processing fragment Fragmentation(2, 'N', 6, '11', '*c1ccccc1*', '01', 4, '12', '*C(=O)O.*O', None)
  variable '*c1ccccc1*' not found as SMILES '[*:1]c1ccccc1[*:2]'
  variable '*c1ccccc1*' not found as SMILES '[*:2]c1ccccc1[*:1]'
  No matching rule SMILES found. Skipping fragment.
Processing fragment Fragmentation(1, 'N', 1, '1', '*O', '0', 9, '1', '*c1ccccc1C(=O)O', 'O=C(O)c1ccccc1')
  variable '*O' not found as SMILES '[*:1]O'
  No matching rule SMILES found. Skipping fragment.
Processing fragment Fragmentation(1, 'N', 9, '1', '*c1ccccc1C(=O)O', '0', 1, '1', '*O', 'O')
  variable '*c1ccccc1C(=O)O' not found as SMILES '[*:1]c1ccccc1C(=O)O'
  No matching rule SMILES found. Skipping fragment.
Processing fragment Fragmentation(2, 'N', 6, '11', '*c1ccccc1*', '01', 4, '12', '*O.*C(=O)O', None)
  variable '*c1ccccc1*' not found as SMILES '[*:1]c1ccccc1[*:2]'
  variable '*c1ccccc1*' not found as SMILES '[*:2]c1ccccc1[*:1]'
  No matching rule SMILES found. Skipping fragment.
== Product SMILES in database: 0 ==
ID      SMILES  prop_from_smiles        prop_to_smiles  prop_radius     prop_fingerprint        prop_rule_environment_id        prop_count      prop_avg        prop_std        prop_kurtosis   prop_skewness   prop_min        prop_q1 prop_median     prop_q3 prop_max        prop_paired_t   prop_p_value

mu-wang, Feb 05 '22 12:02

I believe what's happening is that transform works on the variable part, but hydrogens aren't treated as the variable *[H] but instead are treated as a special case.

If so, I don't remember if transformation from a hydrogen was deliberately not included in the "transform" operation, or if it was an oversight.

adalke, Apr 12 '22 07:04

As Jérôme and Christian point out, hydrogen transformations were explicitly not included as there would be too many.

The transform option lets you specify a specific hydrogen to consider, by denoting it with an explicit [H] in the SMILES string.

However, that code path has not been used for years and it does not work in the main mmpdb release. (RDKit changed its wildcard representation from [*] to * about five years ago, and mmpdb used a hard-coded [*][H] to recognize the cut hydrogen SMILES fragment.)

The fixed code is in the v3 development version, available from https://github.com/adalke/mmpdb/tree/v3-dev .
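
For illustration only, here is a rough sketch of what the query from the original report might look like with the hydrogen written explicitly. It assumes the v3-dev version is installed and that test.mmpdb was built as in the report. The SMILES is mol1 rewritten so that the hydrogen para to the phenol OH (the position substituted in mol2 and mol3) appears as [H]. I have not run this, so the actual output may differ.

```shell
# mol1 (salicylic acid) with one ring hydrogen written explicitly as [H].
# Assumption: this is the hydrogen replaced by the acyl chain in mol2/mol3.
QUERY='[H]c1ccc(O)c(C(=O)O)c1'

# Only attempt the transform if mmpdblib can be imported.
if python -c 'import mmpdblib' 2>/dev/null; then
    python -m mmpdblib transform --smiles "$QUERY" test.mmpdb --explain
else
    echo "mmpdblib not installed; skipping transform"
fi
```

The single quotes around the SMILES keep the shell from interpreting the brackets and parentheses.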

adalke, Apr 12 '22 14:04

Hi @adalke , thank you for your help. I will try the v3-dev version of mmpdb.

mu-wang, Jun 07 '22 19:06