RMG-Py
RMG-Py copied to clipboard
`get_deterministic_sssr` is not really deterministic
A comment was added after creating this issue: There was a relevant issue before. See #1027.
1. Bug Description
The bug was identified by the witness of @jonwzheng and @hwpang when searching thermo for C1=CC2C=CC=1C=C2
, the result was non-deterministic (try refreshing the page). The issue was first mentioned in #2490. However, this is not actually a decision tree issue, so I decided to make it a separate PR.
2. Reproducing the issue with RMG APIs
import os.path as osp
import rmgpy
from rmgpy.molecule.molecule import Molecule
from rmgpy.data.thermo import ThermoDatabase
tmdb = ThermoDatabase()
tmdb.load(osp.join(rmgpy.settings["database.directory"], "thermo"))
smi = "C1=CC2C=CC=1C=C2"
mol = Molecule().from_smiles(smi)
print(tmdb.compute_group_additivity_thermo(spc.molecule[0]).comment.split(":")[2])
The above block should be the minimal code example to reproduce the issue. You may see different thermo estimation comments from the execution. I found that after import
is done, the result is deterministic. So In order to get different results, one needs to restart the kernel and import the module before actually calling the function.
3. Investigate get_deterministic_sssr
Then, I would like to briefly explain the results and the potential cause of its undeterminism.
I trace down the function calls and believe the real cause is that get_deterministic_sssr
is not deterministic, at least for the investigating molecule.
The call stack is:
within ThermoDatabase
get_all_thermo_data
> get_thermo_data_from_groups
> estimate_thermo_via_group_additivity
> compute_group_additivity_thermo
> _add_polycyclic_correction_thermo_data
> get_bicyclic_correction_thermo_data_from_heuristic
> split_bicyclic_into_single_rings
> bicyclic_submol.get_deterministic_sssr()
I typically see two kinds of results out of get_deterministic_sssr
:
SSSR result 1:
[[0, 4, 5, 7, 6, 2], [2, 1, 3, 5, 7, 6]]
The corresponding molecule graph:
(note, atom index is 1-indexed in the graph, while SSSR result is 0-indexed)
So both of the two rings identified involve three double bonds ([0,4],[5,7], [7,6]) and ([1,3],[5,7],[7,6])
SSSR result 2:
[[2, 1, 4, 3, 0, 5], [0, 3, 4, 1, 7, 6]]
The corresponding molecule graph:
(note, atom index is 1-indexed in the graph, while SSSR result is 0-indexed)
Only the first ring has three double bonds ([2,1],[4,3], [5,2]). The second ring only has two double bonds ([3,4],[7,6]).
Within get_deterministic_sssr
, the different results are due to nondeterminism when choosing the so-called root_vertex
:
https://github.com/ReactionMechanismGenerator/RMG-Py/blob/3b0b82fc2907a5618c5753eacd443258105c7876/rmgpy/molecule/molecule.py#L2608-L2613
The returned value by sorted(root_candidates_tups, key=lambda tup0: tup0[1:], reverse=True)
[(<Atom 'C'>, -603, 603),
(<Atom 'C'>, -603, 603),
(<Atom 'C'>, -603, 603),
(<Atom 'C'>, -603, 603),
(<Atom 'C'>, -603, 603),
(<Atom 'C'>, -603, 603),
(<Atom 'C'>, -879, 879),
(<Atom 'C'>, -879, 879)]
Basically, the root vertexes are iteratively picked according to the connectivity value (get_vertex_connectivity_value
), and connectivity values do not take bond orders into account (basically convolutionally counting the number of neighbors). Like the molecule in this example, the atoms on the bridge share the same values for a symmetric skeleton. Therefore, the sorting does not distinguish the atom with/without double bonds...
Then, In the case where the first picked root vertex is on the bridge involving two double bonds (e.g., carbon 7 or 8, see figure below), the smallest ring involving this bridge has 3 double bonds. However, this bridge will be removed after finding the smallest ring, causing the second ring to only have 2 double bonds instead of 3; instead, if the first picked root vertex is on the bridge involving only one double bond (e.g., carbon 1,2,4,5), then both smallest ring will have 3 double bonds, as the algorithm will maximize the total bond order of the ring https://github.com/ReactionMechanismGenerator/RMG-Py/blob/3b0b82fc2907a5618c5753eacd443258105c7876/rmgpy/molecule/molecule.py#L2622-L2632
4. Workaround
In the docstring, get_smallest_set_of_smallest_rings
is recommended in "new" applications. However, a preliminary test shows it still suffers from the same issue.
I've also considered adding more sorting keys when choosing root vertex, e.g., considering bond orders. But couldn't figure out a good way to really avoid determinism...
It is possible to use get_relevant_cycles
, which is deterministic, and choose a set of rings from the relevant cycles. It involves a more complicated rewrite and possibly redesigning the algorithm, and we still need to be cautious when selecting the rings.