QCFractal OptimizationDataset Deduplicate Optimized Geometry

Is your feature request related to a problem? Please describe. When we build a geometry optimization dataset, it is possible that multiple conformers for the same molecule could end up with the "same" optimized geometry. It might be useful to have a feature that allows checking this duplication before pulling all the optimized geometry. This will be very helpful when the dataset contain many conformers for the same molecule, and if the tasks following such geometry optimization is expensive, (such as Hessian calculation) or the task is sensitive to duplications (such as force field fitting).

Describe the solution you'd like

A simple method for deduplication is to cluster the cartesian coordinates of all conformers of the same molecule. If the RMSD is smaller than a threshold, a duplicate is detected.
The simple method assumes that the atom indices are consistent for all conformers. However, during geometry optimization, it is possible to have atoms switch indices. In this case a more advanced alignment + index assignment method might be helpful, if we need to make sure absolutely no duplicated molecule is kept.

Jun 21 '19 20:06 yudongqiu

The QCElemental alignment tech can handle this (including atoms switching indices). See here.

I can work on adding this to the optimizationdataset tech.

Jun 24 '19 13:06 dgasmith

I think before we do this, we need some sort of tech to support conformations of a single "molecule" within the database. Right now we have the line:

ds.add_entry("HOOH1", HOOH1) # HOOH1 is a Molecule object
ds.add_entry("HOOH2", HOOH2)

to add new entries where we have hydrogen peroxide 1 and 2 which are different conformations of the same molecule. What should this interface look like? A few examples

ds.add_entry("HOOH", HOOH1, conformer=True) # or perhaps 1?
ds.add_entry("HOOH", HOOH2, conformer=True) # or perhaps 2?

ds.add_entry("HOOH1", HOOH1, parent="HOOH")
ds.add_entry("HOOH2", HOOH2, parent="HOOH")

If the first is chosen, how do we represent all unique computations as we typically show the DataFrame by entry name. This is something that would be neat to put in as we could have a line like:

ds.show_conformers("HOOH")

which would try to align all conformers and then overlay them together. This could also show the final molecules as well to get a visual of how the molecules optimized. Happy to do this, but some discussion on the API would be useful.

@ChayaSt @j-wags

Jun 28 '19 20:06 dgasmith

I think it's a good idea. This adds one additional layer, so the flat molecules->optimization become molecules->conformers->optimization. After we have the groups of conformers belong to the same molecule, do we execute the deduplication for each molecule like this? Looping over each conformer belongs to the molecule:

Get optimized geometries
Run alignment to all existing unique final geometries.
If an alignment is found that RMSD is below a threshold, this optimized geometry of this conformer is skipped.
If the no such alignment is found, add this optimized geometry to the set of "unique final geometries".

Jul 02 '19 19:07 yudongqiu

As I just realized, in the current conformer generation pipeline, we take a SMILES and generate protonation states and tautomers first, these all become "molecules"; then generate conformers for each "molecule". The geometry optimization may rearrange atoms, therefore there might still be duplications between molecules.

If we keep using the current flat structure, we can aggregate all conformers with the same chemical conformation to be "molecules", then run deduplication within each molecule.

Jul 06 '19 17:07 yudongqiu

When you say the optimization may re-arrange atoms, can you describe that in more detail? The order of atoms should not change.

Jul 08 '19 13:07 dgasmith

Sorry for the confusion, "re-arrange atoms" means the the bonds may break and form during the geometry optimization, so the original connectivity record is not valid anymore. However, when a "symmetric" break and form happens, the molecule may still be the same as before, but effectively the atom indices changed.

One example is when we optimize HCOOH, the two hydrogen might switch position, and the molecule is still the same.

Another example is if we consider two optimizations, one starts from "H-O-H", and the other starts from "H-O.......H". Both optimization will end with the same HOH molecule, even if the second optimization starts from a different "molecule".

Jul 08 '19 22:07 yudongqiu

Here is what I think could work for the current OptimizationDataset. The function could be called OptimizationDataset.get_unique_final_molecules(), and it runs through the following steps.

Go over all entries in the OptimizationDataset, put final molecules into groups based on their chemical formula.
In each group of molecules that has the same chemical formula, build a new list of "unique molecules" based on the 3d coordinates. Each molecule will be compared to all existing unique molecules in the group. For example, the first molecule will be the first unique molecule, then for the second molecule, we check if it's the same as the first, if not we add it as the second unique molecule. Then the third molecule will be compared to both the first and second unique molecules, etc.
After all groups have their list of unique molecules, we can flatten them into a list of molecule objects and return. (It might be better if we return a dictionary of {entry_id: Molecule}, so the attributes in the entry can be looked up)

In this function, to check if a molecule is the "same" as the other one, we will need to align the molecule against the other, which means we will try to re-arrange the atom indices and also rotation-translate the 3D coordinates to get a best match. After the alignment, if the RSSD of the 3D coordinates is lower than a threshold like 0.01 A, then we can say the two molecules are the "same". We should allow the user to specify this threshold.

Jul 19 '19 20:07 yudongqiu