openff-qcsubmit Not all records being found for Hessian Dataset when converting from Optimization Dataset

I have an optimization dataset, and a corresponding single point dataset with Hessians for all the final molecules. Typically we fetch the hessian records using dataset.to_basic_result_collection(driver="hessian"), but this is not fetching all the Hessian records, even though they should exist.

Reproducing example:

from qcportal import PortalClient
from openff.qcsubmit.results import OptimizationResultCollection
from openff.qcsubmit.results import BasicResultCollection

client = PortalClient(address="https://api.qcarchive.molssi.org:443/")

opt_set = OptimizationResultCollection.from_server(client = client, datasets = ['OpenFF Sulfur Optimization Training Coverage Supplement v1.0'],spec_name='default')

opt_to_hess_set = opt_set.to_basic_result_collection(driver="hessian")

print(opt_to_hess_set.n_results) # 260

print(opt_set.n_results) # 899

hess_set = BasicResultCollection.from_server(client=client, datasets=['OpenFF Sulfur Hessian Training Coverage Supplement v1.0'],spec_name='default')

print(hess_set.n_results) # 899

# Check whether all the Basic Results actually have Hessians
hess_rec = hess_set.to_records()

n_hess = 0
for record, molecule in hess_rec:
    if len(record.return_result)>1:
        n_hess += 1
    

print(n_hess) # 899

n_hess = 0
for record, molecule in hess_rec:
    if record.specification.driver == 'hessian':
        n_hess += 1
    

print(n_hess) # 899

Expected behavior:

test_opt = OptimizationResultCollection.from_server(client=client,datasets = ['OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy'],spec_name='default')

test_opt_to_hess = test_opt.to_basic_result_collection(driver="hessian")

print(test_opt.n_results) # 197

test_opt_to_hess.n_results # 197

Oct 02 '24 19:10 amcisaac

cc @ntBre

Oct 02 '24 19:10 amcisaac

I can reproduce this too. I think this is likely a problem with the dataset itself or its representation on QCArchive rather than a bug in qcsubmit. Extracting some of the code from OptimizationResultCollection.to_basic_result_collection shows that the server only has 260 Hessian records for the final molecule IDs from the optimization dataset:

from openff.qcsubmit.results import (
    BasicResultCollection,
    OptimizationResultCollection,
)
from openff.qcsubmit.utils import _CachedPortalClient, portal_client_manager

client = _CachedPortalClient("https://api.qcarchive.molssi.org:443", ".")
opt_name = "OpenFF Sulfur Optimization Training Coverage Supplement v1.0"
hess_name = "OpenFF Sulfur Hessian Training Coverage Supplement v1.0"

with portal_client_manager(lambda _: client):
    opt = OptimizationResultCollection.from_server(client, [opt_name])

    mol_ids = [rec.final_molecule_id for rec, _mol in opt.to_records()]
    print(len(mol_ids)) # => 899
    sp_records = list(client.query_singlepoints(
        molecule_id=mol_ids, driver="hessian"
    ))
    print(len(sp_records)) # => 260

The big open question to me is whether or not using OptimizationResultCollection.create_basic_dataset (which I didn't know about before today) would have fixed this issue from the initial submission, where I effectively did a manual implementation of that method.

As far as I can tell, create_basic_dataset only propagates the extras and specifications.keywords fields from the underlying optimization record, so I don't think it would have helped. In fact, diffing the dataset in qca-dataset-submission with the dataset created with this code:

    bds = opt.create_basic_dataset("tmp", "descdesc", "tagtagtag", driver="hessian")
    bds.export_dataset("/tmp/bds.json")

shows exactly that: additional keywords are present but otherwise the molecule entries are unchanged. If the different keywords caused the mismatch, I would expect none of the records here to match up instead of some fraction of them. So I think the root cause is QCArchive failing to identify these molecules as corresponding to the same parent records.

Oct 02 '24 20:10 ntBre

I've gone through and checked that each optimization record has a corresponding hessian record with matching SMILES string and RMSD (all-atom, no symmetry check) < 0.00001 A. So it shouldn't be an issue with incorrect molecules.

I did notice that not all of the records are in the same order between the two datasets, but only 79 are out of order and they are randomly spread throughout the dataset so I don't think it's like, processing the first 260 that are in order and then having issues due to an order change.

Oct 02 '24 23:10 amcisaac

Update: at @lilyminium 's suggestion I looked at the differences in the absolute coordinates, it looks like requiring a 1e-9 cutoff for the difference between coordinates in the two datasets leads to 639 molecules coming up as "not the same" e.g.:

np.any(np.abs(hess_geom - opt_geom).magnitude > 0.000000001)

I don't know what QCA uses as their criterion, but I'd guess this is probably related to QCA's criterion for considering molecules "the same"

Oct 03 '24 00:10 amcisaac

Assuming QCA uses the qcelemental.models.Molecule class to represent molecules, their __eq__ implementation compares the hashes of the molecules, which includes a hashed version of the geometry rounded to 8 decimal places. At first I thought that didn't align with the 1e-9 number you found, but now I think I've convinced myself that rounding to 8 places is the same as comparing 9 places, at least when one of them rounds up. That could also explain why only some records are affected. The coordinates are likely close in the 9th decimal place, but sometimes rounding will cause them to differ in the 8th place. So really it would be better if QCA used your formula above instead of this hashing approach.

a = np.array([4.5e-9])
b = np.array([5.1e-9])
a - b  # => array([-6.e-10])
np.any(np.abs(a - b) > 1e-9) # => np.False_ indicating equality
np.around(a, 8) == np.around(b, 8) # => array([False]), indicating inequality

This is a slightly confusing example since both of the last two lines print false, but the first test is false for the difference being greater than 1e-9, making the arrays equal, while the second test is the one done by QCElemental showing two molecules differing by 6e-10 to be different (not equal).

Oct 03 '24 18:10 ntBre

Yeah that makes a lot of sense, if one rounds up and one rounds down, two numbers that differ in the 9th decimal place wouldn't pass that check (as you demonstrated). When I printed out the coordinate differences, most were on the order of 1e-9, so that makes sense. I think this is definitely an issue with QCA

Oct 04 '24 01:10 amcisaac