
Some molecules in industry benchmark set have invalid CMILES

Open j-wags opened this issue 2 years ago • 1 comments

cc https://github.com/openforcefield/qca-dataset-submission/pull/207 cc https://github.com/openforcefield/openff-qcsubmit/pull/228 cc https://github.com/openforcefield/openff-toolkit/issues/1696

The pattern [NH+:1] shouldn't be valid in CMILES, since it doesn't identify which map value the attached hydrogen gets. However, a few entries with this problem seem to have snuck into the industry benchmark set.

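from qcportal import FractalClient

# Connect to the default public QCArchive server.
client = FractalClient()

col = client.get_collection(collection_type="OptimizationDataset", name="OpenFF Industry Benchmark Season 1 v1.1")
# Names of entries whose CMILES contains an "NH+" atom, i.e. a nitrogen with an implicit hydrogen.
bad = [name for name, entry in col.data.records.items() if "NH+" in entry.attributes["canonical_isomeric_explicit_hydrogen_mapped_smiles"]]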

An example CMILES with this problem is [F:1][c:2]1[c:3]([H:32])[c:4]([H:33])[c:5]([H:34])[c:6]([F:7])[c:8]1[C:9]1=[N:12][N:13]2[C:14](=[C:15]([H:37])[N:16]=[C:17]2[N:18]([c:19]2[c:20]([H:39])[nH+:21][c:22]([H:40])[c:23]([H:41])[c:24]2[N:25]2[C:26]([H:42])([H:43])[C@:30]([NH+:31]([H:51])[H:52])([H:50])[C:29]([H:48])([H:49])[C:28]([H:46])([H:47])[C:27]2([H:44])[H:45])[H:38])[C:11]([H:36])=[C:10]1[H:35]
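
The substring check above is case-sensitive, so it only catches `NH+` and would miss other implicit-hydrogen atoms such as the aromatic `[nH+:21]` in the same example. Below is a rough sketch of a more general check, assuming a simplified bracket-atom grammar (optional isotope, element symbol, `@`/`@@` chirality, H count, charge, map index). The regex and the function name `implicit_h_atoms` are made up for illustration and are not part of qcportal or the OpenFF toolkit; a real validator should probably go through a cheminformatics toolkit instead.

```python
import re

# Simplified bracket-atom pattern (an assumption, not a full SMILES grammar):
# [isotope? symbol chirality? Hcount? charge? :map]
_BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<symbol>[A-Z][a-z]?|se|as|[bcnops])"
    r"(?P<chiral>@@?)?"
    r"(?P<hcount>H\d*)?"
    r"(?P<charge>[+-]+\d*)?"
    r":(?P<map>\d+)\]"
)


def implicit_h_atoms(cmiles):
    """Return (map_index, atom_token) for mapped atoms carrying an implicit H count."""
    flagged = []
    for match in _BRACKET_ATOM.finditer(cmiles):
        hcount = match.group("hcount")
        if hcount is None:
            # Explicit hydrogens such as [H:32] have symbol "H" and no H count.
            continue
        # A bare "H" means one hydrogen; "H3" means three; "H0" means none.
        count = int(hcount[1:]) if len(hcount) > 1 else 1
        if count > 0:
            flagged.append((int(match.group("map")), match.group(0)))
    return flagged


# Hypothetical fragment in the style of the example above.
fragment = "[c:1]([H:3])[nH+:2][C:4]([H:6])([H:7])[NH+:5]([H:8])[H:9]"
for map_index, token in implicit_h_atoms(fragment):
    print(map_index, token)
```

Any atom this reports has at least one hydrogen with no map index of its own, which is the condition described at the top of this issue.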

j-wags avatar Aug 16 '23 02:08 j-wags

So it isn't lost to the sands of time: I think a dump of this data was produced at some point and modified to fix these CMILES. Instead of pulling down the entire dataset from QCArchive, that JSON is used as the starting point for analysis, etc.

mattwthompson avatar Mar 15 '24 14:03 mattwthompson