qca-dataset-submission
qca-dataset-submission copied to clipboard
How to extract the molecules in sdf format?
Hello,
For example, from: https://github.com/openforcefield/qca-dataset-submission/blob/master/submissions/2021-07-28-OpenFF-Industry-Benchmark-Season-1-MM-v1.1/dataset.json.bz2
Is there a script to extract the 3D conformers?
Thanks, F.
Hi,
There have been lot of changes on the QCArchive side as well as openff-qcsubmit recently and I haven't caught up to all of them. If you are looking to download the industry benchmark set there are quite a few ways,
- Zenodo: Here is the benchmark set from the paper, https://zenodo.org/records/7596498.
- QCSubmit: There is an issue with units I encountered, otherwise this example would generally work and is the preferred way to retrieve large datasets.
- Directly from the QCA server using QCPortal, the user guide is available here: Create a conda environment with
mamba create -n qcf-next -c jaimergp/label/unsupported-cudatoolkit-shim -c conda-forge qcfractal openff-toolkit
conda activate qcf-next
Then, the code to retrieve results and write to an sdf file,
from qcportal import PortalClient
from pint import Unit
from openff.toolkit.topology import Molecule
from tqdm import tqdm
# Connecting to the server
client = PortalClient("https://api.qcarchive.molssi.org/")
print(client.server_name)
# requesting dataset
ds = client.get_dataset("optimization", "OpenFF Industry Benchmark Season 1 v1.1")
print(ds.specification_names)
for i, (e_name, s_name, record) in tqdm(enumerate(ds.iterate_records())):
if s_name == 'default': # the specification name 'default' corresponds to QM theory level B3LYP-D3BJ/DZVP
print(e_name, s_name, record.id, record.status)
# make sure that there is no change in connectivity after geometry optimization
# there are cases of proton transfer which would mess up the molecule if the initial connectivity is assumed
# also, avoid cases of optimization failures
if record.initial_molecule.connectivity == record.final_molecule.connectivity and record.status == 'complete':
print(record.final_molecule.symbols)
print(record.final_molecule.geometry)
print(record.final_molecule.connectivity)
# the above information is enough to write your own SDF file, optionally you can also use openff toolkit to do the job
# using mapped smiles to build the molecule and adding the final geometry from the optimization, and writing to an SDF file
mol = Molecule.from_mapped_smiles(record.final_molecule.dict()['extras']['canonical_isomeric_explicit_hydrogen_mapped_smiles'])
mol.add_conformer(record.final_molecule.dict()['geometry'] * Unit('bohr'))
# if you want to add some info like the energy or any other thing in SD Data you can use the properties field of openff molecule
SD_Data_dict = {'B3LYP/D3BJ/DZVP Energy in au': record.energies[-1], 'SD Tag': 'some value'}
mol._properties = SD_Data_dict
mol.to_file(f'mol_{i}.sdf', file_format='sdf')
if i > 0: # as a test, you can remove this to write the whole dataset
break
The industry benchmark set contains ~ 70K molecules and this is not the optimal way to write 70K SDF files and concatenating them, but this will work in the meantime. Others may chip in.