qca-dataset-submission icon indicating copy to clipboard operation
qca-dataset-submission copied to clipboard

How to extract the molecules in sdf format?

Open UnixJunkie opened this issue 2 years ago • 1 comments
trafficstars

Hello,

For example, from: https://github.com/openforcefield/qca-dataset-submission/blob/master/submissions/2021-07-28-OpenFF-Industry-Benchmark-Season-1-MM-v1.1/dataset.json.bz2

Is there a script to extract the 3D conformers?

Thanks, F.

UnixJunkie avatar Nov 10 '23 02:11 UnixJunkie

Hi,

There have been lot of changes on the QCArchive side as well as openff-qcsubmit recently and I haven't caught up to all of them. If you are looking to download the industry benchmark set there are quite a few ways,

  1. Zenodo: Here is the benchmark set from the paper, https://zenodo.org/records/7596498.
  2. QCSubmit: There is an issue with units I encountered, otherwise this example would generally work and is the preferred way to retrieve large datasets.
  3. Directly from the QCA server using QCPortal, the user guide is available here: Create a conda environment with
mamba create -n qcf-next -c jaimergp/label/unsupported-cudatoolkit-shim -c conda-forge qcfractal openff-toolkit
conda activate qcf-next

Then, the code to retrieve results and write to an sdf file,

from qcportal import PortalClient
from pint import Unit
from openff.toolkit.topology import Molecule
from tqdm import tqdm

# Connecting to the server
client = PortalClient("https://api.qcarchive.molssi.org/")
print(client.server_name)

# requesting dataset
ds = client.get_dataset("optimization", "OpenFF Industry Benchmark Season 1 v1.1")

print(ds.specification_names)

for i, (e_name, s_name, record) in tqdm(enumerate(ds.iterate_records())):
  if s_name == 'default': # the specification name 'default' corresponds to QM theory level B3LYP-D3BJ/DZVP
      print(e_name, s_name, record.id, record.status)
      
      # make sure that there is no change in connectivity after geometry optimization
      # there are cases of proton transfer which would mess up the molecule if the initial connectivity is assumed
      # also, avoid cases of optimization failures
      if record.initial_molecule.connectivity == record.final_molecule.connectivity and record.status == 'complete':
          print(record.final_molecule.symbols)
          print(record.final_molecule.geometry)
          print(record.final_molecule.connectivity)

         # the above information is enough to write your own SDF file, optionally you can also use openff toolkit to do the job

          # using mapped smiles to build the molecule and adding the final geometry from the optimization, and writing to an SDF file
         
          mol = Molecule.from_mapped_smiles(record.final_molecule.dict()['extras']['canonical_isomeric_explicit_hydrogen_mapped_smiles'])
          mol.add_conformer(record.final_molecule.dict()['geometry'] * Unit('bohr'))
          # if you want to add some info like the energy or any other thing in SD Data you can use the properties field of openff molecule          
          SD_Data_dict = {'B3LYP/D3BJ/DZVP Energy in au': record.energies[-1], 'SD Tag': 'some value'}
          mol._properties = SD_Data_dict
          mol.to_file(f'mol_{i}.sdf', file_format='sdf')

      if i > 0: # as a test, you can remove this to write the whole dataset
          break

The industry benchmark set contains ~ 70K molecules and this is not the optimal way to write 70K SDF files and concatenating them, but this will work in the meantime. Others may chip in.

pavankum avatar Nov 10 '23 07:11 pavankum