ANI1_dataset
Potential SMILES/coordinate discrepancy
Hello,
I was trying to convert the ANI-1 dataset to Parquet format and ran into a potential mismatch between the coordinates and the SMILES string of at least one molecule (around 4k conformers).
I wrote a short sample script to isolate the first case I ran into (Python 2.7.6 interpreter):
import os
import h5py
from pybel import readstring
import json
import numpy as np
import pandas as pd

ani_path = '.../ani'
shard3 = os.path.join(ani_path, 'ani_gdb_s03.h5')

with h5py.File(shard3, 'r') as f:
    # Look at a single molecule group in shard 3
    data_dict = f['gdb11_s03/gdb11_s03-11']
    coords = data_dict['coordinates']
    elements = data_dict['species']
    energies = data_dict['energies']
    # The SMILES string is stored as an array of characters
    smi = ''.join(data_dict['smiles'])
    mol = readstring('smi', smi)
    # pymol_to_json is a helper that is not defined in this snippet
    jmol = json.loads(pymol_to_json(mol))
    # Compare the atom count from the SMILES string with the stored species
    if len(jmol['atoms']) != len(elements[:]):
        print "shard: ", shard3
        print "\nmolecule: gdb11_s03/gdb11_s03-11"
        print "\nsmile: ", smi
        print "\nspecies:", elements[:]
        print "\npybel mol:", jmol
        print "\ncoordinates: ", coords.shape
with sample output:
shard: .../ani_gdb_s03.h5
molecule: gdb11_s03/gdb11_s03-11
smile: [H]C([H])=NN([H])[H]
species ['O' 'C' 'O' 'H' 'H']
pybel mol {u'atoms': [[1, 0], [6, 0], [1, 0], [7, 0], [7, 0], [1, 0], [1, 0]], u'bonds': [[1, 2, 1], [2, 3, 1], [2, 4, 2], [4, 5, 1], [5, 6, 1], [5, 7, 1]]}
coordinates: (4320, 5, 3)
So the SMILES string describes 7 atoms (1 C, 2 N, 4 H), while the species array and the coordinates only contain 5 atoms (1 C, 2 O, 2 H).
Aside from the pymol_to_json helper, which is not defined in the snippet, only the file path should need to be edited back in for this to run. I also wrote a different parser than the example code because I was having trouble getting the iteration to perform consistently, so maybe I introduced an unintended error there.
I will filter my Parquet files for similar mismatches and go ahead without them for now. If I have made an obvious mistake, or if this has already been identified, I'd still appreciate feedback.
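For reference, here is a rough sketch of what such a mismatch check could look like. It is only an illustration under a few assumptions: the function names are made up for this example, the atom list has the same shape as the 'atoms' entry in the sample output above (the second value of each pair is ignored, since its meaning is not stated here), and only H, C, N and O occur, as in ANI-1. It compares element counts rather than just the number of atoms, so it would also catch cases where the counts happen to agree but the elements differ.

```python
from collections import Counter

# ANI-1 only contains H, C, N and O, so a four-entry table is enough here.
ATOMIC_SYMBOLS = {1: "H", 6: "C", 7: "N", 8: "O"}


def composition_from_json_atoms(atoms):
    """Count elements in a list of [atomic_number, other] pairs.

    The second value of each pair is ignored (assumption: it is not
    needed to identify the element).
    """
    return Counter(ATOMIC_SYMBOLS[z] for z, _ in atoms)


def is_mismatch(json_atoms, species):
    """Return True if the SMILES-derived atoms disagree with the stored species."""
    return composition_from_json_atoms(json_atoms) != Counter(species)


# Values copied from the sample output above.
smiles_atoms = [[1, 0], [6, 0], [1, 0], [7, 0], [7, 0], [1, 0], [1, 0]]
stored_species = ["O", "C", "O", "H", "H"]

mismatch = is_mismatch(smiles_atoms, stored_species)
print(mismatch)  # prints True
```

Applied per molecule group, a check like this could be used to flag groups to drop before writing the Parquet files.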
Thanks!