
PCQM4Mv2 invalid SMILES

Open rballeba opened this issue 11 months ago • 0 comments

Dear OGB team,

I have found that the SMILES string 'O[Si]123O[Si]3(O1)(O2)O' (position 51128) in the lsc-pcqm4mv2 dataset is invalid according to recent versions of RDKit (specifically, version 2024.09.5). The error can be reproduced as follows:

from ogb.lsc import PygPCQM4Mv2Dataset
from ogb.utils import smiles2graph

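def debug_smiles2graph(smiles_string):
    # Wrap smiles2graph so the offending SMILES string is printed before failing.
    try:
        return smiles2graph(smiles_string)
    except Exception as e:
        print(f"Exception occurred in smiles: {smiles_string}")
        print(e)
        raise  # re-raise with the original traceback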

mol_ds = PygPCQM4Mv2Dataset(root='../data/pcqm4mv2_invariants', smiles2graph=debug_smiles2graph)

Running this snippet produces the following output:

Processing...
Converting SMILES strings into graphs...
  1%|▏         | 50930/3746620 [00:18<21:40, 2842.28it/s][11:26:20] Explicit valence for atom # 1 Si, 5, is greater than permitted
  1%|▏         | 51128/3746620 [00:18<22:08, 2781.28it/s]
Exception occurred in smiles: O[Si]123O[Si]3(O1)(O2)O
'NoneType' object has no attribute 'GetAtoms'

If we now try to convert this SMILES string using RDKit directly, without the ogb package:

from rdkit import Chem
problematic_smiles = 'O[Si]123O[Si]3(O1)(O2)O'
mol_generated = Chem.MolFromSmiles(problematic_smiles)
print(mol_generated is None)

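we get True as output, meaning that RDKit returns None and treats the SMILES string as invalid. This matches the "Explicit valence for atom # 1 Si, 5, is greater than permitted" message in the log above.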
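To check whether other entries are affected, something like the following could be used. This is only a minimal sketch: `find_invalid_smiles` and `rdkit_parse` are hypothetical helper names, not part of ogb or RDKit, and the only RDKit call used is `Chem.MolFromSmiles`, which returns None for strings it cannot parse.

```python
from typing import Callable, Iterable, List, Tuple


def rdkit_parse(smiles: str):
    """Parse a SMILES string with RDKit; returns None if RDKit rejects it."""
    # Imported lazily so the scanning helper below has no hard RDKit dependency.
    from rdkit import Chem
    return Chem.MolFromSmiles(smiles)


def find_invalid_smiles(
    smiles_iter: Iterable[str],
    parse: Callable[[str], object] = rdkit_parse,
) -> List[Tuple[int, str]]:
    """Return (position, smiles) pairs for every string that `parse` maps to None.

    `parse` is a hypothetical hook: by default it is the RDKit wrapper above,
    but any callable that returns None for invalid input can be passed in.
    """
    invalid = []
    for idx, smi in enumerate(smiles_iter):
        if parse(smi) is None:
            invalid.append((idx, smi))
    return invalid
```

Calling `find_invalid_smiles` on the dataset's SMILES column (however it is loaded locally) would list the positions that need attention. Based on the behaviour reported above, the string 'O[Si]123O[Si]3(O1)(O2)O' would be expected to appear in that list with RDKit 2024.09.5.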

rballeba, Mar 05 '25 10:03