
USPTO processing pipeline: "remove_unsanitizable" implies "trim_rxn_smiles" called before

Open • Academich opened this issue 1 year ago • 1 comment

I have a task that requires USPTO with only sanitizable molecules, but with the CXSMILES information retained. However, if I keep CXSMILES, the "remove_unsanitizable" pipeline step tries to sanitize the products together with the CXSMILES extension and naturally fails, which results in 700k reactions being invalidated. It would be nice if the product SMILES passed to RDKit never contained the CXSMILES extension, even when CXSMILES has not been removed from the data.
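To illustrate the behaviour I am asking for, here is a minimal sketch in plain Python. It assumes the CXSMILES extension is whatever follows the first whitespace in the reaction string; `split_cxsmiles` and `product_smiles` are hypothetical helper names, not functions from reaction_utils.

```python
from typing import Tuple


def split_cxsmiles(rxn_smiles: str) -> Tuple[str, str]:
    """Split a reaction string into (plain reaction SMILES, CXSMILES extension).

    Assumption: the extension, if present, is separated from the SMILES
    by whitespace, e.g. "A.B>>C |f:0.1|".
    """
    parts = rxn_smiles.strip().split(maxsplit=1)
    if not parts:
        return "", ""
    core = parts[0]
    extension = parts[1] if len(parts) > 1 else ""
    return core, extension


def product_smiles(rxn_smiles: str) -> str:
    """Return only the product side, without any CXSMILES extension."""
    core, _ = split_cxsmiles(rxn_smiles)
    # A reaction SMILES has the form reactants>agents>products.
    return core.split(">")[2]


rxn = "[Na+].[OH-].CCOC(C)=O>>CC(=O)[O-].[Na+] |f:0.1,3.4|"
core, extension = split_cxsmiles(rxn)
print(product_smiles(rxn))  # product side only, safe to hand to a sanitizer
print(extension)            # extension kept aside so it can be re-attached
```

With something like this, the sanitization check could run on the bare product SMILES while the extension is stored separately and written back afterwards.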

Academich • Nov 01 '23 14:11

Thanks for your feedback. This is on our to-do list. It would naturally be better to parse the CXSMILES correctly, which would entail placing level-1 parentheses around the SMILES fragments that should be considered as one molecule.
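As a rough sketch of that idea (not how reaction_utils implements or will implement it), the fragment-grouping field of the extension could be used to wrap grouped fragments in parentheses. This assumes the `f:` field lists groups separated by commas, with dot-separated fragment indices counted from 0 across the whole reaction string; `group_fragments` is a hypothetical name.

```python
import re
from typing import Dict, List


def group_fragments(rxn_smiles: str) -> str:
    """Wrap fragments grouped by the CXSMILES 'f:' field in parentheses.

    Assumption: fragment indices run from 0 over all dot-separated
    fragments in reactants, agents and products, in that order.
    """
    core, _, extension = rxn_smiles.strip().partition(" ")
    match = re.search(r"f:((?:\d+(?:\.\d+)*,?)+)", extension)
    groups = []
    if match:
        for group in re.findall(r"\d+(?:\.\d+)*", match.group(1)):
            groups.append([int(idx) for idx in group.split(".")])

    # Map every grouped fragment index to the first index of its group.
    leader: Dict[int, int] = {}
    for group in groups:
        for idx in group:
            leader[idx] = group[0]

    counter = 0
    roles_out = []
    for role in core.split(">"):
        fragments = role.split(".") if role else []
        members: Dict[int, List[str]] = {}
        for fragment in fragments:
            members.setdefault(leader.get(counter, counter), []).append(fragment)
            counter += 1
        pieces = [
            "(" + ".".join(frags) + ")" if len(frags) > 1 else frags[0]
            for frags in members.values()
        ]
        roles_out.append(".".join(pieces))
    return ">".join(roles_out)


print(group_fragments("[Na+].[OH-].CCOC(C)=O>>CC(=O)[O-].[Na+] |f:0.1,3.4|"))
```

Grouped fragments that are not adjacent in the input are moved next to the first member of their group, and a group that spans two sides of the reaction is only grouped within each side.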

SGenheden • Nov 02 '23 06:11