qca-dataset-submission
Potential dataset: Drugbank all (13K molecules)
The DrugBank Open Data datasets are available here, and contain ~13K molecules that mostly cover approved drugs.
The DrugBank Open Data datasets are public domain datasets that can be used freely in your application or project (including commercial use). They are released under a Creative Commons CC0 International License. To the extent possible under law, the person who associated CC0 with the DrugBank Open Data has waived all copyright and related or neighboring rights to the DrugBank Open Data. This work is published from: Canada.
Information on how to retrieve "Drugbank all" is available here.
You'd want to filter in various ways first: there's a pretty significant amount of chemistry we don't cover, as well as bad chemistry, etc. The main thing would probably be to make sure we remove metals and metalloids, as well as particularly small (<3 heavy atoms) and particularly large (>100 heavy atoms) molecules. See, e.g., the README.md for MiniDrugBank: https://github.com/openforcefield/MiniDrugBank
(I suppose you could run all of it, but some of it won't be useful for parameterization at present, e.g. oxaliplatin is likely not on our short list.)
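To make the filter above concrete, here is a minimal sketch of the element and size checks. It works on a plain list of element symbols per molecule so it doesn't depend on any particular toolkit; in practice the symbols would come from whatever cheminformatics toolkit reads the SDF. The heavy-atom limits are the ones mentioned above, but the `ALLOWED_ELEMENTS` allow-list and the function names are my own assumptions for illustration, not something taken from MiniDrugBank.

```python
from typing import Iterable

# Assumption: an allow-list of elements, used here as a simple stand-in for
# "remove metals and metalloids". The exact set is hypothetical.
ALLOWED_ELEMENTS = frozenset({"H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"})

# Size limits quoted in the thread: drop <3 and >100 heavy atoms.
MIN_HEAVY_ATOMS = 3
MAX_HEAVY_ATOMS = 100


def count_heavy_atoms(symbols: Iterable[str]) -> int:
    """Count atoms that are not hydrogen."""
    return sum(1 for symbol in symbols if symbol != "H")


def passes_filter(symbols: Iterable[str]) -> bool:
    """Return True if a molecule passes the element and size checks."""
    symbols = list(symbols)
    # Reject anything containing an element outside the allow-list.
    if any(symbol not in ALLOWED_ELEMENTS for symbol in symbols):
        return False
    # Reject particularly small or particularly large molecules.
    n_heavy = count_heavy_atoms(symbols)
    return MIN_HEAVY_ATOMS <= n_heavy <= MAX_HEAVY_ATOMS


if __name__ == "__main__":
    ethanol = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
    platinum_complex = ["Pt", "N", "N", "O", "O", "C", "C"]
    print(passes_filter(ethanol))           # kept
    print(passes_filter(platinum_complex))  # dropped: contains Pt
```

An allow-list is easier to audit than a list of every metal and metalloid, but it will also drop anything unusual that isn't listed (e.g. boron or silicon), so the set would need to be agreed on.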
Agreed!
One important question: If we wanted to standardize our filtering tools, where should those live? @ChayaSt has some tools in https://github.com/openforcefield/fragmenter already, so perhaps we could also add other filters (such as number of atoms, MW, and "can SMIRNOFF type it?") to a module of fragmenter as well? Or should these go into the openforcefield toolkit?
Actually, those tools are not in fragmenter but in a notebook here: https://github.com/choderalab/fragmenter_data/blob/bond_order/combinatorial_fragmentation/filter/filter_full_drugbank.ipynb (cell 6)
It would be great if all of the filters can live in one place. I'm not sure if fragmenter is the best place for this module - it probably makes more sense to put it into the openforcefield toolkit.
See https://github.com/openforcefield/openforcefield/issues/376
Note that the filtering tools Caitlin wrote for MiniDrugBank are here: https://github.com/openforcefield/MiniDrugBank/blob/master/minidrugbank/pickMolecules.ipynb. We could easily just use the filtering she did in the first cell to get a reasonable first pass, then prune down further by size.
For reference, DrugBank Release Version 5.1.4, obtained from here (CC0 license), is here: drugbank_all_open_structures.sdf.zip
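For a quick look at the unzipped file without any toolkit installed, something like the sketch below can split it into records and pull out element symbols. It assumes the records are V2000 molfiles (atom count in the first three columns of the fourth line, element symbol in columns 32-34 of each atom line, records terminated by `$$$$`); V3000 records are not handled, and I haven't checked which format this release uses. The function names are made up for illustration. A real pipeline would probably use a proper cheminformatics toolkit instead.

```python
from typing import Iterator, List


def iter_sdf_records(text: str) -> Iterator[str]:
    """Yield each record of an SDF file; records end with a '$$$$' line."""
    record_lines: List[str] = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if record_lines:
                yield "\n".join(record_lines)
            record_lines = []
        else:
            record_lines.append(line)


def atom_symbols(record: str) -> List[str]:
    """Return element symbols from the atom block of a V2000 record."""
    lines = record.split("\n")
    # Line 4 is the counts line; its first three columns hold the atom count.
    n_atoms = int(lines[3][0:3])
    # Atom lines follow; x, y, z take 10 columns each, then a space, then
    # a three-column element symbol.
    return [lines[4 + i][31:34].strip() for i in range(n_atoms)]


def read_all_symbols(path: str) -> List[List[str]]:
    """Read an unzipped SDF file and return one symbol list per record."""
    with open(path, encoding="utf-8") as handle:
        return [atom_symbols(rec) for rec in iter_sdf_records(handle.read())]
```

The symbol lists this produces are enough for simple element and heavy-atom-count checks, though nothing here validates the chemistry itself.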