Potential dataset: Drugbank all (13K molecules)

Open jchodera opened this issue 6 years ago • 7 comments

The DrugBank Open Data datasets are available here, and contain ~13K molecules that mostly cover approved drugs.

The DrugBank Open Data datasets are public domain datasets that can be used freely in your application or project (including commercial use). It is released under a Creative Common’s CC0 International License. To the extent possible under law, the person who associated CC0 with the DrugBank Open Data has waived all copyright and related or neighboring rights to the DrugBank Open Data. This work is published from: Canada.

Information on how to retrieve "Drugbank all" is available here.

jchodera avatar Jul 05 '19 18:07 jchodera

You'd want to filter in various ways first -- there's a pretty significant amount of stuff we don't cover as well as bad chemistry, etc. Main issues would probably be to make sure we remove metals and metalloids as well as particularly small (<3 heavy atoms) and particularly large (>100 heavy atoms) molecules, e.g. see README.md for minidrugbank. https://github.com/openforcefield/MiniDrugBank
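For illustration, the size and element filters described above could be sketched roughly as follows. This is not the MiniDrugBank code: the function names are hypothetical, molecules are represented simply as lists of element symbols, and the `NONMETALS` set is an assumption (anything outside it is treated as a metal or metalloid).

```python
# Hypothetical sketch; names and the element classification are assumptions.
# Anything not listed here is treated as a metal or metalloid and rejected.
NONMETALS = frozenset({
    "H", "C", "N", "O", "F", "P", "S", "Cl", "Se", "Br", "I",
    "He", "Ne", "Ar", "Kr", "Xe", "Rn",
})


def heavy_atom_count(symbols):
    """Count atoms that are not hydrogen."""
    return sum(1 for s in symbols if s != "H")


def passes_filters(symbols, min_heavy=3, max_heavy=100):
    """Return True if a molecule has no metals/metalloids and an acceptable size.

    `symbols` is a list of element symbols, one per atom. The default bounds
    follow the thresholds mentioned in this thread: drop molecules with
    fewer than 3 or more than 100 heavy atoms.
    """
    if any(s not in NONMETALS for s in symbols):
        return False
    return min_heavy <= heavy_atom_count(symbols) <= max_heavy


if __name__ == "__main__":
    ethanol = ["C", "C", "O"] + ["H"] * 6
    platinum_complex = ["Pt", "N", "N", "O", "O", "C", "C"]
    print(passes_filters(ethanol))
    print(passes_filters(platinum_complex))
```

In practice the element symbols would come from whatever toolkit reads the structures, and the bounds are easy to tighten later.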

davidlmobley avatar Jul 05 '19 18:07 davidlmobley

(I suppose you could run all of it, but some of it won't be useful for parameterization at present, e.g. oxaliplatin is likely not on our short list.)

davidlmobley avatar Jul 05 '19 18:07 davidlmobley

Agreed!

One important question: If we wanted to standardize our filtering tools, where should those live? @ChayaSt has some tools in https://github.com/openforcefield/fragmenter already, so perhaps we could also add other filters (such as number of atoms, MW, and "can SMIRNOFF type it?") to a module of fragmenter as well? Or should these go into the openforcefield toolkit?

jchodera avatar Jul 05 '19 18:07 jchodera

> @ChayaSt has some tools in https://github.com/openforcefield/fragmenter already, so perhaps we could also add other filters (such as number of atoms, MW, and "can SMIRNOFF type it?") to a module of fragmenter as well? Or should these go into the openforcefield toolkit?

Actually, those tools are not in fragmenter but in a notebook here: https://github.com/choderalab/fragmenter_data/blob/bond_order/combinatorial_fragmentation/filter/filter_full_drugbank.ipynb (cell 6). It would be great if all of the filters could live in one place. I'm not sure fragmenter is the best place for this module; it probably makes more sense to put it into the openforcefield toolkit.

ChayaSt avatar Jul 05 '19 19:07 ChayaSt

See https://github.com/openforcefield/openforcefield/issues/376

jchodera avatar Jul 05 '19 19:07 jchodera

Note that the filtering tools Caitlin wrote for MiniDrugBank are here: https://github.com/openforcefield/MiniDrugBank/blob/master/minidrugbank/pickMolecules.ipynb. We could easily use the filtering she did in the first cell to get a reasonable first pass, then prune down further by size.

davidlmobley avatar Sep 07 '19 04:09 davidlmobley

For reference, the DrugBank Release Version 5.1.4 obtained from here (CC0 license) is here: drugbank_all_open_structures.sdf.zip
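As a rough way to sanity-check a file like this without a cheminformatics toolkit, the records can be split on the SDF `$$$$` delimiter and the element symbols read from each atom block. The sketch below assumes the records are V2000 molblocks (not verified for this release) and uses hypothetical function names; a real toolkit reader would be more robust.

```python
def iter_sdf_records(lines):
    """Yield each SDF record as a list of lines, without the '$$$$' delimiter."""
    record = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.strip() == "$$$$":
            yield record
            record = []
        else:
            record.append(line)
    # Tolerate a final record that lacks the closing delimiter.
    if any(entry.strip() for entry in record):
        yield record


def v2000_symbols(record):
    """Return the element symbols from a V2000 molblock.

    Assumes the fixed-column V2000 layout: three header lines, a counts line
    whose first three characters hold the atom count, then one line per atom
    with the element symbol in columns 32-34.
    """
    counts = record[3]
    if "V2000" not in counts:
        raise ValueError("only V2000 molblocks are handled in this sketch")
    n_atoms = int(counts[0:3])
    return [record[4 + i][31:34].strip() for i in range(n_atoms)]
```

The unzipped SDF could be passed in as an open text file, for example `iter_sdf_records(open(path))`, where `path` is wherever the file was extracted; the resulting symbol lists are enough to count molecules and apply simple size or element filters.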

jchodera avatar Sep 08 '19 00:09 jchodera