qca-dataset-submission
Add several high priority datasets for benchmarking
We need several additional datasets for benchmarking/testing. @jchodera has volunteered to prep these this weekend, so this issue collects everything in one place, listed in the order of priority I would assign:
1. Pfizer set. 100 challenging fragments from Pfizer for torsion drives. #50
2. Genentech set. Optimization dataset as provided, filtering out the largest molecules first. Then an optimization dataset and a torsion drive dataset after fragmentation. #48
3. DrugBank FDA drugs. DrugBank discussed here would be a good set; I'd focus on the FDA-approved small-molecule drugs, throw out everything big and everything very small, then fragment for optimization and torsion drives. Probably also remove anything with pentavalent carbon for good measure. Problem: I don't have a DrugBank account yet, and it seems to take two business days for one to be approved.
4. Informative set. Optimization dataset of 1117 informative fragments. Discussed in issue #46. (The larger set includes 9000 compounds, which could be fragmented and used for torsion drives.)
I'm checking into some options for the DrugBank set (3), so I might have updates. Or not.
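For reference, the DrugBank filtering proposed in (3) could be sketched roughly as below. This is only an illustration, not the actual submission script: the `Molecule` record, the helper names, and both size thresholds are hypothetical (the issue only says "everything big and everything very small"), and in practice the atom counts and valences would come from a cheminformatics toolkit rather than hand-written tuples.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical cutoffs: the issue does not specify what counts as
# "big" or "very small", so these numbers are placeholders.
MIN_HEAVY_ATOMS = 5
MAX_HEAVY_ATOMS = 40


@dataclass
class Molecule:
    """Hypothetical minimal record for one molecule."""
    name: str
    # One (element symbol, total bond order) pair per atom, hydrogens included.
    atoms: List[Tuple[str, int]]


def heavy_atom_count(mol: Molecule) -> int:
    """Count all atoms that are not hydrogen."""
    return sum(1 for element, _ in mol.atoms if element != "H")


def has_overvalent_carbon(mol: Molecule) -> bool:
    """True if any carbon has a total bond order above 4 (e.g. pentavalent)."""
    return any(element == "C" and valence > 4 for element, valence in mol.atoms)


def keep(mol: Molecule,
         min_heavy: int = MIN_HEAVY_ATOMS,
         max_heavy: int = MAX_HEAVY_ATOMS) -> bool:
    """Keep molecules inside the size window that have no overvalent carbon."""
    n_heavy = heavy_atom_count(mol)
    return min_heavy <= n_heavy <= max_heavy and not has_overvalent_carbon(mol)


def filter_molecules(mols: List[Molecule]) -> List[Molecule]:
    """Apply the size and valence filters, preserving input order."""
    return [mol for mol in mols if keep(mol)]
```

Fragmentation for the optimization and torsion drive datasets would then run on whatever survives this filter.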
- [x] Pfizer 100 fragment discrepancy set OptimizationDataset (100 molecules): #55
- [x] Pfizer 100 fragment discrepancy set TorsionDriveDataset (227 torsion drives): #56
- [ ] Genentech Ligand Expo subset (648 molecules before fragmentation) OptimizationDataset and TorsionDriveDataset: #48
- [x] Informative set from Jordan Ehrman (1117 fragments) #47
- [x] DrugBank FDA drugs OptimizationDataset (939 unique molecules, 6559 conformers): #57