TDC
Dataset Description: Drug Target Commons (DTC) is a crowd-sourcing platform that aims to improve the consensus on and use of drug-target interaction data. End users can search, view, and download bioactivity data using various compound, target, and publication identifiers. Expert users may also suggest edits, upload new bioactivity data, and participate in assay annotation and data curation.

Task Description: Regression. Given a target amino acid sequence and a compound SMILES string, predict their binding affinity.

Dataset Statistics (# of DTI pairs / # of drugs / # of proteins):
EC50: 54,339 / 42,307 / 580
Kd: 22,227 / 7,867 / 978
IC50: 406,422 / 279,566 / 1,937
Ki: 240,895 / 112,376 / 1,197
Dataset Split: Random Split, Cold Drug Split, Cold Protein Split.
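A cold split holds out whole entities rather than individual pairs: in a cold drug split no test drug appears in training, and likewise for proteins in a cold protein split. The following is a minimal standard-library sketch of that idea, not TDC's implementation. The function name `cold_split`, the record layout, and the toy data are all hypothetical.

```python
import random

def cold_split(pairs, key, frac_test=0.2, seed=0):
    """Split records so that no `key` entity (e.g. a drug) is shared
    between the train and test sets. Illustrative helper, not a TDC API."""
    # Collect the unique entities and shuffle them reproducibly.
    entities = sorted({p[key] for p in pairs})
    rng = random.Random(seed)
    rng.shuffle(entities)
    # Hold out a fraction of the entities, not a fraction of the pairs.
    n_test = max(1, round(frac_test * len(entities)))
    held_out = set(entities[:n_test])
    train = [p for p in pairs if p[key] not in held_out]
    test = [p for p in pairs if p[key] in held_out]
    return train, test

# Toy records; the SMILES, sequences, and affinities are made up.
toy_pairs = [
    {"drug": "CCO", "target": "MKVL", "y": 120.0},
    {"drug": "CCO", "target": "MSTA", "y": 45.0},
    {"drug": "CCN", "target": "MKVL", "y": 900.0},
    {"drug": "c1ccccc1", "target": "MSTA", "y": 15.0},
    {"drug": "CC(=O)O", "target": "MKVL", "y": 3000.0},
]

cold_train, cold_test = cold_split(toy_pairs, key="drug", frac_test=0.25)
```

Passing `key="target"` instead gives the cold protein variant of the same sketch.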
Note: Drug Target Commons is a collection of many assays. Since different assays report different affinity measures, TDC provides them as separate datasets, one each for Kd, IC50, Ki, and EC50. All values are in nM.
Tips: Transforming the labels to log scale (pIC50, pKi, pKd) usually leads to more stable training. TDC's data functions include a helper for this transformation. Check out the data processing page for binarization, label distribution visualization, and edge list/DGL/PyTorch graph transformation.
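For reference, the log-scale transformation itself is simple arithmetic. Since the labels are in nM, the negative base-10 log of the molar value is 9 - log10(value in nM). The sketch below uses hypothetical helper names (`nm_to_p`, `p_to_nm`) and only the standard library. TDC's own helper is, as far as we know, `data.convert_to_log(form = 'binding')`, but check the TDC documentation before relying on that.

```python
import math

def nm_to_p(value_nm):
    """Convert an affinity in nM (Kd, Ki, IC50, EC50) to its p-scale value,
    i.e. -log10 of the molar concentration. Illustrative helper, not a TDC API."""
    if value_nm <= 0:
        raise ValueError("affinity must be a positive number in nM")
    # 1 nM = 1e-9 M, so -log10(value_nm * 1e-9) = 9 - log10(value_nm).
    return 9.0 - math.log10(value_nm)

def p_to_nm(p_value):
    """Inverse transformation: p-scale value back to nM."""
    return 10 ** (9.0 - p_value)

# Made-up affinities in nM spanning several orders of magnitude.
raw_nm = [1.0, 10.0, 1000.0, 25000.0]
p_values = [nm_to_p(v) for v in raw_nm]
```

Larger p-values mean stronger binding, and the transformed labels span a few units instead of several orders of magnitude.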
from tdc.multi_pred import DTI
data = DTI(name = 'dtc_kd')
# data = DTI(name = 'dtc_ki')
# data = DTI(name = 'dtc_ic50')
# data = DTI(name = 'dtc_ec50')
# get_split() returns a dict with 'train', 'valid', and 'test' DataFrames
split = data.get_split()
Note: Many DTI pairs share the same drug and target sequence but have different binding affinities because they come from different experimental assays. To harmonize them, you can use the function below to keep either the maximum affinity or the mean for each duplicated pair:
from tdc.multi_pred import DTI
data = DTI(name = 'dtc_kd')
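To make the harmonization idea concrete, here is a standard-library sketch that groups duplicated (drug, target) pairs and reduces each group to a single value. It is not TDC's implementation: the `harmonize` function, its `mode` names, and the toy records are hypothetical. It also assumes that "maximum affinity" means the strongest binding, which is the lowest value in nM.

```python
from collections import defaultdict
from statistics import mean

def harmonize(records, mode="mean"):
    """Collapse duplicated (drug, target) pairs into one affinity value.
    Illustrative helper, not a TDC API."""
    grouped = defaultdict(list)
    for drug, target, y_nm in records:
        grouped[(drug, target)].append(y_nm)
    if mode == "mean":
        aggregate = mean
    elif mode == "max_affinity":
        # Assumption: strongest binding corresponds to the smallest nM value.
        aggregate = min
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return {pair: aggregate(values) for pair, values in grouped.items()}

# Toy records (drug SMILES, target sequence, Kd in nM); values are made up.
toy_records = [
    ("CCO", "MKVL", 100.0),
    ("CCO", "MKVL", 300.0),  # same pair measured in a different assay
    ("CCN", "MKVL", 50.0),
]

by_mean = harmonize(toy_records, mode="mean")
by_max_affinity = harmonize(toy_records, mode="max_affinity")
```

Taking the mean smooths over assay-to-assay variation, while keeping the strongest reported affinity favors the most potent measurement. Which is appropriate depends on the downstream task.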