
🦠 Model Request: Bioavailability

GemmaTuron opened this issue on Oct 24 '22

Model Title

Bioavailability (TDC dataset)

Publication

Hello @zainab-ik!

As part of your Outreachy contribution, we have assigned you the "Bioavailability" dataset from the Therapeutics Data Commons to try to build a binary classification ML model. Please copy the provided Google Colab template and use this issue to provide updates on your progress. We'll value not only building the model but also interpreting its results.

Code

No response

GemmaTuron · Oct 24 '22

On it, @GemmaTuron. Link to the Colab here.

Zainab-ik · Oct 24 '22

Table of Contents

  • [x] Dataset Overview
      ◦ [ ] What are we trying to predict?

  • [x] Stage 1 - Data Upload
      ◦ [ ] How many data points do we have in total and in each set (train/validation/test)?

  • [x] Stage 2 - Data Analysis
      ◦ [ ] Is this a classification or a regression problem?
      ◦ [ ] How many data points do we have in total and in each set (train/validation/test)?
      ◦ [ ] How many actives and inactives do we have?
      ◦ [ ] Can you show in a plot the distribution of actives/inactives, or the values for the regression?

  • [x] Data Modelling

  • [x] Train Model

  • [x] Evaluate Model
      ◦ [ ] What is the performance of your model? What metrics are you using and why?
      ◦ [ ] How do you interpret the ROC curve you got?
      ◦ [ ] How could we improve the model?

Zainab-ik · Oct 26 '22

Data Overview: Bioavailability dataset from TDC

Bioavailability falls under the absorption of drugs: according to Ma et al., it is the fraction of an administered drug that becomes available at the site of action. The dataset focuses mainly on oral bioavailability. Drugs administered orally are affected by a number of factors, such as drug interactions and the physical and chemical properties of the drug, which can increase or decrease bioavailability. The oral bioavailability of a drug informs its dosing regimen. Once a drug enters the blood circulation, several factors determine the fraction that reaches the site of action, including solubility, lipophilicity and first-pass metabolism. These factors make bioavailability challenging to predict.

What are we trying to predict? Every drug has some bioavailability, even if as little as 0.1% reaches the site of action. A drug's effect therefore depends on whether its bioavailability is high or low. We are predicting whether the bioavailability of a drug is high or low.

Zainab-ik · Oct 26 '22

Stage 1 - Data Upload

The **TDC Bioavailability** dataset contains 640 drugs. The dataset was downloaded from TDC after installing the package and then divided into a 3-way split (train, validation and test) to evaluate the model effectively and prevent overfitting. By default, the split proportions are 70% train, 10% validation and 20% test.

How many data points do we have in total and in each set (train/validation/test)? Based on the 3-way split:

  • Total: 640 drugs
  • Train: 448 drugs
  • Validation: 64 drugs
  • Test: 128 drugs
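A minimal sketch of this data-upload step, assuming the PyTDC package (`pip install PyTDC`); the split sizes match the counts above:

```python
# Load the TDC Bioavailability dataset and apply the default 70/10/20 split.
from tdc.single_pred import ADME

data = ADME(name='Bioavailability_Ma')        # 640 drugs with a binary label Y
split = data.get_split(frac=[0.7, 0.1, 0.2])  # dict of pandas DataFrames
print({name: len(df) for name, df in split.items()})
# expected: {'train': 448, 'valid': 64, 'test': 128}
```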

Zainab-ik · Oct 26 '22

Stage 2 - Data Analysis

The data analysis of the Bioavailability dataset gave us a deeper insight into the data and the kind of results to expect.

Is this a classification or a regression problem? The dataset has 2 distinct classes, 0 and 1:

  • A label of 0 denotes an inactive drug (low bioavailability)
  • A label of 1 denotes an active drug (high bioavailability)

There are 492 active drugs (label 1, high bioavailability) and 148 inactive drugs (label 0, low bioavailability). Therefore, we have a binary classification problem.

Deductions: From the above, it can be deduced that the target variable is fairly imbalanced, biased towards active drugs with high bioavailability. The vast majority of the drugs are in the high-bioavailability class, with roughly a 77:23 percentage distribution (492 active to 148 inactive).

Data split

How many data points do we have in total and in each set (train/validation/test)? How many actives and inactives do we have?

  • Train: There are 448 total drugs with 351 active drugs and 97 inactive drugs.
  • Validation: There are 64 total drugs with 51 active drugs and 13 inactive drugs.
  • Test: There are 128 total drugs with 90 active drugs and 38 inactive drugs.

For further analysis, the class distributions in each split were visualized.

Data Visualization

The Matplotlib Python package was used to plot bar graphs of the class distribution in each of the 3 splits (a sketch of this plotting step follows the list below).

  • Distribution of active and inactive drugs in the train set [bar chart]

  • Distribution of active and inactive drugs in the validation set [bar chart]

  • Distribution of active and inactive drugs in the test set [bar chart]
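A minimal sketch of this plotting step, assuming the `split` dictionary of DataFrames from the data-upload sketch (label column `Y`, where 1 = active):

```python
# Bar charts of active/inactive counts for the three splits.
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
for ax, name in zip(axes, ['train', 'valid', 'test']):
    counts = split[name]['Y'].value_counts()
    ax.bar(['active (1)', 'inactive (0)'], [counts.get(1, 0), counts.get(0, 0)])
    ax.set_title(f'{name} set')
axes[0].set_ylabel('number of drugs')
plt.tight_layout()
plt.show()
```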

As seen in the plots, and as deduced above, the data is biased towards active drugs over inactive ones.

Data Visualization 2

A graphical representation of the active and inactive drugs was generated using the RDKit Python package, an open-source toolkit for cheminformatics.

  • Active drugs [molecule grid]

  • Inactive drugs [molecule grid]
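A minimal sketch of this step, drawing a small grid of molecules from each class with RDKit (variable names assume the TDC split DataFrames used above):

```python
# Draw a grid of example active molecules from the training split.
from rdkit import Chem
from rdkit.Chem import Draw

train = split['train']
actives = train[train['Y'] == 1]['Drug'].head(6)   # SMILES of six active drugs
mols = [Chem.MolFromSmiles(smi) for smi in actives]
Draw.MolsToGridImage(mols, molsPerRow=3, subImgSize=(200, 200))
# repeat with train['Y'] == 0 to draw the inactive drugs
```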

Zainab-ik · Oct 28 '22

Data Modelling & Model Training

This stage involves preparing the data used to train, validate and test our model. Our data is made up of 3 variables:

  • The drug ID
  • The drug
  • Y, which denotes a bioavailability label of either 1 or 0

Our drugs come in SMILES format (e.g. the CCN+(C)c1cccc(O)c1 representation).


The lazy-qsar library is used to convert the canonical SMILES into Morgan fingerprints, a numeric representation the computer understands. Because the library handles this conversion internally, it simplifies training the binary classifier. The model is then trained on the training set of active and inactive drugs.
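A minimal sketch of this training step. The import path and constructor parameters below are assumptions inferred from this thread (lazy-qsar wraps an AutoML trainer with a training-time budget and a choice of estimator); check the lazy-qsar README for the exact API:

```python
# Train a Morgan-fingerprint binary classifier on the training split.
from lazyqsar import MorganBinaryClassifier  # assumed import path

smiles_train = split['train']['Drug'].tolist()  # SMILES strings
y_train = split['train']['Y'].tolist()          # binary labels

model = MorganBinaryClassifier(
    time_budget_sec=600,     # assumed parameter: training-time budget
    estimator_list=['rf'],   # assumed parameter: Random Forest, the initial algorithm
)
model.fit(smiles_train, y_train)
```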

Zainab-ik · Oct 28 '22

Model Evaluation (METRIC 1)

The binary classifier was evaluated based on the following metrics:

  • AUROC value & ROC curve: the AUC represents the degree of separability and tells how well the model can distinguish between classes. The ROC curve plots the true positive rate against the false positive rate at different probability thresholds.

It is important to note that this evaluation is heavily dependent on the dataset, and an imbalanced dataset creates a biased model. Our dataset's majority class is 1 (active drugs), so the model tends to be biased towards that class. The evaluation was done for both the validation set and the test set. An AUC value leaning towards 1.0 is good, values around 0.6 or below are poor, and 0.5 means the model is no better than random.

Evaluating Validation set

  • AUC & ROC Curve; AUROC 0.5822021116138762 image

Explanation: The AUC value is very low, which means the model has little class-separation capacity; this is due to the highly imbalanced dataset. When the AUC is approximately 0.5, the model has no capacity to discriminate between the positive and negative classes.

Evaluating Test set

  • AUC & ROC Curve; - AUROC 0.6314327485380117 image

Explanation: The AUC value, though higher than that of the validation set, is still low; the model's discrimination ability remains poor.

It is important to note that, in pharmacology, the most reliable measure of a drug's bioavailability is the AUC of the plasma concentration-time curve (Manual of Clinical Pharmacology), whereas our model performed poorly at distinguishing the two classes.
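A minimal sketch of this evaluation step, assuming the trained model exposes scikit-learn-style probability scores (`predict_proba` is an assumption; adapt to the actual lazy-qsar API):

```python
# AUROC and ROC curve on the validation split.
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

y_valid = split['valid']['Y'].tolist()
scores = model.predict_proba(split['valid']['Drug'].tolist())[:, 1]  # assumed method

print('AUROC:', roc_auc_score(y_valid, scores))
fpr, tpr, _ = roc_curve(y_valid, scores)
plt.plot(fpr, tpr, label='model')
plt.plot([0, 1], [0, 1], '--', label='random (AUC = 0.5)')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
```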

Zainab-ik · Oct 28 '22

Model Evaluation (METRIC 2)

  • Contingency table: this is also known as the confusion matrix. The table is used to assess the errors the model makes while predicting.

The Confusion Matrix created has four different quadrants:

  • True Negative (TN-Top-Left Quadrant); An outcome where the model correctly predicts the negative class
  • False Positive (FP-Top-Right Quadrant); An outcome where the model incorrectly predicts the positive class
  • False Negative (FN-Bottom-Left Quadrant); An outcome where the model incorrectly predicts the negative class.
  • True Positive (TP-Bottom-Right Quadrant); A true positive is an outcome where the model correctly predicts the positive class

True means the value was predicted accurately; False means there was a wrong prediction.

Evaluating the Validation set

  • TP = 51: the model correctly predicts 51 drugs as having high bioavailability
  • TN = 1: just one drug was correctly predicted as having low bioavailability
  • FP = 12: 12 drugs were misclassified as high bioavailability when they actually have low bioavailability
  • FN = 0: no drug was misclassified as a low-bioavailability drug

The confusion matrix: [confusion matrix plot]

Explanation: 52 drugs out of 64 were predicted correctly, while 12 were predicted wrongly.

Evaluating the Test set

  • TP = 89: the model correctly classifies 89 drugs as having high bioavailability
  • TN = 3: just 3 drugs were correctly classified as having low bioavailability
  • FP = 35: 35 drugs were misclassified as high bioavailability when they actually have low bioavailability
  • FN = 1: 1 drug was misclassified as a low-bioavailability drug

The confusion matrix: [confusion matrix plot]

Explanation: 92 drugs out of 128 were predicted correctly, while 36 were predicted wrongly.

Conclusion: From the confusion matrices, the model is good at predicting drugs of high bioavailability but unable to predict drugs of low bioavailability, which it misclassifies as high instead. This happens because low-bioavailability drugs are the minority class, with very few examples. The implication of a false positive is serious: a drug classified as high bioavailability is expected to take effect at the prescribed dosage, so a wrong prediction translates to the drug not taking effect in the body. Note: results from these predictions should be compared with the data split for better understanding.
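A minimal sketch of this step, assuming hard labels obtained by thresholding the scores from the ROC sketch at 0.5:

```python
# Confusion matrix on the validation split.
import numpy as np
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_pred = (np.asarray(scores) > 0.5).astype(int)
cm = confusion_matrix(y_valid, y_pred)  # layout: [[TN, FP], [FN, TP]], matching the quadrants above
print(cm)
ConfusionMatrixDisplay(cm, display_labels=['inactive (0)', 'active (1)']).plot()
```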

Zainab-ik · Oct 28 '22

Hi @Zainab-ik !

Thanks for your work! I'd say the model hasn't had enough training time, since the AUC is not great; you can try to see what happens with longer training times.

GemmaTuron · Oct 28 '22

Model Evaluation (METRIC 3)

  • Precision: quantifies the number of correct positive predictions made, i.e. what proportion of positive identifications was actually correct. It is calculated as precision = TP / (TP + FP). Note: if a model has a precision of 0.5, then when it predicts a drug as a high-bioavailability drug, it is correct 50% of the time.

Evaluating Validation set

Our precision value on this set is 0.8095238095238095, which translates to about 81%. In other words, our model is correct 81% of the time it classifies a drug as having high bioavailability.

Evaluating Test set

Our precision value on this set is 0.717741935483871, which translates to about 72%. In other words, our model is correct 72% of the time it classifies a drug as having high bioavailability.

A precision close to 1 (100%) is good, indicating that positive predictions are correct. It should be noted that our model is fairly good at classifying a drug as a high-bioavailability drug, largely because that is the majority class.
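A minimal sketch of the precision computation, reusing the assumed `y_valid` and `y_pred` from the sketches above:

```python
# Precision = TP / (TP + FP).
from sklearn.metrics import precision_score

print('precision:', precision_score(y_valid, y_pred))  # ~0.81 on the validation split
```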

Zainab-ik · Oct 28 '22

> Hi @Zainab-ik !
>
> Thanks for your work! I'd say the model hasn't had enough training time, since the AUC is not great; you can try to see what happens with longer training times.

Thanks @GemmaTuron. What do you say about the imbalanced dataset? Do you think I should handle it?

  1. Should I try resampling?
  2. Should I consider an F1 Score for evaluation?

Zainab-ik · Oct 28 '22

Hi @Zainab-ik !

You could try resampling, but I think for the purposes of the contribution this is already enough. So, when you are ready, close the issue, link it to your Outreachy contribution, and start preparing the final application.

GemmaTuron · Oct 28 '22

> Hi @Zainab-ik !
>
> Thanks for your work! I'd say the model hasn't had enough training time, since the AUC is not great; you can try to see what happens with longer training times.

Model Improvement

The initial learning algorithm of our model is Random Forest, which returned a low AUROC value. To improve the model, the XGBoost supervised learning algorithm was adopted and the model was trained for 10 minutes.

Results

  • Validation set: AUROC 0.6289592760180995 [ROC curve]

This denotes an improved model compared to the initial value.

  • Test set: AUROC 0.7084795321637427 [ROC curve]

This also shows an improvement in the model's classification ability.

Training the Random Forest classifier for 10 minutes also improves the model.

Suggestions to improve the model

  • Train for a longer period; my deduction is that the longer you train the model, the better it gets at distinguishing classes
  • Adopt other learning algorithms (see the sketch below)
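A minimal sketch of this improvement step, with the same assumed lazy-qsar parameters as in the training sketch above:

```python
# Retrain with a 10-minute budget and XGBoost instead of Random Forest.
model = MorganBinaryClassifier(
    time_budget_sec=600,          # assumed parameter name
    estimator_list=['xgboost'],   # assumed parameter name
)
model.fit(smiles_train, y_train)
```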

Zainab-ik · Oct 28 '22

@GemmaTuron, I'd like to ask: I realized that some models trained on an imbalanced target variable tend to have a high AUROC value. Could it be that the value is not largely dependent on the dataset, or is it affected by other factors? Also, what are the best metrics for evaluating the model, considering the high disparity in the dataset?

Zainab-ik · Oct 28 '22

> Hi @Zainab-ik !
>
> Thanks for your work! I'd say the model hasn't had enough training time, since the AUC is not great; you can try to see what happens with longer training times.

@GemmaTuron, thanks! Kindly review my corrections; I'd like to hear your feedback.

Zainab-ik · Oct 28 '22

> @GemmaTuron, I'd like to ask: I realized that some models trained on an imbalanced target variable tend to have a high AUROC value. Could it be that the value is not largely dependent on the dataset, or is it affected by other factors? Also, what are the best metrics for evaluating the model, considering the high disparity in the dataset?

Hi @Zainab-ik!

You can check and compare your results with a precision-recall curve for that.

Femme-js · Oct 29 '22

In general, when we are predicting probabilities with the classifier (as in the case of the Morgan binary classifier) and we need class labels, there are two intuitions: if the positive class is more important, then we choose PR AUC (area under the precision-recall curve), and if both classes are equally important, then ROC AUC (area under the ROC curve).

Femme-js · Oct 29 '22
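A minimal sketch of the suggested precision-recall comparison, reusing the assumed `y_valid` and `scores` from the evaluation sketches above:

```python
# Precision-recall curve and PR AUC on the validation split.
from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

precision, recall, _ = precision_recall_curve(y_valid, scores)
print('PR AUC (average precision):', average_precision_score(y_valid, scores))
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()
```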

@Zainab-ik, why the need to attach a training time? Isn't that limiting the model?

EstherIdabor · Oct 29 '22

@Zainab-ik I read that F1 score is a good metric to measure the performance of a model on an imbalanced dataset, you can try it out.

EstherIdabor · Oct 29 '22
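A minimal sketch of the suggested F1 evaluation, reusing the assumed `y_valid` and `y_pred` from the earlier sketches:

```python
# F1 score: harmonic mean of precision and recall.
from sklearn.metrics import f1_score

print('F1:', f1_score(y_valid, y_pred))
```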

> @Zainab-ik, why the need to attach a training time? Isn't that limiting the model?

Hi Zainab, with your comment here, I decided to increase the training time of my model and it improved significantly. Thank you!

EstherIdabor · Oct 29 '22

> In general, when we are predicting probabilities with the classifier (as in the case of the Morgan binary classifier) and we need class labels, there are two intuitions: if the positive class is more important, then we choose PR AUC (area under the precision-recall curve), and if both classes are equally important, then ROC AUC (area under the ROC curve).

Both classes are very important, and predicting drugs of low bioavailability as high has serious implications: it directly translates to a lower effective dose of the drug, which won't produce the desired effect.

Zainab-ik · Oct 29 '22

> @Zainab-ik, why the need to attach a training time? Isn't that limiting the model?

From my understanding, I don't think so; it gives the model a longer time to train in order to make better predictions, and you can also specify the learning algorithm you want your model to learn from.

Zainab-ik · Oct 29 '22

> @Zainab-ik I read that F1 score is a good metric to measure the performance of a model on an imbalanced dataset, you can try it out.

Thank you @EstherIdabor, I'll do so.

Zainab-ik · Oct 29 '22

> @Zainab-ik, why the need to attach a training time? Isn't that limiting the model?
>
> Hi Zainab, with your comment here, I decided to increase the training time of my model and it improved significantly. Thank you!

@EstherIdabor, that's great! How long did you train, and did you change your learning algorithm?

Zainab-ik · Oct 29 '22

> @Zainab-ik, why the need to attach a training time? Isn't that limiting the model?
>
> Hi Zainab, with your comment here, I decided to increase the training time of my model and it improved significantly. Thank you!
>
> @EstherIdabor, that's great! How long did you train, and did you change your learning algorithm?

1800 seconds, that's 30 minutes; the learning algorithm was left at the default.

EstherIdabor · Oct 29 '22

Hi @Zainab-ik !

Great job and thanks for the discussion @EstherIdabor and @Femme-js ! I'll go ahead and close this issue so you can focus on the final application.

GemmaTuron · Oct 31 '22