ersilia 🦠 Model Request: TDC Skin Reaction

Model Title

Skin Reaction (TDC dataset)

Publication

Hello @alaminumar!

As part of your Outreachy contribution, we have assigned you the dataset "Skin Reaction" from the Therapeutics Data Commons to try and build a binary classification ML model. Please copy the provided Google Colab template and use this issue to provide updates on the progress. We'll value not only being able to build the model but also interpreting its results.

Code

No response

Oct 24 '22 09:10 GemmaTuron

Thanks Gemma.

Oct 24 '22 11:10 alaminumar

Sorry for the lateness.

Skin Reaction Dataset overview: I'm working on the Skin Reaction dataset. Exposure to chemical agents can induce an immune reaction in susceptible individuals that leads to skin sensitization. Given a drug's SMILES string, can we predict whether it causes a skin reaction (1) or not (0)? The dataset contains 404 drugs.

Importing Dataset: I have successfully installed the TDC package and imported the Skin Reaction dataset from the Toxicity group of TDC's single-instance prediction datasets.

Oct 26 '22 09:10 alaminumar

Splitting Datasets: Successfully split the data into three datasets.

Train Dataset: The dataset used to train our model. It contains 283 drugs, 196 active and 87 inactive.
Validation Dataset: The dataset used to evaluate, optimize and improve our model. It contains 40 drugs, 29 active and 11 inactive.
Test Dataset: The dataset used to test our model after training and validation. It contains 81 drugs, 49 active and 32 inactive.
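A minimal sketch of the loading and splitting step, assuming the PyTDC loader `Tox(name="Skin Reaction")` and its `get_split()` method with default settings. The thread does not say which split method or seed was used, so those defaults are an assumption, and `load_skin_reaction_splits` and `class_balance` are hypothetical helper names. The second half only re-checks the counts reported above.

```python
def load_skin_reaction_splits():
    """Download the dataset with PyTDC and return its train/valid/test splits.

    Assumes PyTDC is installed (pip install PyTDC) and network access is
    available. Default split settings are an assumption, not confirmed here.
    """
    from tdc.single_pred import Tox  # imported lazily so the rest runs without PyTDC

    data = Tox(name="Skin Reaction")
    return data.get_split()  # dict with "train", "valid" and "test" DataFrames


def class_balance(active, inactive):
    """Return (total drugs, fraction of actives) for one split."""
    total = active + inactive
    return total, active / total


# (active, inactive) counts per split, as reported in this thread.
reported = {"train": (196, 87), "valid": (29, 11), "test": (49, 32)}

summary = {name: class_balance(a, i) for name, (a, i) in reported.items()}
for name, (total, frac) in summary.items():
    print(f"{name}: {total} drugs, {frac:.1%} active")
```

Actives outnumber inactives in every split, which is worth keeping in mind when reading precision and recall later on.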

Oct 26 '22 10:10 alaminumar

Data Visualization: Used matplotlib to visualize the number of actives (1) and inactives (0) in our dataset. As the image shows, the labels take only two values, so this is a binary classification problem. (image: matplotlibimage)

Using RDKit we can visualize the molecular structure of our SMILES. Successfully imported and drew an active and an inactive molecule, respectively. (image: rdkit)
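The two plots can be sketched roughly as below, assuming matplotlib and RDKit are installed. The function names, output paths, and example labels are hypothetical placeholders, not code or data taken from the Colab.

```python
from collections import Counter


def count_labels(labels):
    """Count how many inactives (0) and actives (1) a label list contains."""
    counts = Counter(labels)
    return {0: counts.get(0, 0), 1: counts.get(1, 0)}


def plot_label_counts(labels, path="label_counts.png"):
    """Save a bar chart of inactive vs. active counts (needs matplotlib)."""
    import matplotlib.pyplot as plt

    counts = count_labels(labels)
    fig, ax = plt.subplots()
    ax.bar(["inactive (0)", "active (1)"], [counts[0], counts[1]])
    ax.set_ylabel("number of drugs")
    fig.savefig(path)
    plt.close(fig)


def draw_molecule(smiles, path):
    """Render one SMILES string to an image file (needs RDKit)."""
    from rdkit import Chem
    from rdkit.Chem import Draw

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None when the SMILES cannot be parsed
        raise ValueError(f"could not parse SMILES: {smiles}")
    Draw.MolToFile(mol, path)


# Hypothetical labels, only to show the counting step.
example_labels = [1, 1, 0, 1, 0]
print(count_labels(example_labels))
```

In the Colab, `labels` would be the `Y` column of a split and `smiles` a value from its `Drug` column, assuming the standard TDC column names.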

Oct 27 '22 09:10 alaminumar

@GemmaTuron can you review what I have done? Here is my Colab.

Oct 27 '22 13:10 alaminumar

Hi @alaminumar !

Good start, but can you provide an explanation of the model performances?

Oct 28 '22 12:10 GemmaTuron

Okay Gemma. First let me explain how we have gotten our models.

**Model Training:** We train our model by passing the drug SMILES as input (X) and the bioactivity label as the output (Y) it learns to predict. We use the Lazy-QSAR package and its MorganBinaryClassifier for training, so we don't need to convert SMILES into signatures ourselves, as that is done automatically.

**Evaluate Model:** To evaluate our model, we use the following:

Precision & Recall: Precision is the ratio between the positives our model correctly predicted and all the positives it predicted, correct or otherwise: TP/(TP + FP). Recall is the share of actual positives we were able to identify: TP/(TP + FN).
AUROC value
AUC graph
Confusion (Contingency) Matrix
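To make the metrics above concrete, here is a small dependency-free sketch. The labels and scores are made-up toy values, not results from this model, and the 0.5 threshold is an assumption. The AUROC is computed as the fraction of active/inactive pairs in which the active gets the higher score, which is equivalent to the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = active."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); recall = TP/(TP+FN). Empty denominators give 0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def auroc(y_true, scores):
    """Chance that a random active outscores a random inactive (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one active and one inactive")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, for illustration only.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(tp, fp, fn)
print(tp, fp, fn, tn, precision, recall, auroc(y_true, scores))
```

Precision and recall depend on the chosen threshold, while AUROC only depends on how the scores rank the molecules, so the two can tell different stories about the same model.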

To answer your question, Gemma: my model's performance in the first iteration was average to poor, so I decided to double the training time to 3600 seconds. The first iteration had AUROC values of 0.61128 and 0.7708 on the validation and test sets, respectively. As we can see, that's not very good. Here are the corresponding graphs and data for the second iteration.

Validation: Precision 0.7368421052631579, Recall 0.9655172413793104

Contingency Matrix: as we can see from the confusion matrix, we have 38/41 accurately predicted. This is very good.
ROC Curve: AUROC value 0.6332288401253919
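One way to cross-check these figures is to back out the confusion-matrix counts from the reported precision and recall. This sketch assumes the numbers refer to the 40-drug validation split (29 active, 11 inactive) described earlier in the thread; `counts_from_metrics` is a hypothetical helper, and the counts it returns follow only from those inputs, not from the matrix image itself.

```python
def counts_from_metrics(precision, recall, n_active, n_inactive):
    """Recover confusion-matrix counts from precision, recall and class sizes."""
    tp = round(recall * n_active)             # recall = TP / actual actives
    predicted_active = round(tp / precision)  # precision = TP / predicted actives
    fp = predicted_active - tp
    fn = n_active - tp
    tn = n_inactive - fp
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


# Precision and recall as reported above; class sizes assumed from the
# validation split described earlier in the thread.
counts = counts_from_metrics(
    precision=0.7368421052631579,
    recall=0.9655172413793104,
    n_active=29,
    n_inactive=11,
)
accuracy = (counts["tp"] + counts["tn"]) / (29 + 11)
print(counts, accuracy)
```

Under those assumptions the reported precision and recall correspond to 28 true positives among 38 predicted actives, with 1 of the 11 inactives predicted correctly. Comparing that against the matrix image is a useful sanity check before interpreting the AUROC.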