ersilia
🦠Model Request: TDC Skin Reaction
Model Title
Skin Reaction (TDC dataset)
Publication
Hello @alaminumar!
As part of your Outreachy contribution, we have assigned you the dataset "Skin Reaction" from the Therapeutics Data Commons to try and build a binary classification ML model. Please copy the provided Google Colab template and use this issue to provide updates on the progress. We'll value not only being able to build the model but also interpreting its results.
Code
No response
Thanks Gemma.
Sorry for the lateness.
Skin Reaction Dataset overview: I'm working on the Skin Reaction dataset. Exposure to chemical agents can induce an immune reaction in susceptible individuals that leads to skin sensitization. Given a drug's SMILES string, the task is to predict whether it causes a skin reaction (1) or not (0). The dataset contains 404 drugs.
Importing Dataset: I have successfully installed the TDC package and imported the Skin Reaction dataset from its Toxicity group of single-instance prediction datasets.
Splitting Datasets: I successfully split the dataset into three subsets.
- Train Dataset: the data used to train the model. It contains 283 drugs, 196 active and 87 inactive.
- Validation Dataset: the data used to evaluate, optimize and improve the model. It contains 40 drugs, 29 active and 11 inactive.
- Test Dataset: the data used to test the model after training and validation. It contains 81 drugs, 49 active and 32 inactive.
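The loading and splitting steps above can be sketched roughly as follows. This is a minimal sketch, assuming the PyTDC package (`pip install PyTDC`) and its default `get_split()` settings; the thread does not state which split settings were used. The names `load_skin_reaction_splits`, `label_counts` and `REPORTED_COUNTS` are illustrative and do not come from the notebook.

```python
from collections import Counter


def load_skin_reaction_splits():
    """Download Skin Reaction with PyTDC and return its train/valid/test split.

    Assumption: the default settings of `get_split()` are used; the thread
    does not state the split method, seed or fractions.
    """
    from tdc.single_pred import Tox  # imported lazily; needs `pip install PyTDC`

    data = Tox(name="Skin Reaction")
    return data.get_split()  # dict of DataFrames keyed "train", "valid", "test"


def label_counts(labels):
    """Count actives (1) and inactives (0) in an iterable of labels."""
    counts = Counter(int(y) for y in labels)
    return {
        "total": sum(counts.values()),
        "active": counts[1],
        "inactive": counts[0],
    }


# (active, inactive) counts per split, as reported in this thread.
REPORTED_COUNTS = {"train": (196, 87), "valid": (29, 11), "test": (49, 32)}

if __name__ == "__main__":
    for name, (active, inactive) in REPORTED_COUNTS.items():
        total = active + inactive
        print(f"{name}: {total} drugs, {active / total:.1%} active")
```

With the real data, `label_counts(split["train"]["Y"])` should reproduce the per-split counts, since TDC stores the label in the `Y` column. The reported counts show that actives outnumber inactives in every split, which is worth keeping in mind when reading precision and recall later.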
Data Visualization
I used matplotlib to visualize the number of actives (1) and inactives (0) in the dataset. As the image shows, this is clearly a binary classification problem.
Using RDKit, we can visualize the molecular structure behind each SMILES string. I successfully imported and drew an active and an inactive molecule, respectively.
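For readers without access to the notebook, the two visualizations could look something like the sketch below. It assumes matplotlib and RDKit are installed; the function names `class_bar_chart` and `draw_smiles` are made up for illustration and are not taken from the Colab.

```python
def class_bar_chart(n_active, n_inactive):
    """Plot the number of actives and inactives as a simple bar chart."""
    import matplotlib.pyplot as plt  # imported lazily; needs matplotlib

    fig, ax = plt.subplots()
    ax.bar(["inactive (0)", "active (1)"], [n_inactive, n_active])
    ax.set_ylabel("Number of drugs")
    ax.set_title("Skin Reaction: class distribution")
    return fig


def draw_smiles(smiles, size=(300, 300)):
    """Return an image of the molecule encoded by a SMILES string."""
    from rdkit import Chem  # imported lazily; needs rdkit
    from rdkit.Chem import Draw

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None when it cannot parse the string
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return Draw.MolToImage(mol, size=size)
```

For example, `class_bar_chart(274, 130)` plots the totals across the three splits reported above (196 + 29 + 49 actives, 87 + 11 + 32 inactives), and `draw_smiles("CCO")` draws ethanol, which is only a placeholder here and not a molecule taken from the dataset.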
@GemmaTuron, can you review what I have done? Here is my Colab.
Hi @alaminumar !
Good start, but can you provide an explanation of the model performances?
Okay Gemma. First, let me explain how we obtained our models.
**Model Training:** We train the model by passing the SMILES strings as input (X) and the known activity labels as output (Y); the trained model then predicts the bioactivity of new molecules. We use the MorganBinaryClassifier from Lazy-QSAR, so we don't need to convert the SMILES into Morgan fingerprints ourselves, as this is done automatically.
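A rough sketch of this training step is shown below. It assumes the `lazyqsar` package exposes `MorganBinaryClassifier` with a `time_budget_sec` keyword and a scikit-learn style `predict_proba`; the keyword name in particular is an assumption and may differ between Lazy-QSAR versions. The wrapper names `train_morgan_classifier` and `predict_active_probability` are hypothetical.

```python
def train_morgan_classifier(smiles_train, y_train, time_budget_sec=3600):
    """Fit Lazy-QSAR's MorganBinaryClassifier on SMILES strings and 0/1 labels.

    Assumption: the constructor accepts `time_budget_sec`; check the
    installed Lazy-QSAR version, as keyword names may differ.
    """
    import lazyqsar as lq  # imported lazily; needs the lazy-qsar package

    model = lq.MorganBinaryClassifier(time_budget_sec=time_budget_sec)
    model.fit(smiles_train, y_train)  # SMILES are featurized internally
    return model


def predict_active_probability(model, smiles):
    """Return the predicted probability of class 1 (active) per SMILES.

    Assumes `predict_proba` returns one column per class, scikit-learn style.
    """
    return model.predict_proba(smiles)[:, 1]
```

The default of 3600 seconds mirrors the training time mentioned for the second iteration in this thread.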
**Evaluate Model:** To evaluate the model, we use the following:
- Precision & Recall. Precision is the ratio of the positives the model predicted correctly to all the positives it predicted, correct or not: TP/(TP + FP). Recall is the share of actual positives the model was able to identify: TP/(TP + FN).
- AUROC value
- AUC graph
- Confusion (contingency) matrix
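The metrics listed above can be computed by hand, as in the following sketch. The labels and scores are hypothetical toy data, the 0.5 decision threshold is an assumption, and the notebook may well use scikit-learn's `precision_score`, `recall_score` and `roc_auc_score` instead of these helper functions.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision(tp, fp):
    """TP / (TP + FP); defined as 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); defined as 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is equivalent to the area
    under the ROC curve.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the Skin Reaction dataset.
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print("precision:", precision(tp, fp))
print("recall:", recall(tp, fn))
print("AUROC:", auroc(y_true, scores))
```

Precision and recall depend on the chosen threshold, while AUROC is computed from the raw scores and summarizes ranking quality across all thresholds.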
To answer your question, Gemma: my model's performance in the first iteration was average to poor, so I decided to double the training time to 3600 seconds. The first iteration had AUROC values of 0.61128 and 0.7708 on the validation and test sets respectively, which is not that good. Here are the corresponding graphs and data for the second iteration.
Validation: Precision 0.7368421052631579, Recall 0.9655172413793104
- Contingency Matrix
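As we can see from the confusion matrix, 28 of the 29 actives are correctly predicted (recall 28/29), which is very good. However, the model labels 38 of the 40 validation molecules as active (precision 28/38), so 10 of the 11 inactives are false positives.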
- ROC Curve
AUROC 0.6332288401253919
Test: Precision 0.6125, Recall 1.0
- Contingency Matrix
- ROC Curve
AUROC 0.7822066326530612
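As a quick consistency check, the confusion matrix entries can be recovered from the reported precision and recall together with the class counts of each split. The sketch below does this arithmetic; the function name `confusion_from_metrics` is illustrative, and the resulting counts are inferred from the numbers in this thread rather than read from the contingency matrix images.

```python
def confusion_from_metrics(n_pos, n_neg, precision, recall):
    """Recover (tp, fp, fn, tn) from class counts plus precision and recall.

    Assumes precision > 0 and that both metrics come from the same predictions.
    """
    tp = round(recall * n_pos)             # recall = tp / n_pos
    predicted_pos = round(tp / precision)  # precision = tp / (tp + fp)
    fp = predicted_pos - tp
    fn = n_pos - tp
    tn = n_neg - fp
    return tp, fp, fn, tn


# Class counts and second-iteration metrics as reported in this thread.
validation = confusion_from_metrics(29, 11, 0.7368421052631579, 0.9655172413793104)
test = confusion_from_metrics(49, 32, 0.6125, 1.0)

for name, (tp, fp, fn, tn) in (("validation", validation), ("test", test)):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    specificity = tn / (tn + fp)  # share of inactives correctly identified
    print(
        f"{name}: TP={tp} FP={fp} FN={fn} TN={tn} "
        f"accuracy={accuracy:.3f} specificity={specificity:.3f}"
    )
```

Under these assumptions the model predicts almost every molecule as active on both sets, which would explain the very high recall alongside the more modest precision and AUROC. It is worth comparing these inferred counts with the actual contingency matrices in the notebook.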
Hi @alaminumar
I hope everything is solved, good job on the modelling. I'll mark this as completed and you can move on to finalising your Outreachy application!