negative values in grampa.csv file

Open UnixJunkie opened this issue 2 years ago • 10 comments

Some concentration values are negative. I don't think this is possible, so there is a problem somewhere that introduced those negative values.

UnixJunkie avatar Nov 24 '21 07:11 UnixJunkie

@UnixJunkie I have the same question. Did you find out why? One possible explanation is that the values are normalized.

@zswitten and @jswitten Thanks for your great contribution. Can you explain the MIC values? If I want to turn this into a binary classification problem (AMP vs. non-AMP), how should I choose the threshold value?

qm-intel avatar Mar 08 '23 03:03 qm-intel

I don't know exactly. It is possible that some standardization procedure shifted the original values.

UnixJunkie avatar Mar 08 '23 03:03 UnixJunkie

It's log (MIC in uM) so any MIC < 1uM will have a negative value

Regarding the threshold value, it's totally arbitrary, there's no one absolutely correct way except I guess everything in the database is a positive in some sense. But because thresholding is inherently arbitrary, we used regression in our paper

jswitten avatar Mar 08 '23 03:03 jswitten
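To illustrate why a log-transformed MIC can be negative, here is a minimal sketch. It assumes the logarithm is base 10; the thread only says "log (MIC in uM)", so the base is an assumption, although the sign argument holds for any base greater than 1. The function names are hypothetical and not part of the repository.

```python
import math


def um_to_log_mic(mic_um: float) -> float:
    """Convert an MIC in uM to log10(MIC). Base 10 is an assumption."""
    if mic_um <= 0:
        raise ValueError("MIC must be a positive concentration")
    return math.log10(mic_um)


def log_mic_to_um(log_mic: float) -> float:
    """Invert the transform: recover the MIC in uM from log10(MIC)."""
    return 10.0 ** log_mic


# Any MIC below 1 uM maps to a negative log value; 1 uM maps to exactly 0.
for mic in (0.5, 1.0, 32.0):
    print(f"MIC = {mic} uM -> log MIC = {um_to_log_mic(mic):.3f}")
```

So a negative entry in the file corresponds to a potent peptide (MIC below 1 uM), not to a negative concentration.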

We did convert to a classification problem in order to benchmark our results. I believe we used both totally random peptides and random peptides from UniProt as negatives, but I don't remember exactly; the details are in our paper.

jswitten avatar Mar 08 '23 03:03 jswitten

From the AMP literature, I would say that a MIC value <= 32 ug/mL might be a reasonable threshold. Given the quality of public data for this problem (this is a meta dataset, aggregating values from many different experiments in many different labs), I think that treating the problem as classification is much safer than treating it as regression.

UnixJunkie avatar Mar 08 '23 03:03 UnixJunkie

@jswitten Thanks for your reply,

I just drew a histogram of the MIC values:

[image: histogram of MIC values]

> It's log (MIC in uM) so any MIC < 1uM will have a negative value
>
> Regarding the threshold value, it's totally arbitrary, there's no one absolutely correct way except I guess everything in the database is a positive in some sense. But because thresholding is inherently arbitrary, we used regression in our paper

In the section of your paper entitled "Ensemble model", you mention:

"The prediction was either very close to 4 (meaning, a predicted inactive peptide) or somewhere between -1 and 3.5 (meaning, a predicted active peptide). Therefore, for the purposes of classification (Section 3.3), instead of averaging over each of the ensemble model predictions, we had each model in the ensemble “vote.” If more than half of the models predicted log MIC > 3.9, we classified the peptide as inactive and predicted log MIC = 4. Otherwise, we classified the peptide as active and the predicted log MIC (used for generation of the ROC curves in place of a probabilistic prediction) was the average over all predictions that were <3.9."

Is log MIC <= 3.5 (MIC in uM) your threshold for active peptides? In that case, the number of non-AMP (inactive peptide) samples available for training becomes very small compared to the active AMPs (class imbalance).

Sorry again for the long question, but I could not find a clear answer elsewhere in the literature, and your dataset is the only one I can use for my use case.

qm-intel avatar Mar 08 '23 07:03 qm-intel
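The voting rule in the passage quoted from the paper can be sketched as follows. This is only a reading of the quoted text, not the authors' code: the function name and constants are hypothetical, and a prediction of exactly 3.9 (which the quote does not cover) is treated as an active vote here.

```python
from statistics import mean

INACTIVE_CUTOFF = 3.9   # from the quoted passage: "log MIC > 3.9" counts as an inactive vote
INACTIVE_LOG_MIC = 4.0  # value assigned to peptides classified as inactive


def ensemble_vote(predictions):
    """Return (is_active, predicted_log_mic) from per-model log MIC predictions."""
    if not predictions:
        raise ValueError("need at least one model prediction")
    inactive_votes = sum(1 for p in predictions if p > INACTIVE_CUTOFF)
    if inactive_votes > len(predictions) / 2:
        # More than half of the models voted inactive.
        return False, INACTIVE_LOG_MIC
    # Otherwise: active, averaging only the predictions that did not vote inactive.
    # Assumption: a prediction of exactly 3.9 is kept in this average.
    active_predictions = [p for p in predictions if p <= INACTIVE_CUTOFF]
    return True, mean(active_predictions)


print(ensemble_vote([4.0, 4.0, 1.0]))
print(ensemble_vote([4.0, 1.0, 2.0]))
```

Read this way, 3.9 is a cutoff on each model's vote rather than a threshold applied to the measured MIC values in the dataset.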

@UnixJunkie Thanks for your reply,

> From the AMP literature, I would say that having a MIC value <= 32 ug/mL might be a reasonable threshold. Given the quality of public data for this problem (this is a meta dataset; aggregating values from many different experiments in many different labs, I think that treating

Could you please give the title of one of the papers that uses the MIC <= 32 ug/mL threshold?

Some papers suggest MIC <= 25 ug/mL instead. But in GRAMPA the scale is uM. Please see my question above. In that case, which threshold should I choose? Thanks

qm-intel avatar Mar 08 '23 08:03 qm-intel
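On the unit mismatch: converting a cutoff in ug/mL to uM requires the molecular weight of each peptide, since concentration in uM = (concentration in ug/mL) x 1000 / (molecular weight in g/mol). The sketch below shows that conversion. The function names are hypothetical, the molecular weight is an input you would have to compute or look up yourself, and the log is assumed to be base 10.

```python
import math


def ug_per_ml_to_um(conc_ug_per_ml: float, mol_weight_g_per_mol: float) -> float:
    """Convert a mass concentration (ug/mL) to a molar concentration (uM)."""
    if mol_weight_g_per_mol <= 0:
        raise ValueError("molecular weight must be positive")
    return conc_ug_per_ml * 1000.0 / mol_weight_g_per_mol


def is_active(log_mic_um: float, mol_weight_g_per_mol: float,
              threshold_ug_per_ml: float = 32.0) -> bool:
    """Apply a ug/mL cutoff to a log10(MIC in uM) value (base 10 is assumed)."""
    threshold_um = ug_per_ml_to_um(threshold_ug_per_ml, mol_weight_g_per_mol)
    return log_mic_um <= math.log10(threshold_um)


# Illustrative only: a hypothetical peptide of 2000 g/mol.
print(ug_per_ml_to_um(32.0, 2000.0))  # 32 ug/mL expressed in uM for that weight
print(is_active(1.0, 2000.0))
```

Because the conversion depends on molecular weight, a fixed ug/mL cutoff does not correspond to a single log uM cutoff across the whole dataset.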

What I did was declare all peptides in the dataset to be positives and generate negatives either by generating completely random peptides or by taking random peptides from UniProt; see Table 2, Table S3, and the related discussion. So the negatives were synthetically generated, and every peptide in the dataset is a positive because every peptide in GRAMPA has been reported to be antimicrobial against something. In other words, the threshold I used was "in GRAMPA vs. not in GRAMPA".

jswitten avatar Mar 08 '23 12:03 jswitten
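A minimal sketch of the "completely random peptides" idea, for readers who want to try it. It is not the procedure from the paper: uniform sampling over the 20 standard amino acids, matching each negative's length to a positive, and the function names are all assumptions, and the example sequences are made-up placeholders rather than GRAMPA entries.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def random_peptide(length: int, rng: random.Random) -> str:
    """Sample a peptide uniformly at random (uniform sampling is an assumption)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def random_negatives(positives, rng):
    """Generate one random peptide per positive, with the same length."""
    known = set(positives)
    negatives = []
    for seq in positives:
        if not seq:
            raise ValueError("empty sequence in positives")
        candidate = random_peptide(len(seq), rng)
        # Resample in the unlikely case that we hit a known positive.
        while candidate in known:
            candidate = random_peptide(len(seq), rng)
        negatives.append(candidate)
    return negatives


rng = random.Random(0)  # fixed seed so the sketch is reproducible
print(random_negatives(["KKLLKKLL", "GLFKALLK"], rng))
```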

@jswitten Thank you for the clarification

qm-intel avatar Mar 08 '23 13:03 qm-intel

Some authors from a US lab generate negatives by randomizing the order of the amino acids in the sequences of known actives. There is a rationale for this procedure: it destroys the hydrophobic moment of the known actives, which means such peptides can no longer perturb the membrane of microbes (the assumed mode of action for many antimicrobial peptides).

UnixJunkie avatar Mar 09 '23 00:03 UnixJunkie
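The shuffling approach can be sketched like this. The function name and retry limit are hypothetical, and the example sequence is a made-up placeholder. Note that a random shuffle preserves amino acid composition but does not by itself guarantee a lower hydrophobic moment; verifying that would need a separate calculation.

```python
import random


def shuffled_negative(seq: str, rng: random.Random, max_tries: int = 100) -> str:
    """Return a random permutation of seq that differs from the original order."""
    residues = list(seq)
    for _ in range(max_tries):
        rng.shuffle(residues)
        candidate = "".join(residues)
        if candidate != seq:
            return candidate
    # Happens for sequences with a single residue type, e.g. "AAAA".
    raise ValueError("could not produce a different ordering")


rng = random.Random(0)  # fixed seed so the sketch is reproducible
print(shuffled_negative("KKLLKKLLAG", rng))
```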