
Issue with ImageMol performance

Open · SoodabehGhaffari opened this issue 1 year ago · 6 comments

Hello,

Attachments: training_class.out.txt, train_reg.out.txt

I am reaching out to seek help regarding the ImageMol package. I have used ImageMol to predict the antibiotic activity of compounds on both continuous and binary data, but I am running into some issues and hope you or a member of your group could help. For classification, the log file reports an AUC of 0.725, yet all the predicted probabilities are close to zero, so every y_pred is zero. I do not understand why all the y_pred are zero despite the high AUC, and I would appreciate any suggestions for solving this.

For regression, all the y_score have the same value.

I really appreciate your help in this matter.

For your reference, I am attaching the log files for both classification and regression. Could you see if there is anything unusual in the log files?

Best Regards,
Soo

SoodabehGhaffari avatar Jan 02 '24 16:01 SoodabehGhaffari

Hi, Soo

First of all, thank you for your interest in our work.

In some cases, AUC and accuracy may conflict. This is because accuracy is calculated at a single default cutoff (such as 0.5), while AUC is calculated over all possible cutoffs, which makes it more robust.
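A quick sketch of that difference on imbalanced toy data (y_true and y_pro here are made-up stand-ins for your ground truth and predicted probabilities):

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# 20% positives; every probability sits below the default 0.5 cutoff
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pro = np.array([0.01, 0.02, 0.01, 0.03, 0.02, 0.01, 0.04, 0.02, 0.20, 0.30])

print(roc_auc_score(y_true, y_pro))         # 1.0 - every positive outranks every negative
print(accuracy_score(y_true, y_pro > 0.5))  # 0.8 - the 0.5 cutoff predicts all zeros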

I am not sure whether your data has a class imbalance. If so, you can try increasing the weight of the minority class.

You can also send me the predicted probabilities and corresponding ground truth as pickle or numpy files so that I can do a better analysis.

HongxinXiang avatar Jan 03 '24 04:01 HongxinXiang

Thank you for the prompt reply. Yes, our training and test data are highly imbalanced (around 1% positive). Could you guide me on how to increase the weight of the minority class, and what value should I set it to?

Here are the ground truth and the predicted probabilities for the test data:

df_pro_imagemol_classification.csv df_test.csv.txt

Also, do you have any idea why the predicted values from the regression model are all the same on the test data? Here are the predicted values for the regression model:

df_scores.csv

I really appreciate your help.

Best Regards,
Soo

SoodabehGhaffari avatar Jan 03 '24 15:01 SoodabehGhaffari

Hi, Soo

The following figure is the ROC curve I drew from the provided df_pro_imagemol_classification.csv and df_test.csv.txt:

[figure: ROC curve for the provided predictions]

Given that your sample is unbalanced, there are two approaches to consider:

  1. Find the best classification threshold based on the ROC curve:

import numpy as np
from sklearn.metrics import roc_curve

def find_optimal_cutoff(tpr, fpr, threshold):
    # Youden's J statistic: pick the threshold that maximizes TPR - FPR
    y = tpr - fpr
    index = np.argmax(y)
    optimal_threshold = threshold[index]
    point = [fpr[index], tpr[index]]
    return optimal_threshold, point

fpr, tpr, threshold = roc_curve(y_true, y_pro)
find_optimal_cutoff(tpr, fpr, threshold)
# my output is (0.001, [0.22790055248618785, 0.5833333333333334])

So, I use 0.001 as the classification threshold:

a = (np.array(y_pro) > 0.001).astype(int)  # probabilities above the cutoff become 1, the rest 0
(a == y_true).sum() / len(y_true)          # len(y_true) is 2208 here

This gives an accuracy of 0.7686.

  2. Add the weight (pos_weight) parameter in BCEWithLogitsLoss. You can look at the PyTorch docs; it is very easy to get started. Since the minority class is only 1% of the data, you might as well try setting the minority-class weight to 100 and the majority-class weight to 1 (see the sketch below).
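A minimal sketch of that suggestion, assuming a single-task binary setup (the 100:1 weight is illustrative, not tuned):

import torch
import torch.nn as nn

# pos_weight scales the loss on positive examples; with ~1% positives,
# (#negatives / #positives) is roughly 100, so that is a reasonable starting point.
pos_weight = torch.tensor([100.0])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8, 1)                     # raw model outputs (no sigmoid applied)
targets = torch.randint(0, 2, (8, 1)).float()  # binary ground-truth labels
loss = criterion(logits, targets)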

Anyway, in cases of extremely imbalanced samples, I recommend reporting the AUC metric since it is a more comprehensive metric and better suited to imbalanced data.

In addition, I am not sure why the predicted values from the regression model are all the same on the test data, but I suspect your regression labels have a very wide range, causing the model to collapse during training. I suggest applying a normalization method to the labels.

HongxinXiang avatar Jan 04 '24 06:01 HongxinXiang

Thank you for the detailed response. I really appreciate your help.

  1. Does it make sense to have such a small classification threshold, e.g. 0.001?

  2. I checked BCEWithLogitsLoss in ImageMol. Is it correct to change the code as follows:

weights = None
pos_weight = None
if args.task_type == "classification":
    if args.weighted_CE:
        labels_train_list = labels_train[labels_train != -1].flatten().tolist()
        count_labels_train = Counter(labels_train_list)
        imbalance_weight = {key: 1 - count_labels_train[key] / len(labels_train_list)
                            for key in count_labels_train.keys()}
        weights = np.array(sorted(imbalance_weight.items(), key=lambda x: x[0]),
                           dtype="float")[:, 1]

        num_positives = count_labels_train[1]  # assuming 1 is the label for the positive class
        num_negatives = count_labels_train[0]  # assuming 0 is the label for the negative class

        # per the PyTorch docs, pos_weight should be (#negatives / #positives)
        ratio_neg_pos = num_negatives / num_positives if num_positives != 0 else 1

        pos_weight = torch.tensor([ratio_neg_pos])

criterion = nn.BCEWithLogitsLoss(reduction="none", pos_weight=pos_weight)
  3. Regarding the regression model: the training data for classification and regression are the same, with one difference: the continuous labels were converted to binary for classification. Since the classification data is imbalanced, I am sure the label distribution is skewed; most of the labels fall between zero and 10%. Do you have any suggestions for a normalization method for the labels?

Thank you.

Best Regards,
Soo

SoodabehGhaffari avatar Jan 04 '24 15:01 SoodabehGhaffari

Sorry for the late reply.

  1. This is an open question and hard for me to answer. In my experience, I would not focus on model accuracy with imbalanced data because it is not meaningful there; I would focus on the AUC, since it does not depend on a threshold.
  2. It is correct.
  3. You can try StandardScaler from the scikit-learn library; see the sketch below.
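A minimal sketch of label normalization with StandardScaler (the label values here are made up to match the 0-10% range you described):

import numpy as np
from sklearn.preprocessing import StandardScaler

y_train = np.array([0.01, 0.03, 0.002, 0.08, 0.05])  # toy regression labels

scaler = StandardScaler()
# fit on the training labels only, then train the model on the scaled labels
y_train_scaled = scaler.fit_transform(y_train.reshape(-1, 1)).ravel()

# after training, map the model's predictions back to the original label scale
preds_scaled = np.array([-0.5, 0.0, 1.2])            # stand-in model outputs
preds = scaler.inverse_transform(preds_scaled.reshape(-1, 1)).ravel()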

HongxinXiang avatar Jan 07 '24 07:01 HongxinXiang

Hello, I wanted to give you an update. I tried using positive weights for the minority class in the loss function, as we discussed, but the issue persists: the predicted probabilities are still very small. Do you have any other suggestions, or do you think there is another way to fix this?

Thanks a lot.

Best Regards,
Soo

SoodabehGhaffari avatar Jan 11 '24 14:01 SoodabehGhaffari