
Inconsistent prediction: pred in logger vs pred from .predict function

Open chpoonag opened this issue 1 year ago • 2 comments

I have a GAE model trained on pyg_graph_train, and I use pyg_graph_test for prediction.

I tried this:

```python
pred, score = model.predict(pyg_graph_test,
                            label=pyg_graph_test.label,
                            return_score=True)
```

and the logger printed `Recall 0.7490 | Precision 0.7490 | AP 0.6226 | F1 0.7490`.

But when I check the pred and score myself:

```python
f1_score(y_true=pyg_graph_test.label, y_pred=pred)
```

I get 0.34680888045878483, which is inconsistent with the logged F1.

I found that the pred returned by the predict function is not the same as the one computed in the logger function (pygod.utils.utility), because the two use different threshold values. In the logger function:

```python
contamination = sum(target) / len(target)
threshold = np.percentile(score, 100 * (1 - contamination))
pred = (score > threshold).long()
```

In contrast, in the predict function (pygod.detector.base):

```python
if return_pred:
    pred = (score > self.threshold_).long()
```

where `self.threshold_` is determined in `_process_decision_score` as:

```python
self.threshold_ = np.percentile(self.decision_score_,
                                100 * (1 - self.contamination))
```

So which prediction (i.e., which threshold value) is correct? Or is there something I may have missed?

chpoonag commented on Nov 05 '24

Sorry for the confusion.

If you do have the labels, or you know exactly how many outliers are in the dataset, e.g., 15%, you can specify the contamination when initializing the detector, for example `model = DOMINANT(contamination=0.15)`. The model will then make the binary prediction pred based on this contamination.
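For instance, a minimal sketch (assuming binary 0/1 labels stored in `pyg_graph_train.label`, as in the original question) that sets the contamination from the observed outlier rate:

```python
from pygod.detector import DOMINANT

# Fraction of labeled outliers in the training graph (assumes 0/1 labels).
contamination = pyg_graph_train.label.float().mean().item()

# predict() will then threshold scores at the (1 - contamination) percentile.
model = DOMINANT(contamination=contamination)
```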

However, in many cases our users do not have any labels, so we set the default contamination to 0.1, and the threshold changes correspondingly. That's why you got ~0.3 F1 from the returned pred. The ~0.7 F1 in the logger is evaluated with labels, which means the contamination is effectively set to its ideal value.
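If the model is already fitted, you can also re-threshold the returned raw scores yourself instead of refitting. A minimal sketch, assuming the `score` tensor (on CPU) and the test labels from your example:

```python
import numpy as np

# True outlier rate of the test set (assumes 0/1 labels).
contamination = pyg_graph_test.label.float().mean().item()

# Apply the same percentile rule the logger uses to the raw scores.
threshold = np.percentile(score.numpy(), 100 * (1 - contamination))
pred_at_true_rate = (score > threshold).long()
```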

To avoid setting a threshold at all, we also provide threshold-free metrics, AUC, AP, and Recall@k, for easier evaluation.
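A minimal sketch of that threshold-free evaluation using the helpers in `pygod.metric` (the choice of `k=100` here is purely illustrative):

```python
from pygod.metric import (eval_roc_auc,
                          eval_average_precision,
                          eval_recall_at_k)

# All three take the binary labels and the raw outlier scores;
# none of them depends on a decision threshold.
auc = eval_roc_auc(pyg_graph_test.label, score)
ap = eval_average_precision(pyg_graph_test.label, score)
rec = eval_recall_at_k(pyg_graph_test.label, score, k=100)
```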

kayzliu commented on Nov 08 '24

Hello. I've also been using GAE for anomaly detection recently, but I keep running into errors. Could you share your working code for reference?

Here is my error message. Thank you very much.

```
RuntimeError: pyg::neighbor_sample() Expected a value of type 'Optional[Tensor]' for argument 'edge_weight' but instead found type 'bool'.
```

And here is my code:

```python
from pygod.detector import GAE
from pygod.utils import load_data
from sklearn.metrics import roc_auc_score, average_precision_score

# Function to train the anomaly detector
def train_anomaly_detector(model, graph):
    return model.fit(graph)

# Function to evaluate the anomaly detector
def eval_anomaly_detector(model, graph):
    outlier_scores = model.decision_function(graph)
    auc = roc_auc_score(graph.y.numpy(), outlier_scores)
    ap = average_precision_score(graph.y.numpy(), outlier_scores)
    print(f'AUC Score: {auc:.3f}')
    print(f'AP Score: {ap:.3f}')

graph = load_data('weibo')

# Initialize and evaluate the model
graph.y = graph.y.bool()

if hasattr(graph, 'edge_weight'):
    graph.edge_weight = None

model = GAE(epoch=100)
model = train_anomaly_detector(model, graph)
eval_anomaly_detector(model, graph)
```

withMoonstar commented on Nov 27 '24