MGTAB
A Multi-relational Graph-Based Twitter Account Detection Benchmark
MGTAB: A Multi-Relational Graph-Based Twitter Account Detection Benchmark
Introduction
MGTAB is the first standardized graph-based benchmark for stance and bot detection. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. For more details, please refer to the MGTAB paper.
Distribution of labels in annotations.
Stance label | Count | Percentage (%) | Bot label | Count | Percentage (%) |
---|---|---|---|---|---|
neutral | 3,776 | 37.02 | human | 7,451 | 73.06 |
against | 3,637 | 35.66 | bot | 2,748 | 26.94 |
support | 2,786 | 27.32 | | | |
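The percentages in the label table follow directly from the counts. A minimal sketch that recomputes them; the variable and function names are illustrative and do not come from the MGTAB code:

```python
# Label counts copied from the annotation table.
stance_counts = {"neutral": 3776, "against": 3637, "support": 2786}
bot_counts = {"human": 7451, "bot": 2748}


def percentages(counts):
    """Return each label's share of the total, in percent, rounded to 2 places."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


# Both tasks annotate the same 10,199 users, so the two totals agree.
print(sum(stance_counts.values()), percentages(stance_counts))
print(sum(bot_counts.values()), percentages(bot_counts))
```

Both label sets sum to 10,199, which matches the number of annotated users stated in the introduction.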
Multiple relations in MGTAB.
Our proposed dataset has seven types of user relationships.
MGTAB

Edge type | followers | friends | mention | reply | quoted | URL | hashtag |
---|---|---|---|---|---|---|---|
Count | 308,120 | 412,575 | 114,516 | 223,466 | 77,631 | 263,800 | 300,000 |

MGTAB-large

Edge type | followers | friends | mention | reply | quoted | URL | hashtag |
---|---|---|---|---|---|---|---|
Count | 31,990,488 | 49,668,723 | 7,135,192 | 1,018,834 | 182,296 | 51,281 | 7,950,896 |
Environment
python 3.7
scikit-learn 1.0.2
torch 1.8.1+cu111
torch-cluster 1.5.9
torch-scatter 2.0.6
torch-sparse 0.6.9
torch-spline-conv 1.2.1
torch-geometric 2.0.4
pytorch-lightning 1.5.0
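Before training, it can help to confirm that the installed packages match the versions listed above. The following is a sketch only, not part of the MGTAB code: the distribution names in `EXPECTED` are assumed to be the ones pip uses, the helper names are hypothetical, and on Python 3.7 the `importlib_metadata` backport would need to be installed.

```python
try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:  # Python 3.7: assumes the importlib_metadata backport is installed
    import importlib_metadata as metadata

# Versions copied from the list above; distribution names are an assumption.
EXPECTED = {
    "scikit-learn": "1.0.2",
    "torch": "1.8.1+cu111",
    "torch-cluster": "1.5.9",
    "torch-scatter": "2.0.6",
    "torch-sparse": "0.6.9",
    "torch-spline-conv": "1.2.1",
    "torch-geometric": "2.0.4",
    "pytorch-lightning": "1.5.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Map each package whose installed version differs to (expected, found)."""
    mismatches = {}
    for name, wanted in expected.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

The comparison is an exact string match, so a build with a different local suffix (for example a CPU-only `torch` wheel) is reported as a mismatch even if it would work in practice.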
Train Model
To start the training process:
Train GNN models
python MGTAB-GNN.py --task stance --model GCN --relation_select 0 1 --random_seed 0 1 2 3 4
python MGTAB-GNN.py --task bot --model RGCN --relation_select 0 1 --random_seed 0 1 2 3 4
Train Machine Learning models
python MGTAB-ML.py --task stance --models_list 1 2 3 --random_seed 0 1 2 3 4
python MGTAB-ML.py --task bot --models_list 4 5 6 7 --random_seed 0 1 2 3 4
Train GNN models in parallel on multiple GPUs
python GNN_sample_large.py --task bot --relation_select 0 1 2 3 4 5 6 --model RGT --GPU_num 4
python GNN_sample_large.py --task bot --relation_select 0 1 2 3 4 --model SHGN --GPU_num 4
python GNN_sample_large.py --task stance --relation_select 0 1 --model GCN --GPU_num 4
python GNN_sample_large.py --task stance --relation_select 0 --model GAT --GPU_num 4
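The `--relation_select` option takes a list of relation indices. As an illustration of what such a selection might do, the sketch below filters a toy edge list by relation index. It assumes that indices 0 to 6 follow the column order of the edge tables (followers, friends, mention, reply, quoted, URL, hashtag); the README does not state this mapping, and the function name and data layout are hypothetical rather than taken from the MGTAB scripts.

```python
# Assumed mapping from --relation_select indices to edge types, taken from
# the column order of the edge tables; the repository may order them differently.
RELATIONS = ["followers", "friends", "mention", "reply", "quoted", "URL", "hashtag"]


def select_relations(edges, edge_types, relation_select):
    """Keep edges whose relation index is selected, renumbering types to 0..k-1."""
    new_id = {old: new for new, old in enumerate(relation_select)}
    kept_edges, kept_types = [], []
    for edge, rel in zip(edges, edge_types):
        if rel in new_id:
            kept_edges.append(edge)
            kept_types.append(new_id[rel])
    return kept_edges, kept_types


# Toy graph: (source, target) user pairs with one relation index per edge.
edges = [(0, 1), (1, 2), (2, 0), (0, 3)]
edge_types = [0, 1, 4, 6]

kept_edges, kept_types = select_relations(edges, edge_types, [0, 1])
print([RELATIONS[i] for i in [0, 1]])  # names of the selected relations
print(kept_edges, kept_types)
```

Renumbering the kept relation indices to a contiguous range is a design choice made here because relational models typically size their per-relation weights by the number of selected relations.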
Baseline performance
Stance detection performance on MGTAB
methods | type | accuracy | precision | recall | f1-score |
---|---|---|---|---|---|
AdaBoost | F | 74.59 $_{1.41}$ | 74.60 $_{1.35}$ | 74.02 $_{1.61}$ | 73.88 $_{1.47}$ |
Random Forest | F | 79.62 $_{0.68}$ | 80.04 $_{0.43}$ | 78.83 $_{0.98}$ | 79.04 $_{0.82}$ |
Decision Tree | F | 66.92 $_{0.93}$ | 66.34 $_{1.02}$ | 66.23 $_{1.06}$ | 66.03 $_{0.84}$ |
SVM | F | 81.23 $_{0.66}$ | 81.40 $_{0.71}$ | 80.86 $_{1.09}$ | 80.71 $_{0.78}$ |
KNN | F | 76.25 $_{1.32}$ | 75.54 $_{1.41}$ | 75.70 $_{1.37}$ | 75.48 $_{1.37}$ |
Logistic Regression | F | 79.51 $_{1.00}$ | 79.33 $_{0.98}$ | 78.83 $_{1.17}$ | 78.98 $_{1.11}$ |
GCN | G | 81.35 $_{0.58}$ | 81.08 $_{0.30}$ | 80.19 $_{0.56}$ | 80.08 $_{0.56}$ |
GraphSAGE | G | 83.33 $_{1.22}$ | 82.52 $_{1.63}$ | 83.45 $_{0.63}$ | 82.72 $_{1.34}$ |
GAT | G | 82.19 $_{1.23}$ | 81.72 $_{1.19}$ | 81.68 $_{1.16}$ | 81.04 $_{1.24}$ |
HGT | G | 83.29 $_{0.44}$ | 81.63 $_{0.58}$ | 81.51 $_{0.76}$ | 81.82 $_{0.34}$ |
S-HGN | G | 85.32 $_{0.53}$ | 83.93 $_{0.67}$ | 83.65 $_{0.65}$ | 84.42 $_{0.43}$ |
BotRGCN | G | 84.71 $_{1.43}$ | 83.43 $_{1.23}$ | 84.08 $_{0.94}$ | 84.30 $_{1.44}$ |
RGT | G | 87.78 $_{0.43}$ | 85.22 $_{0.89}$ | 84.40 $_{0.74}$ | 86.86 $_{0.43}$ |
Bot detection performance on MGTAB
methods | type | accuracy | precision | recall | f1-score |
---|---|---|---|---|---|
AdaBoost | F | 90.12 $_{0.92}$ | 88.51 $_{1.33}$ | 89.10 $_{0.92}$ | 87.71 $_{1.10}$ |
Random Forest | F | 89.52 $_{0.44}$ | 88.92 $_{0.49}$ | 86.72 $_{1.15}$ | 86.83 $_{0.53}$ |
Decision Tree | F | 87.13 $_{0.51}$ | 83.81 $_{0.72}$ | 83.39 $_{1.06}$ | 83.70 $_{0.74}$ |
SVM | F | 88.68 $_{1.40}$ | 85.73 $_{1.84}$ | 85.73 $_{1.84}$ | 85.31 $_{1.73}$ |
KNN | F | 85.78 $_{0.84}$ | 82.28 $_{1.22}$ | 80.49 $_{0.64}$ | 81.28 $_{0.66}$ |
Logistic Regression | F | 88.49 $_{1.31}$ | 85.69 $_{1.69}$ | 84.41 $_{1.96}$ | 84.97 $_{1.67}$ |
GCN | G | 85.81 $_{1.32}$ | 77.40 $_{2.12}$ | 84.37 $_{1.73}$ | 78.33 $_{1.67}$ |
GraphSAGE | G | 88.71 $_{1.24}$ | 85.33 $_{1.83}$ | 86.15 $_{2.55}$ | 85.44 $_{1.08}$ |
GAT | G | 86.96 $_{1.28}$ | 79.71 $_{2.96}$ | 84.88 $_{1.52}$ | 82.33 $_{2.12}$ |
HGT | G | 90.28 $_{0.29}$ | 85.35 $_{0.33}$ | 85.97 $_{0.41}$ | 87.52 $_{0.37}$ |
S-HGN | G | 91.42 $_{0.43}$ | 87.40 $_{0.67}$ | 86.73 $_{0.64}$ | 88.72 $_{0.58}$ |
BotRGCN | G | 89.60 $_{0.82}$ | 85.21 $_{1.81}$ | 87.07 $_{1.38}$ | 87.16 $_{0.74}$ |
RGT | G | 92.12 $_{0.37}$ | 88.08 $_{0.43}$ | 86.64 $_{0.25}$ | 90.41 $_{0.47}$ |
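Each table entry pairs a value with a subscript. Assuming the value is the mean and the subscript is the standard deviation over the runs launched with `--random_seed 0 1 2 3 4`, an entry could be produced as sketched below. The per-seed scores are made-up numbers, `summarize` is a hypothetical helper, and the use of the population standard deviation is also an assumption.

```python
from statistics import mean, pstdev


def summarize(scores):
    """Format per-seed scores as 'mean $_{std}$', matching the table notation."""
    m = mean(scores)
    s = pstdev(scores)  # population std; the repository may use the sample std
    return f"{m:.2f} $_{{{s:.2f}}}$"


# Hypothetical accuracies (in percent) from five runs with seeds 0 to 4.
per_seed_accuracy = [80.0, 81.0, 82.0, 81.0, 81.0]
print(summarize(per_seed_accuracy))
```

Replacing `pstdev` with `statistics.stdev` gives the sample standard deviation, which is slightly larger for five runs.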
Licensing
The MGTAB dataset is released under the CC BY-NC-ND 4.0 license. The code in the MGTAB evaluation framework is released under the MIT license.
Dataset download
For SemEval-2016 T6, visit the SemEval2016 repository. For SemEval-2019 T7, visit the SemEval2019 GitHub repository. For TwiBot-20, visit the TwiBot-20 GitHub repository. For TwiBot-22, visit the TwiBot-22 GitHub repository. For other bot detection datasets, please visit the Bot Repository.
MGTAB is available at Google Drive. MGTAB-large (which contains 400,000 unlabeled users) is available at Google Drive. We also offer the standardized Cresci-15 at Google Drive. After downloading these datasets, unzip them into the "./Dataset" directory.
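The unzip step can also be scripted. The sketch below assumes the downloads are `.zip` archives collected in a single folder; the `./downloads` folder and the `unzip_all` helper are hypothetical, and the actual archive names on Google Drive are not given here.

```python
import zipfile
from pathlib import Path


def unzip_all(download_dir, target_dir="./Dataset"):
    """Extract every .zip archive found in download_dir into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(download_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(archive.name)
    return extracted


if __name__ == "__main__":
    # "./downloads" is a placeholder for wherever the archives were saved.
    print(unzip_all("./downloads"))
```

The function returns the names of the archives it extracted, so an empty list indicates that no `.zip` files were found in the download folder.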