A collection of graph data used for semi-supervised node classification.


  • GraphData
    • Glance of graphs
    • Single Graph
      • Planetoid datasets: CORA, CiteSeer and PubMed
      • Amazon Computers and Amazon Photo
      • Coauthor CS and Coauthor Physics
      • DBLP
      • CiteSeer_Full
      • CORA_Full and CORA-ML
      • Reddit
        • Source-SNAP
        • Source-DGL
        • Source-TUM
      • NELL
      • Flickr and BlogCatalog
      • KDD Cup 2020
      • UAI2010
      • ACM
      • MAG-Scholar
      • Karateclub Datasets
        • Node-level
        • Graph-level

Usage of .npz datasets

import os.path as osp
import numpy as np

def load_npz(filepath):
    """Load a .npz dataset into a plain dict."""
    filepath = osp.abspath(osp.expanduser(filepath))

    # Allow the extension to be omitted, e.g. load_npz('cora').
    if not filepath.endswith('.npz'):
        filepath = filepath + '.npz'
    if not osp.isfile(filepath):
        raise ValueError(f"{filepath} doesn't exist.")

    with np.load(filepath, allow_pickle=True) as loader:
        loader = dict(loader)
        # Unwrap object and string arrays into native Python objects.
        for k, v in loader.items():
            if v.dtype.kind in {'O', 'U'}:
                loader[k] = v.tolist()
        return loader

For example, load_npz('cora') returns a dict, which may contain the following keys:

  • adj_matrix: scipy.sparse.csr_matrix, adjacency matrix. NOTE: the adjacency matrix might not be symmetric.
  • node_attr: scipy.sparse.csr_matrix or numpy.ndarray, node attribute matrix
  • node_label: scipy.sparse.csr_matrix or numpy.ndarray, node labels
  • metadata: dict, additional metadata such as text.
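Since adj_matrix might not be symmetric, it can help to check before treating a graph as undirected. The snippet below is a minimal sketch on a toy matrix; the helper names is_symmetric and symmetrize are hypothetical and not part of this collection, and it assumes adj_matrix is a scipy.sparse matrix as described above.

```python
import numpy as np
import scipy.sparse as sp


def is_symmetric(adj):
    """Return True if the sparse matrix equals its transpose."""
    return (adj != adj.T).nnz == 0


def symmetrize(adj):
    """Make the graph undirected via the elementwise maximum of A and A^T."""
    return adj.maximum(adj.T).tocsr()


# Toy directed graph with edges 0->1 and 1->2 (hypothetical data,
# standing in for loader['adj_matrix']).
adj = sp.csr_matrix(np.array([[0, 1, 0],
                              [0, 0, 1],
                              [0, 0, 0]]))
sym = symmetrize(adj)
print(is_symmetric(adj), is_symmetric(sym))
```

Taking the elementwise maximum keeps existing edge weights and avoids doubling entries that are already present in both directions, which adding A and A^T would do.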

Glance of graphs

name           num_nodes  num_edges  num_attrs  density  is_directed
karate_club           34         78          0  6.7474%            0
polblogs           1,490     19,025          0  0.8569%            1
cora               2,708      5,429      1,433  0.0740%            1
cora_ml            2,995      8,416      2,879  0.0938%            1
acm                3,025     13,128      1,870  0.1435%            0
uai                3,067     28,314      4,973  0.3010%            0
citeseer           3,312      4,715      3,703  0.0430%            1
citeseer_full      4,230      5,358        602  0.0299%            1
blogcatalog        5,196    171,743      8,189  0.6361%            0
flickr             7,575    239,738     12,047  0.4178%            0
amazon_photo       7,650    143,663        745  0.2455%            1
amazon_cs         13,752    287,209        767  0.1519%            1
dblp              17,716     52,867      1,639  0.0168%            0
coauthor_cs       18,333     81,894      6,805  0.0244%            0
pubmed            19,717     44,324        500  0.0114%            0
cora_full         19,793     65,311      8,710  0.0167%            1
coauthor_phy      34,493    247,962      8,415  0.0208%            0
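The density column is not defined in the text. The listed values appear consistent with num_edges divided by num_nodes squared; this is inferred from the table rows rather than stated by the source. A small sketch, where density is a hypothetical helper name:

```python
def density(num_nodes, num_edges):
    """Share of the num_nodes x num_nodes adjacency entries that are edges.

    Assumed definition, inferred from the table values.
    """
    return num_edges / (num_nodes ** 2)


print(f"{density(34, 78):.4%}")      # karate_club row -> 6.7474%
print(f"{density(2708, 5429):.4%}")  # cora row -> 0.0740%
```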

Single Graph

Planetoid datasets: CORA, CiteSeer, PubMed and NELL

Citation network and knowledge graph (NELL) datasets. In the citation networks, nodes are documents and edges are citation links. Label rate denotes the number of labeled nodes that are used for training divided by the total number of nodes in each dataset.
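The label rate definition above can be written out as a one-line computation. This is only a sketch: label_rate is a hypothetical helper, and the 140 training labels used for cora are an assumption (20 per class over 7 classes, as in the commonly used Planetoid split), not a figure given in this document.

```python
def label_rate(num_train_labels, num_nodes):
    """Labeled training nodes divided by the total number of nodes."""
    return num_train_labels / num_nodes


# cora has 2,708 nodes; 140 training labels is an assumed split size.
rate = label_rate(140, 2708)
print(f"{rate:.3f}")  # 0.052
```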

Amazon Computers and Amazon Photo

Amazon Computers and Amazon Photo are segments of the Amazon co-purchase graph, where nodes represent goods, edges indicate that two goods are frequently bought together, node features are bag-of-words encoded product reviews, and class labels are given by the product category.

Coauthor CS and Coauthor Physics

Coauthor CS and Coauthor Physics are co-authorship graphs based on the Microsoft Academic Graph from the KDD Cup 2016 challenge. Here, nodes are authors, connected by an edge if they co-authored a paper; node features represent paper keywords for each author’s papers, and class labels indicate the most active fields of study for each author.

The above datasets are collected from:

Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. "Pitfalls of Graph Neural Network Evaluation." Relational Representation Learning Workshop, NeurIPS 2018.


CORA_Full: a citation network dataset, an extended version of CORA.

CORA-ML: a citation network extracted from the entire CORA network.

The above datasets are collected from:

Aleksandar Bojchevski and Stephan Günnemann. "Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking." International Conference on Learning Representations.


Reddit: 233K nodes, 11.6M edges, 602 node features.



Flickr and BlogCatalog

BlogCatalog: a blog community social network dataset containing 5,196 users as nodes, 171,743 edges indicating user interactions, and 8,189 attribute categories denoting the keywords of their blogs. Users could register their blogs into six predefined classes, which are used as labels.

Flickr: a benchmark attributed social network dataset containing 7,575 nodes. Each node is a Flickr user and each attribute category is a tag related to the photos shared by users. There are 239,738 undirected edges in this network, which indicate follower relationships among users. The nine groups that users have joined are used as target labels.

Both datasets are collected from

KDD Cup 2020

  • KDDS1

  • KDDS2: the labels of the last 50,000 test nodes are released at



Wenjun Wang, Xiao Liu, Pengfei Jiao, Xue Chen, and Di Jin. "A Unified Weakly Supervised Framework for Community Detection and Semantic Matching." Pacific-Asia Conference on Knowledge Discovery and Data Mining.


This network is extracted from the ACM dataset, where nodes represent papers and there is an edge between two papers if they have the same author. All the papers are divided into 3 classes (Database, Wireless Communication, Data Mining). The features are the bag-of-words representations of paper keywords.

Link1:

Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Peng Cui, P. Yu, and Yanfang Ye. "Heterogeneous Graph Attention Network."

Link2:

MAG-Scholar is a new benchmark dataset based on the Microsoft Academic Graph (MAG). Nodes represent papers, edges denote citations, and node features correspond to a bag-of-words representation of paper abstracts. The graph is augmented with "ground-truth" node labels corresponding to the papers’ field of study.

Link:

Karateclub Datasets

Datasets from Karateclub


  • deezer
  • facebook
  • github
  • lastfm
  • twitch
  • wikipedia


  • reddit10k


FB15K_URL = ""
FILENAMES = [
    "FB15k/freebase_mtr100_mte100-train.txt",
    "FB15k/freebase_mtr100_mte100-valid.txt",
    "FB15k/freebase_mtr100_mte100-test.txt",
]


Datasets from MUSAE

Synthetic node classification dataset from PDN:

