GraphCLIP
Official implementation of GraphCLIP: Enhancing Transferability in Graph Foundation Models for Text-Attributed Graphs

Updates
- [x] [2024.11.01] We have uploaded the source datasets.
- [x] [2024.11.03] We have uploaded the target datasets and the pretrained checkpoint.
- [ ] How to apply GraphCLIP on customized datasets.
1. Environment setup
conda create -n graphclip python=3.10
conda activate graphclip
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install torch_geometric
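After installation, a quick sanity check (a minimal snippet, not part of this repository) can confirm that PyTorch sees the GPU and that torch_geometric imports cleanly:

```python
# Quick environment sanity check (not part of the repository).
import torch
import torch_geometric

print(torch.__version__)            # expect 2.4.1
print(torch.cuda.is_available())    # expect True on a CUDA 12.1 machine
print(torch_geometric.__version__)
```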
2. Datasets
For source data
- This repository includes the smallest source dataset, i.e., pubmed. For the larger-scale source datasets, please download the generated graph summaries:
| Datasets | Links |
|---|---|
| OGBN-ArXiv | Google Drive |
| ArXiv_2023 | Google Drive |
| Reddit | Google Drive |
| OGBN-Products | Google Drive |
- Once downloaded, unzip the files and place them in the `summary` directory.
- For convenience, we also provide the processed data, which includes the graph structure and node features. Please download it from the links below:
| Datasets | Links |
|---|---|
| OGBN-ArXiv | Google Drive |
| ArXiv_2023 | Google Drive |
| Reddit | Google Drive |
| OGBN-Products | Google Drive |
- Once downloaded, unzip the files and place them in the `processed_data` directory.
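The processed files bundle the graph structure and node features. Assuming they are serialized PyTorch Geometric `Data` objects (the file name below is hypothetical; check the unzipped contents for the real one), a quick inspection could look like this:

```python
# Illustrative only: assumes the processed files are serialized PyG Data objects.
# The file name below is hypothetical; check the unzipped contents for the real one.
import torch
from torch_geometric.data import Data

graph = torch.load("processed_data/ogbn-arxiv.pt")  # hypothetical path
assert isinstance(graph, Data)
print(graph)                          # e.g. Data(x=[N, d], edge_index=[2, E], y=[N])
print(graph.num_nodes, graph.num_edges)
```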
For target data
- For the target datasets, only the processed data is needed: download it, unzip it, and place it in the `processed_data` directory.
| Datasets | Links |
|---|---|
| WikiCS | Google Drive |
| Instagram | Google Drive |
| Ele-Photo | Google Drive |
| Ele-Computers | Google Drive |
| Books-History | Google Drive |
- To generate subgraphs for each target dataset, run `bash gen_target_subg.sh` (a conceptual sketch of local subgraph extraction follows below).
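For intuition only, the sketch below shows one generic way to extract a local subgraph around each node with PyTorch Geometric utilities. It is not the logic of `gen_target_subg.sh`; the repository's sampler may differ.

```python
# Conceptual sketch of per-node local subgraph extraction, NOT the repository's
# gen_target_subg.sh logic; it only illustrates the general idea with PyG utilities.
import torch
from torch_geometric.data import Data
from torch_geometric.utils import k_hop_subgraph

def extract_local_subgraph(data: Data, center: int, num_hops: int = 2) -> Data:
    """Return the induced k-hop subgraph around `center` as a new Data object."""
    node_idx, edge_index, mapping, _ = k_hop_subgraph(
        center, num_hops, data.edge_index,
        relabel_nodes=True, num_nodes=data.num_nodes,
    )
    sub = Data(x=data.x[node_idx], edge_index=edge_index)
    sub.center = mapping  # position of the center node inside the subgraph
    return sub
```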
3. Pretraining on source data
Downloading and loading our pretrained checkpoint
To get started, download our released checkpoint and unzip the content. Place the extracted files into the `checkpoints` directory. You can then use this checkpoint directly on your target datasets, as outlined in the next section.
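Internally, loading a checkpoint boils down to the usual `torch.load` / `load_state_dict` pattern. The snippet below is only an illustration of that pattern; the file name and model constructor are placeholders, not the repository's actual identifiers.

```python
# Generic checkpoint-loading pattern; file and model names are placeholders,
# not the repository's actual identifiers.
import torch

state = torch.load("checkpoints/pretrained_graphclip.pt", map_location="cpu")  # placeholder path
# model = build_graphclip_model(...)   # placeholder: defined by the repository's own code
# model.load_state_dict(state)
# model.eval()
```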
Or training from scratch
# We provide the smallest source data (pubmed) for running our codes
# single gpu
CUDA_VISIBLE_DEVICES=0 python train.py --source_data pubmed --batch_size 1024 --epochs 30
# multiple gpus
CUDA_VISIBLE_DEVICES=0,1 python train.py --source_data pubmed --batch_size 1024 --epochs 30
# reproduce our results
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py --source_data ogbn-arxiv+arxiv_2023+pubmed+ogbn-products+reddit --batch_size 7200 --epochs 30
We use 8 A100 (40 GB) GPUs for pretraining, which takes roughly 7 hours.
The code supports Data Parallel training, so you can assign multiple GPUs as in the commands above.
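For reference, multi-GPU Data Parallel in PyTorch typically amounts to wrapping the model as in the generic sketch below; this is the standard PyTorch pattern, not the repository's exact training code.

```python
# Generic PyTorch Data Parallel pattern (not the repository's exact training code).
import torch

def wrap_for_data_parallel(model: torch.nn.Module) -> torch.nn.Module:
    """Replicate the model across all visible GPUs when more than one is available."""
    if torch.cuda.device_count() > 1:
        model = torch.nn.DataParallel(model)
    return model.cuda() if torch.cuda.is_available() else model
```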
4. Zero-shot learning on target data
We provide a sample target dataset (citeseer) for running our code. By default, this will load your pretrained checkpoint.
CUDA_VISIBLE_DEVICES=0 python eval.py --target_data citeseer
More target datasets can be evaluated, e.g., --target_data cora+citeseer+wikics+instagram+photo+computer+history
To reproduce our experiments, pass the name of the downloaded pretrained checkpoint via the --ckpt flag.
CUDA_VISIBLE_DEVICES=0 python eval.py --target_data cora+citeseer+wikics+instagram+photo+computer+history --ckpt pretrained_graphclip
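Conceptually, CLIP-style zero-shot classification compares each (sub)graph embedding against text embeddings of the class names and picks the most similar one. The sketch below shows only that scoring step, with placeholder encoders; it is not the repository's eval.py.

```python
# CLIP-style zero-shot scoring: cosine similarity between graph embeddings and
# label-text embeddings. The encoders producing these tensors are placeholders.
import torch
import torch.nn.functional as F

def zero_shot_predict(graph_emb: torch.Tensor, label_emb: torch.Tensor) -> torch.Tensor:
    """graph_emb: [N, d] graph embeddings; label_emb: [C, d] class-text embeddings."""
    graph_emb = F.normalize(graph_emb, dim=-1)
    label_emb = F.normalize(label_emb, dim=-1)
    logits = graph_emb @ label_emb.t()   # cosine similarities, shape [N, C]
    return logits.argmax(dim=-1)         # predicted class index per node/subgraph
```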
(a) Generate graph summaries locally
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python generate_summary.py --dataset arxiv_2023 --walk_step 128 --restart 0.8
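The --walk_step and --restart flags suggest that neighborhood context is gathered with a random walk with restart before summaries are generated. The sketch below is a minimal, generic walk under that assumption; the actual generate_summary.py may work differently.

```python
# Minimal random-walk-with-restart sketch, assuming that is what --walk_step and
# --restart control in generate_summary.py; the real script may differ.
import random

def rwr_neighborhood(adj: dict, start: int, walk_step: int = 128, restart: float = 0.8) -> set:
    """adj maps node -> list of neighbors; returns nodes visited by the walk."""
    visited, cur = {start}, start
    for _ in range(walk_step):
        neighbors = adj.get(cur, [])
        if not neighbors or random.random() < restart:
            cur = start                    # restart the walk at the seed node
        else:
            cur = random.choice(neighbors)
        visited.add(cur)
    return visited
```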