Deprecate DGL: freeze on CPU (torch >=2.5) + slow ABI movement

Open lmeyerov opened this issue 1 month ago • 0 comments

Summary

DGL's CPU builds are effectively frozen and lag behind Torch ABI upgrades. The latest CPU wheels stop at DGL 2.1.0 (Torch ~2.0); newer DGL versions (e.g., 2.4.0) are published only as CUDA wheels. With Torch 2.5–2.9, we cannot run DGL-based features or tests on CPU without building DGL from source. We should move to PyTorch Geometric, which is actively maintained.

Evidence

DGL wheel index (https://data.dgl.ai/wheels/repo.html) shows CPU wheels only up to 2.1.0; 2.4.0 is CUDA-only.
Installing Torch 2.8 or 2.9 alongside the DGL CPU wheel fails: GraphBolt pulls in torchdata, which pins Torch to 2.0.x.
Our server Dockerfile uses Torch 2.9.0 + DGL 2.4.0 cu124 (GPU works), but CI CPU runners cannot match it.
DGL has published no CPU builds for newer Torch ABIs in ~16 months.
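
To check which combination a given environment (CI runner, dev box, server image) actually has, a small stdlib-only script along these lines may help. It is a sketch: the helper name `installed_versions` is hypothetical, and the package names `torch`, `dgl`, and `torchdata` are assumed to be the distribution names installed by pip.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("torch", "dgl", "torchdata")):
    """Return {package name: installed version string, or None if absent}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Package is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Pasting this output from a CPU runner and from the server image into the issue would make the mismatch easy to compare.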

Impact

DGL-dependent tests (embed_utils, networks) break or get skipped on CPU with current Torch.
CI CPU matrix cannot validate DGL paths, increasing regression risk.

Proposal

Begin deprecating DGL and plan a migration to PyTorch Geometric (PyG) for GNN features.
Document the supported Torch/DGL matrix:
- CPU: Torch ~2.0 + DGL 2.1.0 only
- GPU: Torch 2.8/2.9 + DGL 2.4.0 cu124
Mark CPU DGL tests as xfail or disable them until migration.
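
One possible way to encode the matrix and drive the skips from it is sketched below. The names `SUPPORTED`, `major_minor`, and `dgl_supported` are hypothetical, the table only restates the versions listed in this issue (it is not an official DGL compatibility table), and "Torch ~2.0" is assumed to mean any 2.0.x release.

```python
import re

# Support matrix as proposed in this issue; update when the matrix changes.
SUPPORTED = {
    "cpu": {"torch": ("2.0",), "dgl": "2.1.0"},
    "gpu": {"torch": ("2.8", "2.9"), "dgl": "2.4.0"},
}


def major_minor(version_string):
    """Reduce a version such as '2.9.0+cu124' to its 'major.minor' part."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_string!r}")
    return f"{match.group(1)}.{match.group(2)}"


def dgl_supported(device, torch_version):
    """Return True if the matrix lists this Torch release for the device."""
    return major_minor(torch_version) in SUPPORTED[device]["torch"]


# Possible use in a conftest.py (assumes pytest and torch are installed):
#
#   import pytest, torch
#   requires_dgl_cpu = pytest.mark.skipif(
#       not dgl_supported("cpu", torch.__version__),
#       reason="No DGL CPU wheel for this Torch version",
#   )
```

Using `skipif` rather than `xfail` keeps the CPU run green without executing imports that are known to fail; `xfail` would be the alternative if we want the tests to start reporting once a compatible wheel appears.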

Tasks

Update CI and guides to align with the supported matrix; explicitly skip CPU DGL tests.
Draft the PyG migration plan (feature parity, loaders, batching).
Update documentation to reflect DGL limitations and upcoming deprecation.

Nov 30 '25 07:11 lmeyerov