
Graphium 3.0

Open • DomInvivo opened this issue 7 months ago • 1 comment

Changelogs

Moving to Graphium 3.0! This is a large PR that regroups many other PRs. For specific changes, please consult the original PRs.

  • See PR #510 for the changes in C++
  • See PR #517 for the changes related to torchmetrics
  • See PR #521 for the changes to the node ordering
  • TODO: link to other PRs when available

discussion related to that PR

To-dos before merging

Before merging Graphium 3.0, we need to complete the following tests.

Validating the C++

Assigned to @WenkelF and @AnujaSomthankar, with @ndickson-nvidia to help fix issues if any

The C++ changes were introduced in PR #510, which included many unit tests and validation on the ToyMix dataset. What remains is to validate that we can reproduce the experimental results of pre-training a MolGPS model.

  • [x] Train a 10M model on LargeMix and validate that the pre-training and finetuning performance match Graphium 2.X
  • [x] ~~Validate that training is faster~~ Training is not faster on our cluster, but it is expected to be faster where disk reads are the bottleneck.
  • [ ] Validate that we can run inference on a new dataset without caching (since caching is only for labels)
  • [ ] Train a 1B model and make sure that we match the metrics
  • [ ] Validate that the finetuning performance is consistent
  • [ ] Make sure the documentation in the Readme.md contains all info for installing the C++ libraries
  • [ ] Clearly document the inputs / outputs of every C++ function @ndickson-nvidia (see PR #521)

Validating the torchmetrics

Assigned to @WenkelF and @AnujaSomthankar, with @DomInvivo to help fix issues if any

  • [x] First test the torchmetrics PR #517 independently
  • [x] Train a 10M model on LargeMix and validate that the metrics are the same as Graphium 2.0, that the speed is the same, and that RAM usage is lower (during validation and testing)
  • [x] Validate that the mean_pred, mean_target, grad_norm, train_loss, train_loss_{task}, and other metrics all get logged properly to Wandb
  • [ ] Then merge with the C++ changes on the graphium 3.0 branch, and test again
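For the "metrics are the same" checks above, one possible way to compare two runs is a small tolerance-based diff over the logged values. This is only a sketch: `compare_metrics` is a hypothetical helper (not part of Graphium), the tolerances are arbitrary, and the numbers below are made up for illustration rather than taken from a real run.

```python
import math


def compare_metrics(reference, candidate, rel_tol=1e-3, abs_tol=1e-6):
    """Return {name: (ref, cand)} for metrics that are missing or differ beyond tolerance.

    Hypothetical helper; NaN values never compare as close, so they are always flagged.
    """
    mismatches = {}
    for name in sorted(set(reference) | set(candidate)):
        ref = reference.get(name)
        cand = candidate.get(name)
        if ref is None or cand is None:
            # Metric logged in only one of the two runs
            mismatches[name] = (ref, cand)
        elif not math.isclose(ref, cand, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches[name] = (ref, cand)
    return mismatches


# Made-up values, for illustration only
run_2x = {"train_loss": 0.5321, "mean_pred": 0.1042, "grad_norm": 1.87}
run_3x = {"train_loss": 0.5323, "mean_pred": 0.1042, "grad_norm": 2.40}

mismatches = compare_metrics(run_2x, run_3x)
print(mismatches)
```

With these made-up values, only `grad_norm` is reported. The same diff could be reused for the 1 GPU vs. multi-GPU consistency check, with looser tolerances if needed.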

Improving the CUDA support

Assigned to @WenkelF and @AnujaSomthankar

  • [ ] Validate multi-GPU training of the 10M model with DDP on 4 GPUs, and that the results are consistent with 1 GPU
  • [ ] Bump the version of cuda-version and remove restriction to 11.2. Close #512

Fixing the node ordering issue

Assigned to @ndickson-nvidia , see PR #521

  • [x] Resolve issue #502 with PR #521
  • [x] Add unit-tests for the node ordering issue

Support for Mixed precision

Supporting mixed precision should be easy with Lightning. However, some of the tasks are very sparse and require float32. What we suggest is a custom mixed precision that leaves the task heads in float32 and only affects the body of the GNN network.

  • [ ] Implement custom mixed precision.
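A rough sketch of what body-only mixed precision could look like in plain PyTorch, using `torch.autocast` around the body and explicitly disabling it around the heads. Everything here is an assumption for illustration: `BodyOnlyAutocast` is a hypothetical wrapper, the `nn.Sequential` body stands in for the GNN, and the head names are invented. It is not how Graphium implements it.

```python
import torch
from torch import nn


class BodyOnlyAutocast(nn.Module):
    """Hypothetical wrapper: autocast the shared body, keep task heads in float32."""

    def __init__(self, body, heads, device_type="cpu", low_dtype=torch.bfloat16):
        super().__init__()
        self.body = body
        self.heads = nn.ModuleDict(heads)
        self.device_type = device_type
        self.low_dtype = low_dtype

    def forward(self, x):
        # Body: autocast-eligible ops (e.g. linear layers) run in the low-precision dtype
        with torch.autocast(device_type=self.device_type, dtype=self.low_dtype):
            h = self.body(x)
        # Heads: autocast is disabled and activations are cast back to float32
        with torch.autocast(device_type=self.device_type, enabled=False):
            h = h.float()
            return {name: head(h) for name, head in self.heads.items()}


body = nn.Sequential(nn.Linear(8, 16), nn.ReLU())  # stand-in for the GNN body
heads = {"task_a": nn.Linear(16, 1), "task_b": nn.Linear(16, 3)}  # invented task names
model = BodyOnlyAutocast(body, heads)
out = model(torch.randn(4, 8))
print({name: pred.dtype for name, pred in out.items()})
```

If autocast is already enabled around the whole forward pass, the nested `enabled=False` region should still keep the heads in float32. The losses for the sparse tasks would then also need to be computed on the float32 outputs.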

Removing the IPU support #525

Assigned to @DomInvivo

Since Graphcore is no longer maintaining IPU support in Lightning, it is best to remove it from Graphium 3.0. IPU support will remain available in Graphium 2.0 and can be brought back later if necessary. (Graphcore approved this.)

  • [x] Remove custom IPU functions
  • [x] Remove Lightning wrappers for IPUs
  • [x] Remove actions and unit-tests for IPUs

Command line

Assigned to @WenkelF

  • [ ] Some command line improvements for training and finetuning
  • [ ] Improving documentation

Packaging

Assigned to @Andrewq11

  • [ ] Make sure the documentation, both readme and docs, is aligned with the latest changes
  • [ ] Make sure that the package can be installed via conda, and that C++ dependencies resolve automatically
  • [ ] Make sure that the package can be installed via pip, and that C++ dependencies resolve automatically
  • [ ] Make sure that we install the minimal set of GCC compilers needed for the code to work
  • [ ] Make sure that we don't need to install graphium and graphium_cpp as 2 different packages
  • [ ] Build the documentation for the C++ part of the code so that it appears in the docs. ChatGPT suggests this is possible with Doxygen.
  • [ ] Support numpy >= 2.0

Polaris

  • [ ] Add data download from Polaris

Linting

  • [ ] Run black linting on the code. Wait until the last minute to avoid cluttering the PR.

DomInvivo • Jul 15 '24 14:07