
Graphium 3.0

Open DomInvivo opened this issue 1 year ago • 2 comments

Changelogs

Moving to Graphium 3.0! This is a large PR that regroups many other PRs, so for specific changes, please consult the original PRs.

  • See PR #510 for the changes in C++
  • See PR #517 for the changes related to torchmetrics
  • See PR #521 for the changes to the node ordering
  • TODO: link to other PRs when available

Discussion related to this PR

To-dos before merging

Before merging Graphium 3.0, we need to run the following tests.

Validating the C++

Assigned to @WenkelF and @AnujaSomthankar, with @ndickson-nvidia to help fix any issues

The C++ changes were introduced in PR #510, with extensive unit tests and validation on the ToyMix dataset. What remains is validating that we can reproduce the experimental results of pre-training a MolGPS model.

  • [x] Train a 10M model on LargeMix and validate that the pre-training and finetuning performance match Graphium 2.x
  • [x] ~~Validate that training is faster~~ Training is not faster on our cluster, but is expected to be faster where disk reads limit throughput.
  • [x] Validate that we can run inference on a new dataset without caching (since caching is only for labels)
  • [x] Train a 1B model and make sure that we match the metrics
  • [x] Validate that the finetuning performance is consistent
  • [x] Make sure the documentation in the Readme.md contains all info for installing the C++ libraries
  • [x] Clearly document the inputs / outputs of every C++ function @ndickson-nvidia (see PR #521)

Validating the torchmetrics

Assigned to @WenkelF and @AnujaSomthankar, with @DomInvivo to help fix issues if any

  • [x] First test the torchmetrics PR #517 independently
  • [x] Train a 10M model on LargeMix and validate that the metrics match Graphium 2.0, that speed is unchanged, and that RAM usage is lower (during validation and testing)
  • [x] Validate that the mean_pred, mean_target, grad_norm, train_loss, train_loss_{task}, and other metrics all get logged properly to Wandb
  • [x] Then merge with the C++ changes on the graphium 3.0 branch, and test again

Improving the cuda support

Assigned to @WenkelF and @AnujaSomthankar

  • [x] Validate multi-GPU training of the 10M model with DDP on 4 GPUs, and that results are consistent with 1 GPU
  • [x] Bump cuda-version and remove the restriction to 11.2. Closes #512
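The expectation that a 4-GPU DDP run matches a 1-GPU run rests on DDP's gradient averaging being equivalent to a full-batch gradient when the per-rank loss is a mean. A toy sketch of that equivalence in plain PyTorch (hypothetical shapes, not Graphium code), simulating 4 ranks in a single process:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(8, 1)  # stand-in for the 10M-parameter model
loss_fn = nn.MSELoss()
x, y = torch.randn(16, 8), torch.randn(16, 1)

# Reference: full-batch gradient, as a single GPU would compute it.
model.zero_grad()
loss_fn(model(x), y).backward()
ref_grad = model.weight.grad.clone()

# DDP-style: 4 "ranks" each process one shard of the batch,
# then the per-rank gradients are averaged (what all-reduce does).
shard_grads = []
for xs, ys in zip(x.chunk(4), y.chunk(4)):
    model.zero_grad()
    loss_fn(model(xs), ys).backward()
    shard_grads.append(model.weight.grad.clone())
avg_grad = torch.stack(shard_grads).mean(dim=0)
```

Up to floating-point noise, `avg_grad` equals `ref_grad`, which is why results should be consistent as long as the effective batch size and seeds are controlled.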

Fixing the node ordering issue

Assigned to @ndickson-nvidia , see PR #521

  • [x] Resolve issue #502 with PR #521
  • [x] Add unit-tests for the node ordering issue

Support for Mixed precision

Supporting mixed precision should be easy with Lightning. However, some of our tasks are very sparse and require float32. We therefore suggest a custom mixed-precision scheme that applies only to the body of the GNN network, leaving the task heads in full precision.

  • ~~[ ] Implement custom mixed precision.~~
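The suggested scheme above was not implemented, but the idea can be sketched with a scoped `torch.autocast` around the body only, with the heads run outside the autocast region. This is a minimal illustration with hypothetical module names (`body`, `heads` are stand-ins, not Graphium classes), not the actual implementation:

```python
import torch
from torch import nn

class SelectiveMixedPrecision(nn.Module):
    """Run the network body under autocast; keep task heads in float32."""

    def __init__(self, body: nn.Module, heads: nn.ModuleDict):
        super().__init__()
        self.body = body
        self.heads = heads

    def forward(self, x: torch.Tensor) -> dict:
        # Reduced precision for the (expensive) body only.
        # bfloat16 is used here so the sketch also runs on CPU.
        with torch.autocast(device_type=x.device.type, dtype=torch.bfloat16):
            h = self.body(x)
        h = h.float()  # task heads see full-precision activations
        return {name: head(h) for name, head in self.heads.items()}

model = SelectiveMixedPrecision(
    body=nn.Linear(8, 16),  # stand-in for the GNN trunk
    heads=nn.ModuleDict({"task_a": nn.Linear(16, 1)}),
)
out = model(torch.randn(4, 8))
```

Because the heads and the loss run outside autocast, the sparse tasks keep float32 numerics while the bulk of the compute still benefits from reduced precision.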

Removing the IPU support #525

Assigned to @DomInvivo

Since Graphcore is no longer maintaining IPU support in Lightning, it is best to remove it from Graphium 3.0. IPU support will remain in 2.0 and can be brought back later if necessary. (We got Graphcore's approval for this.)

  • [x] Remove custom IPU functions
  • [x] Remove Lightning wrappers for IPUs
  • [x] Remove actions and unit-tests for IPUs

Command line

Assigned to @WenkelF

  • [x] Some command line improvements for training and finetuning
  • [x] Improving documentation

Packaging

Assigned to @Andrewq11

  • [x] Make sure the documentation, both the readme and the docs, is aligned with the latest changes
  • [x] Make sure that the package can be installed via conda, and that C++ dependencies resolve automatically
  • ~~[ ] Make sure that the package can be installed via pip~~ The pip package doesn't ship the required headers, so we can't release there.
  • [x] Make sure that we install the minimal amount of GCC compilers needed for the code to work
  • [x] Make sure that we don't need to install graphium and graphium_cpp as 2 different packages
  • [x] Build the documentation for the C++ part of the code so that it appears in the docs. ChatGPT says we can with the doxygen package
  • ~~[ ] Support numpy >= 2.0~~ Need to wait for PyG to support numpy >= 2.0

For now, we are pinned to rdkit == 2024.03.4 because more recent releases are missing headers, and we can only support conda since the pip package also lacks the headers.

Polaris

  • ~~[ ] Add data download from Polaris~~ Not for initial release

Linting

  • [ ] Run black linting on the code. Wait until the last minute to avoid cluttering the PR.

2024-11-14 update of what's missing

  • [ ] Header comments for the C++ files @ndickson-nvidia
  • [x] Look at the rdkit pinning to see whether we can support more versions, and open an issue -> pinning removed!

DomInvivo avatar Jul 15 '24 14:07 DomInvivo

Codecov Report

Attention: Patch coverage is 87.11656% with 42 lines in your changes missing coverage. Please review.

Project coverage is 67.99%. Comparing base (3efa7f1) to head (47b7d1c). Report is 9 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #519      +/-   ##
==========================================
- Coverage   71.52%   67.99%   -3.54%     
==========================================
  Files          93       84       -9     
  Lines        8708     7354    -1354     
==========================================
- Hits         6228     5000    -1228     
+ Misses       2480     2354     -126     
| Flag | Coverage Δ |
|------|------------|
| unittests | 67.99% <87.11%> (-3.54%) :arrow_down: |

Flags with carried forward coverage won't be shown.

| Components | Coverage Δ |
|------------|------------|
| ipu | 29.75% <ø> (-19.40%) :arrow_down: |

codecov[bot] avatar Aug 09 '24 03:08 codecov[bot]

PR for packaging tasks here: https://github.com/datamol-io/graphium/pull/531

Andrewq11 avatar Nov 06 '24 22:11 Andrewq11