Graphium 3.0
Changelogs
Moving to Graphium 3.0! This is a large PR that regroups many other PRs. For specific changes, please consult the original PRs.
- See PR #510 for the changes in C++
- See PR #517 for the changes related to torchmetrics
- See PR #521 for the changes to the node ordering
- TODO: link to other PRs, and the discussion related to each PR, when available
To-dos before merging
Before merging Graphium 3.0, we need to do the following tests.
Validating the C++
Assigned to @WenkelF and @AnujaSomthankar, with @ndickson-nvidia to help fix issues if any
The C++ changes were brought in by PR #510, with many unit tests and validation on the ToyMix dataset. What's left to be done is validating that we can reproduce the experimental results of pre-training a MolGPS model.
- [x] Train a 10M model on LargeMix and validate that the pre-training and finetuning performance match Graphium 2.x
- [x] ~~Validate that training is faster~~ Training is not faster on our cluster, but expected to be faster where disk reading is limiting the speed.
- [x] Validate that we can run inference on a new dataset without caching (since caching is only for labels)
- [x] Train a 1B model and make sure that we match the metrics
- [x] Validate that the finetuning performance is consistent
- [x] Make sure the documentation in the README.md contains all the info for installing the C++ libraries
- [x] Clearly document the inputs / outputs of every C++ function @ndickson-nvidia (see PR #521)
Validating the torchmetrics
Assigned to @WenkelF and @AnujaSomthankar, with @DomInvivo to help fix issues if any
- [x] First test the torchmetrics PR #517 independently
- [x] Train a 10M model on LargeMix and validate that the metrics are the same as in Graphium 2.0, that the speed is the same, and that RAM usage is lower (during validation and testing)
- [x] Validate that the mean_pred, mean_target, grad_norm, train_loss, train_loss_{task}, and other metrics all get logged properly to Wandb
- [x] Then merge with the C++ changes on the graphium 3.0 branch, and test again
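The lower RAM usage during validation comes from the torchmetrics-style design of accumulating running statistics per batch rather than storing every prediction until the epoch ends. A minimal plain-Python sketch of the idea (not Graphium's actual code; `RunningMSE` is a hypothetical stand-in):

```python
# Hypothetical sketch: accumulate O(1) running statistics per batch,
# as torchmetrics-style metrics do, instead of storing all predictions.
class RunningMSE:
    def __init__(self):
        self.sum_sq_err = 0.0  # running sum of squared errors
        self.count = 0         # number of samples seen so far

    def update(self, preds, targets):
        # Only constant-size state is kept, regardless of dataset size.
        self.sum_sq_err += sum((p - t) ** 2 for p, t in zip(preds, targets))
        self.count += len(preds)

    def compute(self):
        return self.sum_sq_err / self.count


metric = RunningMSE()
metric.update([1.0, 2.0], [1.0, 4.0])  # batch 1
metric.update([3.0], [0.0])            # batch 2
print(metric.compute())  # → 4.333333333333333  (i.e. (0 + 4 + 9) / 3)
```

The result is identical to computing MSE over the concatenated predictions, which is why the swap can leave the logged metrics unchanged while reducing peak memory.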
Improving the CUDA support
Assigned to @WenkelF and @AnujaSomthankar
- [x] Validate multi-GPU training of the 10M model with DDP on 4 GPUs, and that the results are consistent with 1 GPU
- [x] Bump the version of `cuda-version` and remove the restriction to 11.2. Close #512
Fixing the node ordering issue
Assigned to @ndickson-nvidia, see PR #521
- [x] Resolve issue #502 with PR #521
- [x] Add unit-tests for the node ordering issue
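Unit tests for node-ordering bugs typically check that order-independent quantities survive a relabeling of the nodes. A small illustrative sketch (not Graphium's actual tests; the helper names are hypothetical):

```python
# Hypothetical sketch: the degree sequence of a graph must be
# invariant (up to sorting) under any permutation of node labels.
def degrees(num_nodes, edges):
    d = [0] * num_nodes
    for u, v in edges:
        d[u] += 1
        d[v] += 1
    return d


def permute(edges, perm):
    # Relabel each node i as perm[i].
    return [(perm[u], perm[v]) for u, v in edges]


edges = [(0, 1), (0, 2), (2, 3)]
perm = [3, 1, 0, 2]  # node 0 -> 3, node 1 -> 1, node 2 -> 0, node 3 -> 2

d_before = sorted(degrees(4, edges))
d_after = sorted(degrees(4, permute(edges, perm)))
assert d_before == d_after  # order-independent property is preserved
```

A node-ordering bug would typically show up as a mismatch between such invariants computed before and after featurization.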
Support for Mixed precision
Supporting mixed precision should be easy with Lightning. However, some of the tasks are very sparse and require float32. What we suggest is a custom mixed-precision scheme that doesn't affect the task heads, and only applies to the body of the GNN network.
- ~~[ ] Implement custom mixed precision.~~
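Such a scheme could look roughly like the following PyTorch sketch. This is an assumption about how it might be structured, not a committed design; `SelectivePrecisionModel`, `body`, and `heads` are placeholder names:

```python
import torch
import torch.nn as nn


class SelectivePrecisionModel(nn.Module):
    """Hypothetical sketch: reduced precision for the GNN body only."""

    def __init__(self, in_dim=8, hidden=16, n_tasks=3):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden, 1) for _ in range(n_tasks))

    def forward(self, x):
        # The body runs under autocast (bfloat16 here, for a CPU-friendly demo).
        with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
            h = self.body(x)
        # Task heads stay in float32, so losses on very sparse tasks
        # are not computed in reduced precision.
        h = h.float()
        return [head(h) for head in self.heads]


model = SelectivePrecisionModel()
outputs = model(torch.randn(4, 8))
assert all(o.dtype == torch.float32 for o in outputs)
```

In a Lightning setup, the same effect could plausibly be achieved by keeping the trainer in full precision and applying autocast only inside the body's forward pass, as above.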
Removing the IPU support #525
Assigned to @DomInvivo
Since Graphcore is no longer maintaining IPU support in Lightning, it is best to remove it from Graphium 3.0. IPU support will remain in 2.0 and can be brought back later if necessary. (We got approval from Graphcore for this.)
- [x] Remove custom IPU functions
- [x] Remove Lightning wrappers for IPUs
- [x] Remove actions and unit-tests for IPUs
Command line
Assigned to @WenkelF
- [x] Some command line improvements for training and finetuning
- [x] Improving documentation
Packaging
Assigned to @Andrewq11
- [x] Make sure the documentation, both the README and the docs, is aligned with the latest changes
- [x] Make sure that the package can be installed via conda, and that C++ dependencies resolve automatically
- ~~[ ] Make sure that the package can be installed via pip~~ the pip package doesn't have headers, so we can't release there.
- [x] Make sure that we install the minimal amount of GCC compilers needed for the code to work
- [x] Make sure that we don't need to install `graphium` and `graphium_cpp` as 2 different packages
- [x] Build the documentation for the C++ part of the code so that it appears in the docs. ChatGPT says we can with the Doxygen package
- ~~[ ] Support numpy >= 2.0~~ Need to wait for PyG to support `numpy>=2.0`
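For the C++ documentation item above, Doxygen-style header comments on each function could look like the following sketch. The function is a hypothetical stand-in, not one from `graphium_cpp`:

```cpp
#include <cstddef>
#include <vector>

/**
 * @brief Computes the out-degree of each node from a flat edge list.
 *
 * @param edges      Flat array of (src, dst) node indices, length 2 * num_edges.
 * @param num_nodes  Total number of nodes in the graph.
 * @return           Vector of length num_nodes holding each node's out-degree.
 */
std::vector<std::size_t> node_out_degrees(const std::vector<std::size_t>& edges,
                                          std::size_t num_nodes) {
    std::vector<std::size_t> degrees(num_nodes, 0);
    for (std::size_t i = 0; i + 1 < edges.size(); i += 2) {
        ++degrees[edges[i]];  // count the source of each (src, dst) pair
    }
    return degrees;
}
```

Doxygen picks up the `@brief`/`@param`/`@return` tags directly, so comments in this style can be rendered into the docs without duplicating the information elsewhere.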
For now, we are constrained to `rdkit==2024.03.4` due to missing headers in more recent releases, and we can only support conda because the pip packages are missing headers.
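These constraints imply a conda environment roughly like the following sketch. The channel and package names are assumptions for illustration, not the official recipe:

```yaml
# Hypothetical environment sketch reflecting the constraints above.
name: graphium-env
channels:
  - conda-forge
dependencies:
  - rdkit==2024.03.4   # pinned: more recent releases are missing headers
  - graphium           # conda-only: the pip packages lack the needed headers
```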
Polaris
- ~~[ ] Add data download from Polaris~~ Not for initial release
Linting
- [ ] Run `black` linting on the code. Wait until the last minute to avoid cluttering the PR.
2024-11-14 update of what's missing
- [ ] Header comments for the C++ files @ndickson-nvidia
- [x] Looking at the rdkit pinning issue to see if we can support more versions, and open an issue -> pinning removed!
Codecov Report
Attention: Patch coverage is 87.11656% with 42 lines in your changes missing coverage. Please review.
Project coverage is 67.99%. Comparing base (`3efa7f1`) to head (`47b7d1c`). Report is 9 commits behind head on main.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main     #519      +/-   ##
==========================================
- Coverage   71.52%   67.99%    -3.54%
==========================================
  Files          93       84        -9
  Lines        8708     7354     -1354
==========================================
- Hits         6228     5000     -1228
+ Misses       2480     2354      -126
```
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 67.99% <87.11%> (-3.54%) | :arrow_down: |
Flags with carried forward coverage won't be shown.
| Components | Coverage Δ | |
|---|---|---|
| ipu | 29.75% <ø> (-19.40%) | :arrow_down: |
PR for packaging tasks here: https://github.com/datamol-io/graphium/pull/531