giotto-deep
giotto-deep copied to clipboard
Parallel training
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [] New feature (non-breaking change which adds functionality)
- [x] Breaking change (fix or feature that would cause existing functionality to change)
Description This work introduces the parallelisation of giotto-deep on multiple GPUs via two methods: pytorch's FSDP and pipeline tools.
The version of pytorch is increased to 1.13.1 to support the necessary features of FSDP.
To allow the parallelisation to be efficiently run, a new sampler GiottoSampler
is defined that combines DistributedSampler
and SubsetRandomSampler
.
A benchmark tool allows running a model with different batch sizes on different GPUs and number of GPUs to compare the parallelised and not-parallelised runs. A generator of Kubernetes pods takes some GKE details as input and outputs the pod configurations, allowing a user to build its own configurations for its own cluster.
Any other comments?
Checklist
- [x] I have read the guidelines for contributing.
- [ ] My code follows the code style of this project. I used
flake8
to check my Python changes. - [x] My change requires a change to the documentation.
- [x] I have updated the documentation accordingly.
- [ ] I have added tests to cover my changes.
- [x] All new and existing tests passed. I used
pytest
to check this on Python tests.
Check out this pull request onΒ
See visual diffs & provide feedback on Jupyter Notebooks.
Powered by ReviewNB
@yorickbrunet Thanks for your contribution. Please let me know when the PR is ready for review.
@raphaelreinauer All the tests pass on linux containers. I suppose you can start your review. Thanks!
@yorickbrunet Thanks for your contribution. Please let me know when the PR is ready for review.
Hi @yorickbrunet, thank you so much for your hard work on this! π I noticed the PR has a lot of file changes - 52, in fact! To help me, could you please add a description to the PR? A bit of context makes the review process much easier for me.
@raphaelreinauer I see that @yorickbrunet answered to all your comments. Are you satisfied or is there anything else you would like to discuss? If there is nothing more, I'll merge the PR.
@matteocao thanks for checking in and thanks @yorickbrunet for your answers so far.
However, there's one key aspect that still needs attention - the PR description:
Hi @yorickbrunet, thank you so much for your hard work on this! π I noticed the PR has a lot of file changes - 52, in fact! To help me, could you please add a description to the PR? A bit of context makes the review process much easier for me.
@yorickbrunet Could you please add a detailed description of all the features you added, then I can do a detailed PR review.
@raphaelreinauer we answered or fixed all comments. Can you have another look, close the issues that can be closed, and maybe approve the PR? Thanks
Hi @yorickbrunet, thank you so much for your hard work on this! π I noticed the PR has a lot of file changes - 52, in fact! To help me, could you please add a description to the PR? A bit of context makes the review process much easier for me.
Hi @yorickbrunet, could you please provide a detailed description of the features added in this PR to aid my review process? Thanks
Hi @yorickbrunet, thank you so much for your hard work on this! π I noticed the PR has a lot of file changes - 52, in fact! To help me, could you please add a description to the PR? A bit of context makes the review process much easier for me.
Hi @yorickbrunet, could you please provide a detailed description of the features added in this PR to aid my review process? Thanks
Hi @raphaelreinauer. I improved the general description of the PR, but the important description was there: This work introduces the parallelisation of giotto-deep on multiple GPUs via two methods: pytorch's FSDP and pipeline tools. Even though there are 52 files modified, the modifications are quite small in many of them. The most interesting file is gdeep/trainer/trainer.py
, where most of the modifications were done.
Thanks, @yorickbrunet, for the changes. I'll review the changes by next week Friday.
Unfortunately, I didn't have time to review the PR last weekend, but I'll do it this weekend. Sorry for the delay.
Hey @yorickbrunet and @matteocao, I noticed that there have yet to be any comments on the PR reviews. Could you give some comments? So that we can merge this PR.
Hey @yorickbrunet and @matteocao, I noticed that there have yet to be any comments on the PR reviews. Could you give some comments? So that we can merge this PR.
Hi @raphaelreinauer I answered all comments with either a modification of the code or some justification why it cannot/won't be modified. Basically, the project is closed and we cannot spend two more weeks adding tests, retesting the setup after modification, etc.
Hey @yorickbrunet and @matteocao, I noticed that there have yet to be any comments on the PR reviews. Could you give some comments? So that we can merge this PR.
@raphaelreinauer @VascoSch92 Can you please close all comments that you consider OK? So that we know where we stand. Thanks.
Hey @yorickbrunet and @matteocao, I noticed that there have yet to be any comments on the PR reviews. Could you give some comments? So that we can merge this PR.
@raphaelreinauer @VascoSch92 Can you please close all comments that you consider OK? So that we know where we stand. Thanks.
@yorickbrunet for me is good. You can resolve the discussions ;-) (i cannot)