Multi-GPU accelerator tutorial
Before submitting
- [x] Was this discussed/approved via a GitHub issue? (not needed for typos and docs improvements)
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?
What does this PR do?
Adds a multi-GPU / multi-node tutorial notebook.
PR review
Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃
Hello @awaelchli! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
- In the file `lightning_examples/distributed-training/main.py`:
  - Line 30:121: E501 line too long (231 > 120 characters)
  - Line 51:121: E501 line too long (497 > 120 characters)
  - Line 53:121: E501 line too long (123 > 120 characters)
  - Line 54:121: E501 line too long (222 > 120 characters)
  - Line 74:121: E501 line too long (185 > 120 characters)
  - Line 154:121: E501 line too long (139 > 120 characters)
  - Line 187:121: E501 line too long (483 > 120 characters)
  - Line 207:121: E501 line too long (280 > 120 characters)
  - Line 209:121: E501 line too long (322 > 120 characters)
  - Line 211:121: E501 line too long (386 > 120 characters)
  - Line 228:121: E501 line too long (314 > 120 characters)
  - Line 267:121: E501 line too long (281 > 120 characters)
  - Line 269:121: E501 line too long (312 > 120 characters)
  - Line 272:121: E501 line too long (376 > 120 characters)
  - Line 274:121: E501 line too long (390 > 120 characters)
  - Line 276:121: E501 line too long (197 > 120 characters)
  - Line 314:121: E501 line too long (149 > 120 characters)
  - Line 334:121: E501 line too long (177 > 120 characters)
  - Line 337:121: E501 line too long (235 > 120 characters)
  - Line 393:121: E501 line too long (193 > 120 characters)
  - Line 408:121: E501 line too long (310 > 120 characters)
  - Line 424:121: E501 line too long (193 > 120 characters)
  - Line 451:121: E501 line too long (490 > 120 characters)
  - Line 453:121: E501 line too long (275 > 120 characters)
  - Line 455:121: E501 line too long (236 > 120 characters)
  - Line 457:121: E501 line too long (346 > 120 characters)
  - Line 467:121: E501 line too long (211 > 120 characters)
  - Line 469:121: E501 line too long (184 > 120 characters)
  - Line 497:121: E501 line too long (180 > 120 characters)
  - Line 498:121: E501 line too long (130 > 120 characters)
  - Line 499:121: E501 line too long (213 > 120 characters)
  - Line 502:121: E501 line too long (348 > 120 characters)
  - Line 609:121: E501 line too long (237 > 120 characters)
  - Line 611:121: E501 line too long (281 > 120 characters)
  - Line 613:121: E501 line too long (308 > 120 characters)
  - Line 675:121: E501 line too long (152 > 120 characters)
  - Line 686:121: E501 line too long (186 > 120 characters)
  - Line 688:121: E501 line too long (372 > 120 characters)
  - Line 716:121: E501 line too long (273 > 120 characters)
  - Line 732:121: E501 line too long (210 > 120 characters)
  - Line 746:121: E501 line too long (275 > 120 characters)
  - Line 777:121: E501 line too long (206 > 120 characters)
  - Line 784:1: E402 module level import not at top of file
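The E501 findings above can be reproduced locally with flake8 (e.g. `flake8 --max-line-length=120 main.py`). As a dependency-free approximation, a quick stdlib-only sketch of the same check might look like this (the `long_lines` helper and the sample string are illustrative, not part of the repo):

```python
# Hedged sketch: a minimal stand-in for flake8's E501 check, flagging
# lines longer than the repository's 120-character limit.
def long_lines(text: str, limit: int = 120):
    return [
        (lineno, len(line))
        for lineno, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]

sample = "short\n" + "x" * 130
print(long_lines(sample))  # [(2, 130)]
```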
Comment last updated at 2021-07-07 15:35:13 UTC
Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.
@awaelchli The example of lightning-tutorials/lightning_examples/distributed-training/main.ipynb is very detailed, and that's good. However, it doesn't show how a multi-node job should be executed (e.g. with mpirun). On a K8s cluster, one would typically gather the IP addresses of the pods of a deployment (dedicated to the intended multi-node training execution) and set up passwordless SSH communication between the pods. How should PL be called so that this list of hostnames is available to it?
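For reference, PyTorch Lightning's built-in multi-node support does not require mpirun: each node runs the same training script, and the rendezvous information is read from environment variables (the same ones `torch.distributed` uses). A minimal sketch of what each pod would export, assuming the master address comes from the gathered pod IPs; the `node_env` helper, the `10.0.0.1` address, and the port are placeholders, not taken from the tutorial:

```python
# Hedged sketch: the environment variables PyTorch Lightning (via
# torch.distributed) reads to rendezvous across nodes. The hostnames
# would come from the gathered pod IPs; values here are placeholders.
def node_env(node_rank: int, master_addr: str = "10.0.0.1") -> dict:
    return {
        "MASTER_ADDR": master_addr,   # IP/hostname of the rank-0 node
        "MASTER_PORT": "29500",       # any free port, identical on all nodes
        "NODE_RANK": str(node_rank),  # 0..num_nodes-1, unique per node
    }

# On each pod, export these and launch the same script, e.g.:
#   MASTER_ADDR=10.0.0.1 MASTER_PORT=29500 NODE_RANK=1 python main.py
# with Trainer(num_nodes=<N>, gpus=<gpus per node>) inside the script.
env = node_env(1)
```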
> However, it doesn't show how a multi-node job should be executed (e.g. with mpirun).
That is quite a limitation of our current CI/CD tooling: we only run the notebooks on a single node with multiple GPUs :rabbit:
Codecov Report
Merging #52 (c49baa7) into main (0b676fa) will not change coverage. The diff coverage is n/a.
Additional details and impacted files
@@         Coverage Diff         @@
##           main     #52  +/-  ##
==================================
  Coverage    73%     73%
==================================
  Files         2       2
  Lines       382     382
==================================
  Hits        280     280
  Misses      102     102
@awaelchli was it outdated, or just pending for a long time?