Multi-GPU accelerator tutorial

Open awaelchli opened this issue 3 years ago • 4 comments

Before submitting

  • [x] Was this discussed/approved via a GitHub issue? (not needed for typos and docs improvements)
  • [ ] Did you make sure to update the docs?
  • [ ] Did you write any new necessary tests?

What does this PR do?

Adds a multi-GPU / multi-node tutorial book.

PR review

Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

awaelchli, Jul 05 '21 03:07

Hello @awaelchli! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

  • Line 30:121: E501 line too long (231 > 120 characters)
  • Line 51:121: E501 line too long (497 > 120 characters)
  • Line 53:121: E501 line too long (123 > 120 characters)
  • Line 54:121: E501 line too long (222 > 120 characters)
  • Line 74:121: E501 line too long (185 > 120 characters)
  • Line 154:121: E501 line too long (139 > 120 characters)
  • Line 187:121: E501 line too long (483 > 120 characters)
  • Line 207:121: E501 line too long (280 > 120 characters)
  • Line 209:121: E501 line too long (322 > 120 characters)
  • Line 211:121: E501 line too long (386 > 120 characters)
  • Line 228:121: E501 line too long (314 > 120 characters)
  • Line 267:121: E501 line too long (281 > 120 characters)
  • Line 269:121: E501 line too long (312 > 120 characters)
  • Line 272:121: E501 line too long (376 > 120 characters)
  • Line 274:121: E501 line too long (390 > 120 characters)
  • Line 276:121: E501 line too long (197 > 120 characters)
  • Line 314:121: E501 line too long (149 > 120 characters)
  • Line 334:121: E501 line too long (177 > 120 characters)
  • Line 337:121: E501 line too long (235 > 120 characters)
  • Line 393:121: E501 line too long (193 > 120 characters)
  • Line 408:121: E501 line too long (310 > 120 characters)
  • Line 424:121: E501 line too long (193 > 120 characters)
  • Line 451:121: E501 line too long (490 > 120 characters)
  • Line 453:121: E501 line too long (275 > 120 characters)
  • Line 455:121: E501 line too long (236 > 120 characters)
  • Line 457:121: E501 line too long (346 > 120 characters)
  • Line 467:121: E501 line too long (211 > 120 characters)
  • Line 469:121: E501 line too long (184 > 120 characters)
  • Line 497:121: E501 line too long (180 > 120 characters)
  • Line 498:121: E501 line too long (130 > 120 characters)
  • Line 499:121: E501 line too long (213 > 120 characters)
  • Line 502:121: E501 line too long (348 > 120 characters)
  • Line 609:121: E501 line too long (237 > 120 characters)
  • Line 611:121: E501 line too long (281 > 120 characters)
  • Line 613:121: E501 line too long (308 > 120 characters)
  • Line 675:121: E501 line too long (152 > 120 characters)
  • Line 686:121: E501 line too long (186 > 120 characters)
  • Line 688:121: E501 line too long (372 > 120 characters)
  • Line 716:121: E501 line too long (273 > 120 characters)
  • Line 732:121: E501 line too long (210 > 120 characters)
  • Line 746:121: E501 line too long (275 > 120 characters)
  • Line 777:121: E501 line too long (206 > 120 characters)
  • Line 784:1: E402 module level import not at top of file

Comment last updated at 2021-07-07 15:35:13 UTC

pep8speaks, Jul 05 '21 03:07

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


@awaelchli The example in lightning-tutorials/lightning_examples/distributed-training/main.ipynb is very detailed, and that's good. However, it doesn't show how a multi-node job should be executed (e.g. with mpirun). On a K8s cluster, one would typically gather the IP addresses of the pods of a deployment (dedicated to the intended multi-node training run) and set up passwordless SSH communication between the pods. How should PL be called so that this list of hostnames is available to it?
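
To make the question concrete, here is a minimal sketch of what such a launch could look like. It assumes PL picks up the rendezvous information from the environment variables `MASTER_ADDR`, `MASTER_PORT`, `NODE_RANK`, and `WORLD_SIZE`, which is how a general-purpose cluster launch is usually described. The helper name `node_env`, the pod IPs, the port, and the script name `train.py` are all hypothetical.

```python
from typing import Dict, List


def node_env(hostnames: List[str], node_rank: int, gpus_per_node: int, port: int = 29500) -> Dict[str, str]:
    """Build the rendezvous environment for one node (hypothetical helper, not part of PL)."""
    if not 0 <= node_rank < len(hostnames):
        raise ValueError("node_rank must index into hostnames")
    return {
        # Assumption: the first pod in the list acts as the rendezvous host.
        "MASTER_ADDR": hostnames[0],
        "MASTER_PORT": str(port),
        "NODE_RANK": str(node_rank),
        # Assumption: WORLD_SIZE counts every process, i.e. nodes * GPUs per node.
        "WORLD_SIZE": str(len(hostnames) * gpus_per_node),
    }


if __name__ == "__main__":
    pods = ["10.0.0.11", "10.0.0.12"]  # example pod IPs, made up for illustration
    for rank, host in enumerate(pods):
        env = node_env(pods, rank, gpus_per_node=4)
        exports = " ".join(f"{key}={value}" for key, value in env.items())
        # Only prints the command; a real launcher would run it over the passwordless SSH setup.
        print(f"ssh {host} '{exports} python train.py'")
```

Under these assumptions the same script would be started once per pod, with the Trainer's `num_nodes` set to the number of pods. Whether PL expects exactly this set of variables in a K8s setup is the open part of the question.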

mattiasmar, Oct 25 '21 17:10

However, it doesn't show how a multi-node job should be executed (e.g. with mpirun).

That is a limitation of our current CI/CD tooling: we run the notebooks on a single node with multiple GPUs :rabbit:

Borda, Dec 01 '21 21:12

Codecov Report

Merging #52 (c49baa7) into main (0b676fa) will not change coverage. The diff coverage is n/a.

Additional details and impacted files
@@        Coverage Diff         @@
##           main   #52   +/-   ##
==================================
  Coverage    73%   73%           
==================================
  Files         2     2           
  Lines       382   382           
==================================
  Hits        280   280           
  Misses      102   102           

codecov[bot], May 03 '23 17:05

@awaelchli is this outdated, or has it just been pending for a long time?

Borda, Oct 23 '23 06:10