[docs] explain how to use `torchrun` in a SLURM environment

Open stas00 opened this issue 2 years ago • 4 comments

PL kept failing to bind to a port in a SLURM environment when I tried switching to `torchrun`.

I need the latter so that I can use the `--role $(hostname -s): --tee 3` flags, a crucial feature which I can't find in PL's other launchers.
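
For readers unfamiliar with these flags, here is a minimal sketch of what the launcher side of such a job might look like. It is an illustration under assumptions, not the workaround this PR documents: `train.py`, the port, the GPU count and the `#SBATCH` values are placeholders, and the Lightning-side change is not shown. `--role` sets a per-node log prefix and `--tee 3` mirrors both stdout and stderr of every worker to the console.

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1   # one torchrun launcher per node; torchrun spawns the per-GPU workers
#SBATCH --gres=gpu:8          # placeholder, cluster-specific

# Placeholder values: adjust to your cluster and training script.
GPUS_PER_NODE=${GPUS_PER_NODE:-8}
NNODES=${SLURM_NNODES:-1}
MASTER_PORT=${MASTER_PORT:-6000}

# Use the first host of the allocation as the rendezvous host;
# fall back to the local host when not running under SLURM.
if [ -n "${SLURM_JOB_NODELIST:-}" ] && command -v scontrol >/dev/null 2>&1; then
    MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
else
    MASTER_ADDR=$(hostname -s)
fi

# The escaped \$(hostname -s) is expanded later, on each node, so every
# node prefixes its log lines with its own short hostname.
LAUNCHER="torchrun \
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --rdzv_id ${SLURM_JOB_ID:-0} \
    --rdzv_backend c10d \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    --role \$(hostname -s): \
    --tee 3"
CMD="$LAUNCHER train.py"   # train.py is a hypothetical Lightning script

if command -v srun >/dev/null 2>&1; then
    srun bash -c "$CMD"    # one launcher per node
else
    echo "$CMD"            # outside SLURM: just print the command
fi
```

Submitted with `sbatch`, this starts one `torchrun` per node; run outside SLURM it only prints the command it would execute, which is handy for checking the quoting.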

So after reading the source code I found a way to hack around it, and I'm documenting the hack here so that other users can find it and save themselves the lost time.

Thank you.


📚 Documentation preview 📚: https://pytorch-lightning--18614.org.readthedocs.build/en/18614/

stas00 • Sep 22 '23 02:09

Hmm, I'd like to finish this PR but I'm now not sure: Should we add a note to the SLURM docs that running with torchrun is also possible, or no note since it's anyway supported now? For inexperienced users, it's usually best to only suggest one way of doing the thing. So if we see a need for it to be extensively documented with an example, I'd go for a separate (sub) page. Wdyt?

awaelchli • Oct 06 '23 00:10

It probably depends on what's the most recommended way of running things. And it's in flux most of the time as ML evolves.

Normally, `torchrun` (and previously `torch.distributed.launch`) was the way to run on more than one GPU. But now that frameworks like PTL and Accelerate provide their own launchers, it's hard to tell which way is the most common or the clearest to a beginner. I'd trust your judgment.

But as long as an advanced user can search PTL's docs website and find that torchrun is supported, that's probably good enough?

How do we tell if torchrun is more common?

stas00 • Oct 06 '23 01:10

> But as long as an advanced user can search PTL's docs website and find that torchrun is supported, that's probably good enough?

I would agree here :)

Borda • Nov 18 '23 08:11

⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
| GitGuardian id | Secret | Commit | Filename |
| --- | --- | --- | --- |
| - | Generic High Entropy Secret | 78fa3afdfbf964c19b4b2d36b91560698aa83178 | tests/tests_app/utilities/test_login.py |
| - | Base64 Basic Authentication | 78fa3afdfbf964c19b4b2d36b91560698aa83178 | tests/tests_app/utilities/test_login.py |
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

gitguardian[bot] • Jan 16 '24 09:01