pytorch-lightning
Azure OpenMPI Environment
What does this PR do?
- Adds a new `AzureOpenMPIEnvironment` cluster environment (#14014). A rough sketch of the interface such an environment implements is included below for context.
- The added local tests for `AzureOpenMPIEnvironment` pass, but it has not yet been tested on Azure to confirm that the environment is properly detected.
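For context, the sketch below shows roughly the shape of a cluster environment built on the OpenMPI/Azure Batch variables discussed later in this thread. It is an illustrative sketch of Lightning's `ClusterEnvironment` interface, not the code in this PR's diff; the `devices` constructor argument mirrors the `AzureOpenMPIEnvironment(devices=...)` usage shown further down.

```python
# Illustrative sketch only -- the real implementation is in this PR's diff.
# It assumes the standard OMPI_COMM_WORLD_* variables plus AZ_BATCH_MASTER_NODE
# ("<address>:<port>"), which Azure only sets for multi-node jobs.
import os

from pytorch_lightning.plugins.environments import ClusterEnvironment


class AzureOpenMPIEnvironmentSketch(ClusterEnvironment):
    def __init__(self, devices: int = 1) -> None:
        super().__init__()
        self.devices = devices

    @property
    def creates_processes_externally(self) -> bool:
        return True  # processes are launched by mpirun, not by Lightning

    @property
    def main_address(self) -> str:
        return os.environ["AZ_BATCH_MASTER_NODE"].split(":")[0]

    @property
    def main_port(self) -> int:
        return int(os.environ["AZ_BATCH_MASTER_NODE"].split(":")[1])

    def world_size(self) -> int:
        return int(os.environ["OMPI_COMM_WORLD_SIZE"])

    def global_rank(self) -> int:
        return int(os.environ["OMPI_COMM_WORLD_RANK"])

    def local_rank(self) -> int:
        return int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])

    def node_rank(self) -> int:
        # node index derived from the per-node device count passed in
        return self.global_rank() // self.devices

    def set_world_size(self, size: int) -> None:
        pass  # determined by mpirun

    def set_global_rank(self, rank: int) -> None:
        pass  # determined by mpirun

    @staticmethod
    def detect() -> bool:
        return "OMPI_COMM_WORLD_SIZE" in os.environ and "AZ_BATCH_MASTER_NODE" in os.environ
```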
Before submitting
- [x] Was this discussed/approved via a GitHub issue? (not for typos and docs)
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Did you make sure your PR does only one thing, instead of bundling different changes together?
- [ ] Did you make sure to update the documentation with your changes? (if necessary)
- [ ] Did you write any new necessary tests? (not for typos and docs)
- [ ] Did you verify new and existing tests pass locally with your changes?
- [ ] Did you list all the breaking changes introduced by this pull request?
- [ ] Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)
PR review
Anyone in the community is welcome to review the PR. Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
- [ ] Is this pull request ready for review? (if not, please submit in draft mode)
- [ ] Check that all items from Before submitting are resolved
- [ ] Make sure the title is self-explanatory and the description concisely explains the PR
- [ ] Add labels and milestones (and optionally projects) to the PR so it can be classified
@awaelchli
@awaelchli I've been testing this on Azure and the new `AzureOpenMPIEnvironment` is detected and behaves as expected, with the exception of single-node multi-GPU setups. For some reason, when testing with 1 node and 2 GPUs I get errors related to the master port regardless of what value I set it to. I tried setting the master port statically, either in the code or via the `MASTER_PORT` environment variable, and I also tried the `find_free_network_port()` function from the lightning environment.
Azure does not set the `AZ_BATCH_MASTER_NODE` variable in a single-node setting, so you have to find another way to set the main port. This is what the error messages look like:
Driver 0:
```
[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:6105 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/miniconda/envs/ptldev/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/opt/miniconda/envs/ptldev/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 129, in _wrapping_function
    self._strategy._worker_setup(process_idx)
  File "/opt/miniconda/envs/ptldev/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 181, in _worker_setup
    init_dist_connection(
  File "/opt/miniconda/envs/ptldev/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 374, in init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/opt/miniconda/envs/ptldev/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/miniconda/envs/ptldev/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 232, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/opt/miniconda/envs/ptldev/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 160, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:6105 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
```
Driver 1:
```
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/miniconda/envs/ptldev/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/opt/miniconda/envs/ptldev/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 130, in _wrapping_function
    results = function(*args, **kwargs)
  File "/opt/miniconda/envs/ptldev/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/miniconda/envs/ptldev/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1118, in _run
    self.__setup_profiler()
  File "/opt/miniconda/envs/ptldev/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1745, in __setup_profiler
    self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/opt/miniconda/envs/ptldev/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 2218, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/opt/miniconda/envs/ptldev/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 245, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/opt/miniconda/envs/ptldev/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1869, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/opt/miniconda/envs/ptldev/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1187, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Broken pipe
```
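One quick way to rule out a genuine port collision behind the `errno: 98` failures above is to probe the port directly on the node before launching; this is a minimal diagnostic sketch, not part of the PR or of the script below:

```python
# Minimal diagnostic sketch (not part of the PR): check whether a TCP port is
# already bound on this node, to rule out a real collision behind the
# "errno: 98 - Address already in use" errors above.
import socket


def port_in_use(port: int, host: str = "") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return True
        return False


print(f"port 6105 in use: {port_in_use(6105)}")
```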
This may be a broader issue, as I get these errors whether I use an `MpiConfiguration` or a `PyTorchConfiguration` and whether I use this new Azure environment or not. Here is my code for reference:
```python
import torch, os, sys
from torch.utils.data import DataLoader, Dataset
#from deepspeed.ops.adam import FusedAdam
#from azureml.core import Run, Workspace
from pytorch_lightning import LightningModule, Trainer, LightningDataModule, seed_everything
from pytorch_lightning.loggers import MLFlowLogger
from pytorch_lightning.plugins.environments import AzureOpenMPIEnvironment
#from pytorch_lightning.strategies import DeepSpeedStrategy
from argparse import ArgumentParser

device = 'gpu' if torch.cuda.is_available() else 'cpu'
print(f"device = {device}")

divider_str = "-" * 40


def get_env_display_text(var_name):
    var_value = os.environ.get(var_name, "")
    return f"{var_name} = {var_value}"


def display_environment(header='Environmental variables'):
    """
    Print a few environment variables of note
    """
    variable_names = [
        "PL_GLOBAL_SEED",
        "PL_SEED_WORKERS",
        "AZ_BATCH_MASTER_NODE",
        "AZ_BATCHAI_MPI_MASTER_NODE",
        "MASTER_ADDR",
        "MASTER_ADDRESS",
        "MASTER_PORT",
        "RANK",
        "NODE_RANK",
        "LOCAL_RANK",
        "GLOBAL_RANK",
        "WORLD_SIZE",
        "NCCL_SOCKET_IFNAME",
        "OMPI_COMM_WORLD_RANK",
        "OMPI_COMM_WORLD_LOCAL_RANK",
        "OMPI_COMM_WORLD_SIZE",
        "OMPI_COMM_WORLD_LOCAL_SIZE",
        "OMPI_COMM_WORLD_NODE_RANK",
        "OMPI_UNIVERSE_SIZE",
    ]
    var_text = "\n".join([get_env_display_text(var) for var in variable_names])
    print(f"\n{header}:\n{divider_str}\n{var_text}\n{divider_str}\n")


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.save_hyperparameters()
        self.model = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss, on_step=True, on_epoch=True)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.model.parameters())
        #return FusedAdam(self.model.parameters())

    def setup(self, stage=None) -> None:
        # prevents hanging
        if stage != "fit":
            return


class DataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.num_workers = os.cpu_count()
        print(f"num_workers set to {self.num_workers}")

    def setup(self, stage=None) -> None:
        self._dataloader = DataLoader(
            RandomDataset(32, 64),
            num_workers=self.num_workers,
            batch_size=1,
            pin_memory=True,
        )

    def train_dataloader(self):
        return self._dataloader

    def test_dataloader(self):
        return self._dataloader

    def val_dataloader(self):
        return self._dataloader


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--num_nodes", default=1, type=int)
    parser.add_argument("--devices", default=1, type=int, help='Number of devices per node')
    args = parser.parse_args()

    seed_everything(102938, workers=True)
    model = BoringModel()
    dm = DataModule()

    #os.environ['MASTER_PORT'] = "6105"
    display_environment("__main__")

    trainer = Trainer(
        num_nodes=args.num_nodes,
        accelerator=device,
        devices=args.devices,
        precision=16,
        limit_train_batches=2,
        limit_val_batches=2,
        log_every_n_steps=1,
        logger=False,
        enable_checkpointing=False,
        num_sanity_val_steps=0,
        max_epochs=2,
        enable_model_summary=False,
        #strategy="deepspeed_stage_3",
        #plugins=[AzureOpenMPIEnvironment(devices=args.devices)],
        # strategy=DDPStrategy(
        #     cluster_environment=AzureOpenMPIEnvironment(devices=args.devices)
        # ),
        # strategy=DeepSpeedStrategy(
        #     stage=3,
        #     cluster_environment=AzureOpenMPIEnvironment(devices=args.devices)
        # ),
    )

    # extract cluster environment
    trainer_cluster_environment = trainer._accelerator_connector.cluster_environment
    print(f"trainer cluster environment: {trainer_cluster_environment}")
    print(f"Was Azure OpenMPI environment used? {type(trainer_cluster_environment) == AzureOpenMPIEnvironment}")

    trainer.fit(model, datamodule=dm)

    print(f"""trainer.local_rank: {trainer.local_rank}
    trainer.global_rank : {trainer.global_rank}
    trainer.world_size : {trainer.world_size}
    """)
```
Environment:
* CUDA:
- GPU:
- Tesla K80
- Tesla K80
- Tesla K80
- Tesla K80
- available: True
- version: 11.3
* Lightning:
- lightning: 2022.8.18
- lightning-cloud: 0.5.3
- torch: 1.11.0
- torchaudio: 0.11.0
- torchmetrics: 0.9.3
- torchvision: 0.12.0
* Packages:
- absl-py: 1.2.0
- accelerate: 0.12.0
- adal: 1.2.7
- aiobotocore: 2.3.4
- aiohttp: 3.8.1
- aioitertools: 0.10.0
- aiosignal: 1.2.0
- anyio: 3.6.1
- argcomplete: 2.0.0
- asgiref: 3.5.2
- async-timeout: 4.0.2
- attrs: 22.1.0
- azure-common: 1.1.28
- azure-core: 1.25.0
- azure-graphrbac: 0.61.1
- azure-identity: 1.10.0
- azure-mgmt-authorization: 2.0.0
- azure-mgmt-containerregistry: 10.0.0
- azure-mgmt-core: 1.3.0
- azure-mgmt-keyvault: 10.1.0
- azure-mgmt-resource: 21.1.0
- azure-mgmt-storage: 20.0.0
- azure-storage-blob: 12.9.0
- azureml-core: 1.44.0
- azureml-dataprep: 4.2.2
- azureml-dataprep-native: 38.0.0
- azureml-dataprep-rslex: 2.8.1
- azureml-dataset-runtime: 1.44.0
- azureml-defaults: 1.44.0
- azureml-inference-server-http: 0.7.4
- azureml-mlflow: 1.44.0
- backports.tempfile: 1.0
- backports.weakref: 1.0.post1
- bcrypt: 3.2.2
- botocore: 1.24.21
- brotlipy: 0.7.0
- cachetools: 5.2.0
- certifi: 2022.6.15
- cffi: 1.15.0
- charset-normalizer: 2.0.4
- click: 8.1.3
- cloudpickle: 2.1.0
- commonmark: 0.9.1
- configparser: 3.7.4
- contextlib2: 21.6.0
- croniter: 1.3.5
- cryptography: 37.0.1
- databricks-cli: 0.17.1
- datasets: 2.4.0
- deepdiff: 5.8.1
- deepspeed: 0.7.0
- dill: 0.3.5.1
- distro: 1.7.0
- dnspython: 2.2.1
- docker: 5.0.3
- dotnetcore2: 3.1.23
- email-validator: 1.2.1
- entrypoints: 0.4
- fastapi: 0.79.0
- filelock: 3.8.0
- flask: 2.1.3
- flask-cors: 3.0.10
- frozenlist: 1.3.1
- fsspec: 2022.7.1
- fusepy: 3.0.1
- gitdb: 4.0.9
- gitpython: 3.1.27
- google-api-core: 2.8.2
- google-auth: 2.10.0
- google-auth-oauthlib: 0.4.6
- googleapis-common-protos: 1.56.4
- grpcio: 1.47.0
- gunicorn: 20.1.0
- h11: 0.13.0
- hjson: 3.1.0
- httptools: 0.4.0
- huggingface-hub: 0.8.1
- humanfriendly: 10.0
- idna: 3.3
- importlib-metadata: 4.12.0
- importlib-resources: 5.9.0
- inference-schema: 1.4.2
- isodate: 0.6.1
- itsdangerous: 2.1.2
- jeepney: 0.8.0
- jinja2: 3.1.2
- jmespath: 1.0.0
- joblib: 1.1.0
- json-logging-py: 0.2
- jsonpickle: 2.2.0
- jsonschema: 4.12.1
- knack: 0.9.0
- lightning: 2022.8.18
- lightning-cloud: 0.5.3
- markdown: 3.4.1
- markupsafe: 2.1.1
- mkl-fft: 1.3.1
- mkl-random: 1.2.2
- mkl-service: 2.4.0
- mlflow-skinny: 1.28.0
- msal: 1.18.0
- msal-extensions: 1.0.0
- msrest: 0.7.1
- msrestazure: 0.6.4
- multidict: 6.0.2
- multiprocess: 0.70.13
- ndg-httpsclient: 0.5.1
- ninja: 1.10.2.3
- numpy: 1.22.3
- oauthlib: 3.2.0
- opencensus: 0.11.0
- opencensus-context: 0.1.3
- opencensus-ext-azure: 1.1.6
- ordered-set: 4.1.0
- orjson: 3.7.12
- packaging: 21.3
- pandas: 1.4.3
- paramiko: 2.11.0
- pathspec: 0.9.0
- pillow: 9.0.1
- pip: 20.0.2
- pkginfo: 1.8.3
- pkgutil-resolve-name: 1.3.10
- portalocker: 2.5.1
- protobuf: 3.19.4
- psutil: 5.9.1
- py-cpuinfo: 8.0.0
- pyarrow: 9.0.0
- pyasn1: 0.4.8
- pyasn1-modules: 0.2.8
- pycparser: 2.21
- pydantic: 1.9.2
- pydeprecate: 0.3.2
- pygments: 2.13.0
- pyjwt: 2.4.0
- pynacl: 1.5.0
- pyopenssl: 22.0.0
- pyparsing: 3.0.9
- pyrsistent: 0.18.1
- pysocks: 1.7.1
- python-dateutil: 2.8.2
- python-dotenv: 0.20.0
- python-multipart: 0.0.5
- pytz: 2022.2.1
- pyyaml: 6.0
- requests: 2.27.1
- requests-oauthlib: 1.3.1
- responses: 0.18.0
- rich: 12.5.1
- rsa: 4.9
- s3fs: 2022.7.1
- scikit-learn: 1.1.2
- scipy: 1.9.0
- secretstorage: 3.3.3
- sentencepiece: 0.1.97
- setuptools: 61.2.0
- six: 1.16.0
- sklearn: 0.0
- smmap: 5.0.0
- sniffio: 1.2.0
- sqlparse: 0.4.2
- starlette: 0.20.4
- starsessions: 1.3.0
- tabulate: 0.8.10
- tensorboard: 2.10.0
- tensorboard-data-server: 0.6.1
- tensorboard-plugin-wit: 1.8.1
- threadpoolctl: 3.1.0
- torch: 1.11.0
- torchaudio: 0.11.0
- torchmetrics: 0.9.3
- torchvision: 0.12.0
- tqdm: 4.64.0
- traitlets: 5.3.0
- typing-extensions: 4.1.1
- ujson: 5.4.0
- urllib3: 1.26.9
- uvicorn: 0.17.6
- uvloop: 0.16.0
- watchgod: 0.8.2
- websocket-client: 1.3.3
- websockets: 10.3
- werkzeug: 2.2.2
- wheel: 0.37.1
- wrapt: 1.14.1
- xxhash: 3.0.0
- yarl: 1.8.1
- zipp: 3.8.1
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.8.13
- version: #38-Ubuntu SMP Sun Mar 22 21:27:21 UTC 2020
Building from the base docker image: `mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.3-cudnn8-ubuntu20.04`
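For completeness, a run like the one above is typically submitted with the AzureML SDK v1 along the following lines; this is a rough sketch, and the script name, compute target, and environment name are placeholders rather than values taken from this thread:

```python
# Rough submission sketch using the AzureML SDK v1. All names here
# (train.py, gpu-cluster, ptl-env, the experiment name) are placeholders.
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace
from azureml.core.runconfig import MpiConfiguration

ws = Workspace.from_config()

# 1 node x 2 processes matches the failing single-node multi-GPU case above.
distr_config = MpiConfiguration(process_count_per_node=2, node_count=1)

run_config = ScriptRunConfig(
    source_directory=".",
    script="train.py",
    arguments=["--num_nodes", 1, "--devices", 2],
    compute_target="gpu-cluster",
    environment=Environment.get(ws, name="ptl-env"),
    distributed_job_config=distr_config,
)

Experiment(ws, "azure-openmpi-env-test").submit(run_config)
```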
@jessecambon Thanks for investigating this. It's unfortunate that the edge case is single-node behavior. Does your Azure OpenMPI environment still get selected correctly when running on a single node?
@awaelchli could we port this to lite?
We added an environment to handle MPI here: #16570. The Azure version could rely on this as well.
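If the Azure version is rebuilt on top of that, a rough sketch might only need to override the rendezvous address and port, assuming the `MPIEnvironment` from #16570 is importable from `lightning.fabric.plugins.environments` and exposes `main_address`/`main_port` as overridable properties (neither assumption is verified against the actual PRs):

```python
# Rough sketch, not actual code from either PR: reuse the MPI environment's
# rank/world-size handling and only override the rendezvous address/port
# from AZ_BATCH_MASTER_NODE ("<address>:<port>"), which Azure sets for
# multi-node jobs.
import os

from lightning.fabric.plugins.environments import MPIEnvironment


class AzureMPIEnvironmentSketch(MPIEnvironment):
    @property
    def main_address(self) -> str:
        master_node = os.environ.get("AZ_BATCH_MASTER_NODE")
        if master_node:
            return master_node.split(":")[0]
        return super().main_address  # single-node fallback

    @property
    def main_port(self) -> int:
        master_node = os.environ.get("AZ_BATCH_MASTER_NODE")
        if master_node and ":" in master_node:
            return int(master_node.split(":")[1])
        return super().main_port  # single-node fallback
```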
⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.
Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.
🔎 Detected hardcoded secrets in your pull request
| GitGuardian id | Secret | Commit | Filename |
|---|---|---|---|
| - | Generic High Entropy Secret | 78fa3afdfbf964c19b4b2d36b91560698aa83178 | tests/tests_app/utilities/test_login.py |
| - | Base64 Basic Authentication | 78fa3afdfbf964c19b4b2d36b91560698aa83178 | tests/tests_app/utilities/test_login.py |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace the secret and store it safely, following best practices for secret management.
- Revoke and rotate this secret.
- If possible, rewrite git history. Rewriting git history is not a trivial act: you might completely break other contributing developers' workflows, and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following these best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation.
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.