TP SP examples improvement
Changes hard-coded cuda usage to the accelerator API and adds CommDebugMode, touching tensor_parallel_example.py, sequence_parallel_example.py, and log_utils.py.
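For reference, a rough sketch of both changes (assumes PyTorch >= 2.6 for the accelerator API; the model, mesh, and input below are placeholders, not the exact code in the examples):

```python
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.debug import CommDebugMode
from torch.distributed.tensor.parallel import ColwiseParallel, parallelize_module

# Device-agnostic replacement for the hard-coded "cuda" device type
# (assumes an accelerator is actually present).
device_type = torch.accelerator.current_accelerator().type
mesh = init_device_mesh(device_type, (torch.accelerator.device_count(),))

# Placeholder model; the real examples use a small toy MLP.
model = parallelize_module(nn.Linear(16, 16).to(device_type), mesh, ColwiseParallel())

# CommDebugMode records the collectives triggered under its context.
comm_mode = CommDebugMode()
with comm_mode:
    out = model(torch.rand(4, 16, device=device_type))
print(comm_mode.get_comm_counts())
```

Run under torchrun so the process group is initialized, e.g. `torchrun --nproc-per-node=<num devices> script.py`.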
Looks like the failing cuda test below (Run Distributed Examples / test (pull_request)) runs with a relatively old version of PyTorch (torch==2.4.0.dev20240605+cu118). The upcoming release is 2.8.
@msaroufim, is it possible to update the PyTorch version in CI to 2.8?
> is it possible to update the PyTorch version in CI to 2.8?

@githubsgi, we were updating the PyTorch version in other samples to be able to use the accelerator API, which was added in 2.6. However, the change needed is not in CI but in the requirements.txt file of the particular sample. In your case, torch==2.4.0.dev20240605+cu118 is getting installed due to these definitions in that file:
https://github.com/pytorch/examples/blob/d16c819b9681a561bbed867ebb7c013447a3e04d/distributed/tensor_parallelism/requirements.txt#L3-L6
I suggest dropping the --pre and --extra-index-url arguments and just specifying the required torch version, as it's unlikely this example needs anything specifically from those release channels. Please correct me if I am wrong here.
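For illustration, the trimmed requirements file could look roughly like this (exact pin is up to the maintainers; the accelerator API used in this PR needs torch >= 2.6):

```
# distributed/tensor_parallelism/requirements.txt (sketch)
torch>=2.6
```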
@msaroufim, is there anything more I need to do?
@msaroufim, the "Run Distributed Examples" check is failing due to the following:
Running example: distributed/ddp
/home/runner/work/examples/examples/distributed/ddp/.venv/lib/python3.8/site-packages/torch/_subclasses/functional_tensor.py:258: UserWarning: Failed to initialize NumPy: numpy.core.multiarray failed to import (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
cpu = _conversion_method_template(device=torch.device("cpu"))
Requires at least 8 GPUs to run, but got 1.
Some distributed examples failed:
distributed/tensor_parallelism: failed to install requirements
Error: Process completed with exit code 1.
Also, it looks like Python 3.8 is forced by https://github.com/pytorch/examples/blob/main/runtime.txt. That version is not in the recommended list at https://pytorch.org/get-started/locally/.
Can a maintainer please look at the "3 workflows awaiting approval"?
Hello maintainer (@msaroufim?), can you please look into the "3 workflows awaiting approval"?
Any chance we can switch to a runner with a few GPUs?
No, it's unlikely we'll get a runner with many GPUs on pytorch/examples. I'd suggest just ensuring CI passes and attaching logs as evidence that a fix works as expected.
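For the "Requires at least 8 GPUs to run" message above, one way to let CI pass on a single-GPU runner is a graceful skip at startup. A minimal sketch using the accelerator API (assumes PyTorch >= 2.6; the actual examples may phrase this check differently):

```python
import sys
import torch

MIN_DEVICES = 8  # hypothetical requirement of the example

num_devices = torch.accelerator.device_count() if torch.accelerator.is_available() else 0
if num_devices < MIN_DEVICES:
    print(f"Requires at least {MIN_DEVICES} GPUs to run, but got {num_devices}.")
    sys.exit(0)  # exit cleanly so CI does not mark the run as failed
```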
The last error appears to be the one below, about NumPy missing.
/home/runner/work/examples/examples/distributed/tensor_parallelism/.venv/lib/python3.9/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
@githubsgi: the numpy one is a warning, not an error. The error is above:
Traceback (most recent call last):
File "/home/runner/work/examples/examples/distributed/tensor_parallelism/.venv/bin/torchrun", line 4, in <module>
from torch.distributed.run import main
File "/home/runner/work/examples/examples/distributed/tensor_parallelism/.venv/lib/python3.9/site-packages/torch/distributed/run.py", line 381, in <module>
from torch.distributed.elastic.rendezvous.utils import _parse_rendezvous_config
File "/home/runner/work/examples/examples/distributed/tensor_parallelism/.venv/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/__init__.py", line 142, in <module>
from .registry import _register_default_handlers, _register_out_of_tree_handlers
File "/home/runner/work/examples/examples/distributed/tensor_parallelism/.venv/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 19, in <module>
from importlib_metadata import entry_points
ModuleNotFoundError: No module named 'importlib_metadata'
That too. It is probably missing many other packages. Did this CI ever work?
Yes, it did pass on the previous Python version. Python in general is not bulletproof when a project spans multiple Python versions: some dependencies might be needed in one version and not in another. That seems to be the case here: https://stackoverflow.com/questions/73165636/no-module-named-importlib-metadata. Let's try to solve this on Monday. One thing to consider is to fully align the distributed workflow with the non-distributed one, which is known to pass and actually exercises PyTorch rather than just installing dependencies.
@githubsgi, I tried locally. With Python 3.9 you additionally need to install the importlib-metadata package. However, since the non-distributed examples already use Python 3.10, it makes sense to update the distributed workflow to 3.10 as well; with that Python version you don't need to install this package explicitly.
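For context, the failing import typically follows the usual version-gated pattern, roughly like this (a sketch of the pattern, not necessarily torch's exact code):

```python
import sys

if sys.version_info >= (3, 10):
    # Python 3.10+ ships the needed entry_points API in the standard library.
    from importlib.metadata import entry_points
else:
    # Older interpreters fall back to the separately installed backport,
    # which is why Python 3.9 fails unless importlib-metadata is pip-installed.
    from importlib_metadata import entry_points

print(list(entry_points(group="console_scripts")))
```

Moving the workflow to Python 3.10 takes the backport branch out of the picture entirely.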
@msaroufim, could we try the workflow one more time?
@msaroufim, whenever you get a chance, please kick off another workflow.
@msaroufim, looks like all CI tests passed. Please merge it when you get a chance.