
TP SP examples improvement

Open githubsgi opened this issue 6 months ago • 5 comments

Changes hard-coded CUDA usage to the device-agnostic accelerator API and adds CommDebugMode to tensor_parallel_example.py, sequence_parallel_example.py, and log_utils.py.
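
For context, here is a minimal sketch (not the exact diff in this PR) of the two changes, assuming PyTorch >= 2.6 for torch.accelerator and the public torch.distributed.tensor.debug location of CommDebugMode:

import torch
from torch.distributed.tensor.debug import CommDebugMode

# Device-agnostic replacement for a hard-coded "cuda" string.
device_type = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"

# CommDebugMode counts the collectives issued inside the block
# (zero here, since this toy module is not sharded with DTensor).
model = torch.nn.Linear(8, 8).to(device_type)
inp = torch.randn(4, 8, device=device_type)
comm_mode = CommDebugMode()
with comm_mode:
    model(inp)
print(comm_mode.get_total_counts())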

githubsgi avatar Jun 11 '25 23:06 githubsgi

Looks like the failing CUDA test below (Run Distributed Examples / test (pull_request)) is run with a relatively old version of PyTorch (torch==2.4.0.dev20240605+cu118). The upcoming release is 2.8.

githubsgi avatar Jun 25 '25 06:06 githubsgi

@msaroufim, is it possible to update the PyTorch version in CI to 2.8?

githubsgi avatar Jun 26 '25 17:06 githubsgi

is it possible to update the PyTorch version in CI to 2.8?

@githubsgi, we have been updating the PyTorch version in other samples to be able to use the accelerator API, which was added in 2.6. However, the change needed is not in CI but in the requirements.txt file of the particular sample. In your case torch==2.4.0.dev20240605+cu118 is getting installed due to these definitions in that file:

https://github.com/pytorch/examples/blob/d16c819b9681a561bbed867ebb7c013447a3e04d/distributed/tensor_parallelism/requirements.txt#L3-L6

I suggest considering dropping the --pre and --extra-index-url arguments and just specifying the required torch version, as it's unlikely this example needs something specifically from those release channels. Please correct me if I am wrong here.
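
For illustration, a sketch of the reduced file, assuming torch>=2.6 as the floor since that is when the accelerator API landed:

# distributed/tensor_parallelism/requirements.txt (sketch)
torch>=2.6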

dvrogozh avatar Jun 26 '25 18:06 dvrogozh

@msaroufim, is there anything more I need to do?

githubsgi avatar Jun 27 '25 21:06 githubsgi

@msaroufim, the "Run Distributed Examples" check is failing with the following:

Running example: distributed/ddp
/home/runner/work/examples/examples/distributed/ddp/.venv/lib/python3.8/site-packages/torch/_subclasses/functional_tensor.py:258: UserWarning: Failed to initialize NumPy: numpy.core.multiarray failed to import (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
Requires at least 8 GPUs to run, but got 1.
Some distributed examples failed:
distributed/tensor_parallelism: failed to install requirements
Error: Process completed with exit code 1.

It also looks like Python 3.8 is forced by https://github.com/pytorch/examples/blob/main/runtime.txt . That version is not in the recommended list at https://pytorch.org/get-started/locally/

githubsgi avatar Jun 30 '25 17:06 githubsgi

Can a maintainer please look at "3 workflows awaiting approval"?

githubsgi avatar Jul 01 '25 18:07 githubsgi

Hello maintainer (@msaroufim?), can you please look into "3 workflows awaiting approval"?

githubsgi avatar Jul 02 '25 21:07 githubsgi

Any chance we can switch to a runner with a few GPUs?

No, it's unlikely we'll get a runner with many GPUs on pytorch/examples. I'd suggest just ensuring CI passes and attaching logs as evidence that the fix works as expected.

msaroufim avatar Jul 03 '25 07:07 msaroufim

The last error appears to be the one below: NumPy missing.


/home/runner/work/examples/examples/distributed/tensor_parallelism/.venv/lib/python3.9/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)

githubsgi avatar Jul 03 '25 22:07 githubsgi

@githubsgi: the numpy one is a warning, not an error. The error is above:

Traceback (most recent call last):
  File "/home/runner/work/examples/examples/distributed/tensor_parallelism/.venv/bin/torchrun", line 4, in <module>
    from torch.distributed.run import main
  File "/home/runner/work/examples/examples/distributed/tensor_parallelism/.venv/lib/python3.9/site-packages/torch/distributed/run.py", line 381, in <module>
    from torch.distributed.elastic.rendezvous.utils import _parse_rendezvous_config
  File "/home/runner/work/examples/examples/distributed/tensor_parallelism/.venv/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/__init__.py", line 142, in <module>
    from .registry import _register_default_handlers, _register_out_of_tree_handlers
  File "/home/runner/work/examples/examples/distributed/tensor_parallelism/.venv/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 19, in <module>
    from importlib_metadata import entry_points
ModuleNotFoundError: No module named 'importlib_metadata'

dvrogozh avatar Jul 03 '25 22:07 dvrogozh

That too. It is probably missing many other packages. Did this CI ever work?

githubsgi avatar Jul 03 '25 23:07 githubsgi

Yes, it did pass on the previous Python version. Python in general is not bulletproof when you maintain a project across multiple Python versions: some dependencies might be needed in one version and not in another. That seems to be the case here: https://stackoverflow.com/questions/73165636/no-module-named-importlib-metadata. Let's try to solve this on Monday. One thing to consider is to fully align the distributed workflow with the non-distributed one, which is known to pass and actually exercises PyTorch rather than just installing dependencies.
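
A hedged sketch of the compatibility pattern described in that StackOverflow thread (the stdlib importlib.metadata is enough on newer interpreters, while older ones need the importlib_metadata backport package):

import sys

if sys.version_info >= (3, 10):
    # the stdlib entry_points() accepts keyword selection from 3.10 on
    from importlib.metadata import entry_points
else:
    # older interpreters need `pip install importlib-metadata`
    from importlib_metadata import entry_points

print(entry_points(group="console_scripts"))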

dvrogozh avatar Jul 04 '25 17:07 dvrogozh

@githubsgi, I tried locally. With Python 3.9 you additionally need to install the importlib-metadata package. However, since the non-distributed examples are already using Python 3.10, it makes sense to just update the distributed workflow to use 3.10 as well. With that Python version you don't need to explicitly install this package.
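
If keeping Python 3.9 were preferred instead, a hedged alternative would be an environment marker in the sample's requirements.txt so the backport is only pulled in where it is actually needed, for example:

importlib-metadata; python_version < "3.10"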

dvrogozh avatar Jul 07 '25 21:07 dvrogozh

@msaroufim, could we try the workflow one more time?

githubsgi avatar Jul 08 '25 02:07 githubsgi

@msaroufim, whenever you get a chance, please kick off another workflow.

githubsgi avatar Jul 09 '25 21:07 githubsgi

@msaroufim, looks like all CI tests passed. Please merge it when you get a chance.

githubsgi avatar Jul 10 '25 06:07 githubsgi