Couldn't get torchx to work with python 3.12
❓ Questions and Help
Question
Hi, I am trying to run a hello-world example with torchx. I have hello_world.py, which just prints "hello".
Then I run
conda create -n debug_torchx python=3.12 -y
conda activate debug_torchx
pip install torch --index-url https://download.pytorch.org/whl/nightly/cu118
pip install numpy --index-url https://download.pytorch.org/whl/nightly/cu118
pip install torchx-nightly
torchx run -s local_cwd dist.ddp -j 1 --gpu 2 --script hello_world.py
Then I got the following error:
$ torchx run -s local_cwd dist.ddp -j 1 --gpu 2 --script hello_world.py
torchx 2023-12-09 06:08:53 INFO Tracker configurations: {}
torchx 2023-12-09 06:08:53 INFO Log directory not set in scheduler cfg. Creating a temporary log dir that will be deleted on exit. To preserve log directory set the `log_dir` cfg option
torchx 2023-12-09 06:08:53 INFO Log directory is: /tmp/torchx_kv0nzij6
torchx 2023-12-09 06:08:53 WARNING
======================================================================
Running multiple role replicas that require GPUs without
setting `CUDA_VISIBLE_DEVICES` may result in multiple
processes using the same GPU device with undesired consequences
such as CUDA OutOfMemory errors.
To have TorchX set `CUDA_VISIBLE_DEVICES` to divide the
available GPUs on this host equally among the role replicas
set the `auto_set_cuda_visible_devices = True` scheduler runopt
======================================================================
local_cwd://torchx/hello_world-ddzkbddsjc096c
torchx 2023-12-09 06:08:53 INFO Waiting for the app to finish...
hello_world/0 Fatal Python error: Segmentation fault
hello_world/0
hello_world/0 Current thread 0x00007f5d7da73600 (most recent call first):
hello_world/0 File "/home/henrylhtsang/miniconda/envs/debug_torchx/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113 in _call_store
hello_world/0 File "/home/henrylhtsang/miniconda/envs/debug_torchx/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 64 in __init__
hello_world/0 File "/home/henrylhtsang/miniconda/envs/debug_torchx/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 253 in create_backend
hello_world/0 File "/home/henrylhtsang/miniconda/envs/debug_torchx/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 36 in _create_c10d_handler
hello_world/0 File "/home/henrylhtsang/miniconda/envs/debug_torchx/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/api.py", line 258 in create_handler
hello_world/0 File "/home/henrylhtsang/miniconda/envs/debug_torchx/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 66 in get_rendezvous_handler
hello_world/0 File "/home/henrylhtsang/miniconda/envs/debug_torchx/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 238 in launch_agent
hello_world/0 File "/home/henrylhtsang/miniconda/envs/debug_torchx/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 135 in __call__
hello_world/0 File "/home/henrylhtsang/miniconda/envs/debug_torchx/lib/python3.12/site-packages/torch/distributed/run.py", line 803 in run
hello_world/0 File "/home/henrylhtsang/miniconda/envs/debug_torchx/lib/python3.12/site-packages/torch/distributed/run.py", line 812 in main
hello_world/0 File "/home/henrylhtsang/miniconda/envs/debug_torchx/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347 in wrapper
hello_world/0 File "/home/henrylhtsang/miniconda/envs/debug_torchx/bin/torchrun", line 8 in <module>
hello_world/0
hello_world/0 Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special (total: 20)
torchx 2023-12-09 06:08:56 INFO Job finished: FAILED
torchx 2023-12-09 06:08:56 ERROR AppStatus:
msg: <NONE>
num_restarts: 0
roles: []
state: FAILED (5)
structured_error_msg: <NONE>
ui_url: file:///tmp/torchx_kv0nzij6/torchx/hello_world-ddzkbddsjc096c
Any idea?
Could you paste your hello_world.py here?
Also, does the problem reproduce if you run a simple script like the one below (without torchx)?
import torch
print(torch.__version__)
Save the above as test.py and run python test.py.
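If it helps to tell a segfault apart from an ordinary Python error when running such a script, here is a minimal sketch using only the standard library. The helper name classify_run and the file name classify_run.py are made up for illustration, and the signal detection assumes a POSIX system, where a negative return code means the child was killed by a signal.

```python
import signal
import subprocess
import sys


def classify_run(cmd):
    """Run cmd and describe how it exited (hypothetical helper, POSIX semantics)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # On POSIX, subprocess reports death-by-signal as -signum.
        try:
            name = signal.Signals(-proc.returncode).name
        except ValueError:
            name = f"signal {-proc.returncode}"
        return f"killed by {name}"
    return f"exited with code {proc.returncode}"


if __name__ == "__main__":
    # Usage (assumed file name): python classify_run.py test.py
    script = sys.argv[1] if len(sys.argv) > 1 else "test.py"
    # -X faulthandler makes the child dump a traceback if it crashes.
    print(classify_run([sys.executable, "-X", "faulthandler", script]))
```

A result of "killed by SIGSEGV" for the plain import script would suggest the crash is in the torch build itself rather than in torchx.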
@kiukchung Thanks for the quick reply!
cat hello_world.py
$ cat hello_world.py
print("Hello, TorchX!")
cat test2.py
$ cat test2.py
import torch
print(torch.__version__)
running test2.py
$ python test2.py
2.2.0.dev20231208+cu118
python version
$ python --version
Python 3.12.0
EDIT: typo
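For anyone else reproducing this, a small sketch that collects the same version details in one go may be handy. It assumes the distribution names match what was passed to pip above (torch, numpy, torchx-nightly); the function name report_versions is made up for illustration.

```python
import platform
from importlib import metadata


def report_versions(packages=("torch", "numpy", "torchx-nightly")):
    """Return the Python version plus installed versions of the given distributions."""
    info = {"python": platform.python_version()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Distribution is not installed in this environment.
            info[name] = None
    return info


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

This reads package metadata only and does not import torch, so it should still work in an environment where importing or running torch crashes.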
fyi https://github.com/pytorch/pytorch/issues/116423