Couldn't get torchx to work with python 3.12

Open henrylhtsang opened this issue 1 year ago • 3 comments

❓ Questions and Help

Question

Hi, I am trying to run a hello world example with torchx. I have hello_world.py, which just prints "Hello, TorchX!".

Then I run:

conda create -n debug_torchx python=3.12 -y
conda activate debug_torchx

pip install torch --index-url https://download.pytorch.org/whl/nightly/cu118
pip install numpy --index-url https://download.pytorch.org/whl/nightly/cu118

pip install torchx-nightly
torchx run -s local_cwd dist.ddp -j 1 --gpu 2 --script hello_world.py 

Then I got the following error:

$ torchx run -s local_cwd dist.ddp -j 1 --gpu 2 --script hello_world.py 
torchx 2023-12-09 06:08:53 INFO     Tracker configurations: {}
torchx 2023-12-09 06:08:53 INFO     Log directory not set in scheduler cfg. Creating a temporary log dir that will be deleted on exit. To preserve log directory set the `log_dir` cfg option
torchx 2023-12-09 06:08:53 INFO     Log directory is: /tmp/torchx_kv0nzij6
torchx 2023-12-09 06:08:53 WARNING  

======================================================================
Running multiple role replicas that require GPUs without
setting `CUDA_VISIBLE_DEVICES` may result in multiple
processes using the same GPU device with undesired consequences
such as CUDA OutOfMemory errors.

To have TorchX set `CUDA_VISIBLE_DEVICES` to divide the
available GPUs on this host equally among the role replicas
set the `auto_set_cuda_visible_devices = True` scheduler runopt
======================================================================
                            
local_cwd://torchx/hello_world-ddzkbddsjc096c
torchx 2023-12-09 06:08:53 INFO     Waiting for the app to finish...
hello_world/0 Fatal Python error: Segmentation fault
hello_world/0 
hello_world/0 Current thread 0x00007f5d7da73600 (most recent call first):
hello_world/0   File "/home/henrylhtsang/miniconda/envs/debug_torchx/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113 in _call_store
hello_world/0   File "/home/henrylhtsang/miniconda/envs/debug_torchx/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 64 in __init__
hello_world/0   File "/home/henrylhtsang/miniconda/envs/debug_torchx/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 253 in create_backend
hello_world/0   File "/home/henrylhtsang/miniconda/envs/debug_torchx/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 36 in _create_c10d_handler
hello_world/0   File "/home/henrylhtsang/miniconda/envs/debug_torchx/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/api.py", line 258 in create_handler
hello_world/0   File "/home/henrylhtsang/miniconda/envs/debug_torchx/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 66 in get_rendezvous_handler
hello_world/0   File "/home/henrylhtsang/miniconda/envs/debug_torchx/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 238 in launch_agent
hello_world/0   File "/home/henrylhtsang/miniconda/envs/debug_torchx/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 135 in __call__
hello_world/0   File "/home/henrylhtsang/miniconda/envs/debug_torchx/lib/python3.12/site-packages/torch/distributed/run.py", line 803 in run
hello_world/0   File "/home/henrylhtsang/miniconda/envs/debug_torchx/lib/python3.12/site-packages/torch/distributed/run.py", line 812 in main
hello_world/0   File "/home/henrylhtsang/miniconda/envs/debug_torchx/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347 in wrapper
hello_world/0   File "/home/henrylhtsang/miniconda/envs/debug_torchx/bin/torchrun", line 8 in <module>
hello_world/0 
hello_world/0 Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special (total: 20)
torchx 2023-12-09 06:08:56 INFO     Job finished: FAILED
torchx 2023-12-09 06:08:56 ERROR    AppStatus:
  msg: <NONE>
  num_restarts: 0
  roles: []
  state: FAILED (5)
  structured_error_msg: <NONE>
  ui_url: file:///tmp/torchx_kv0nzij6/torchx/hello_world-ddzkbddsjc096c

Any idea?

henrylhtsang, Dec 09 '23 06:12

Could you paste your hello_world.py here?

Also, does the problem reproduce if you run a simple script like the one below (without torchx)?

import torch

print(torch.__version__)

Save the above as test.py and run python test.py.
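If it helps, here is a slightly longer variant of the same check. It is only a sketch: the function name describe_environment is made up for illustration, and it assumes nothing beyond the standard library plus torch if it happens to be installed. It reports the Python version, the interpreter path, the torch version, the CUDA version torch was built against, and whether torch.distributed is available, all without touching torchx.

```python
import importlib.util
import platform
import sys


def describe_environment() -> dict:
    """Collect basic facts about the interpreter and the torch install.

    Hypothetical helper for illustration; not part of torch or torchx.
    """
    info = {"python": platform.python_version(), "executable": sys.executable}

    # Avoid an ImportError if torch is not installed in this environment.
    if importlib.util.find_spec("torch") is None:
        info["torch"] = None
        return info

    import torch
    import torch.distributed as dist

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # None on CPU-only builds
    info["distributed_available"] = dist.is_available()
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Pasting that output together with the traceback should make it easier to tell which interpreter and which torch build are actually being picked up.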

kiukchung, Dec 10 '23 18:12

@kiukchung Thanks for the quick reply!

cat hello_world.py

$ cat hello_world.py 
print("Hello, TorchX!")

cat test2.py

$ cat test2.py 
import torch

print(torch.__version__)

running test2.py

$ python test2.py 
2.2.0.dev20231208+cu118

python version

$ python --version
Python 3.12.0
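Since a plain import of torch works, the segfault in the traceback above (which ends in _call_store in c10d_rendezvous_backend.py) seems to come from the rendezvous store rather than from hello_world.py. Below is a rough sketch for checking that step without torchx or torchrun. It rests on several assumptions: that the store involved is a c10d TCPStore, that port 29400 is free, and that a compare_set call is representative of what the backend does. The helper name run_isolated is made up. The snippet runs in a child interpreter with faulthandler enabled, so a crash shows up as an exit code instead of taking down the calling process.

```python
import subprocess
import sys

# Assumption: exercising a TCPStore directly approximates what the c10d
# rendezvous backend does at the top of the traceback. The port number and
# key names are arbitrary.
STORE_SNIPPET = """
from datetime import timedelta
from torch.distributed import TCPStore

store = TCPStore("localhost", 29400, 1, True, timedelta(seconds=10))
print(store.compare_set("probe_key", "", "probe_value"))
"""


def run_isolated(snippet: str) -> int:
    """Run `snippet` in a child interpreter and return its exit code.

    Hypothetical helper for illustration. `-X faulthandler` makes the child
    dump a Python traceback if it receives a fatal signal.
    """
    proc = subprocess.run([sys.executable, "-X", "faulthandler", "-c", snippet])
    return proc.returncode


if __name__ == "__main__":
    code = run_isolated(STORE_SNIPPET)
    print("child exit code:", code)
```

On Linux, a negative exit code such as -11 would mean the child was killed by SIGSEGV, which would suggest the problem is reproducible with torch alone. An exit code of 0 would suggest looking elsewhere.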

EDIT: typo

henrylhtsang, Dec 10 '23 20:12

fyi https://github.com/pytorch/pytorch/issues/116423

henrylhtsang, Jan 02 '24 17:01