pytorch
Torchrun seems to have a problem with virtual environments
Issue description
Use torchrun (inside a virtual environment) to launch a Python script. The script cannot import modules installed in that virtual environment. Launching with torch.distributed.launch instead works well, but that method has been deprecated.
Code example
- Create a venv, activate the venv
- Install a module, say "pip install datasets"
- Write a python script (test.py) to load datasets
# This is test.py
import datasets
- Do the following
torchrun \
--nnodes=1 \
--nproc_per_node=8 \
--max_restarts=0 \
--rdzv_id=123456 \
--rdzv_backend=c10d \
--rdzv_endpoint=localhost \
test.py
You will get "ModuleNotFoundError: No module named 'datasets'"
- Change 'torchrun' to 'python3 -m torch.distributed.launch', and it all works.
python3 -m torch.distributed.launch \
--standalone \
--nnodes=1 \
--nproc_per_node=8 \
test.py
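To narrow down the difference between the two launchers, a diagnostic variant of test.py can report which interpreter actually runs the script. This is only a sketch: the helper name `describe_interpreter` is hypothetical, and the venv check assumes an environment created with the standard `venv` module, where `sys.prefix` differs from `sys.base_prefix`.

```python
# Hypothetical diagnostic variant of test.py; the helper name is illustrative.
import sys


def describe_interpreter():
    """Return the interpreter path and whether it runs inside a venv."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "base_prefix": sys.base_prefix,
        # In a venv, sys.prefix points at the venv while sys.base_prefix
        # points at the installation the venv was created from.
        "in_venv": sys.prefix != sys.base_prefix,
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
    # sys.path lists the directories searched by "import datasets".
    for entry in sys.path:
        print("sys.path:", entry)
```

Running this script once under torchrun and once under python3 -m torch.distributed.launch should show whether the two launchers start the workers with the same interpreter and search path.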
System Info
Python 3.10.6 PyTorch 1.12.0 Linux (CentOS 7)
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu
@c0redumb can you check where torch/torchrun was installed? If it's not installed in the virtual env but instead installed at the user or system level it can end up searching in the wrong location.
@d4l3k, thanks for your response. You are correct that on my system torch/torchrun is loaded through "module load". They are not in the virtual env. When the virtual env is activated, I would expect that Python searches in virtual env first, then at the system locations. It seems to work that way as expected when I run python3 test.py directly without a problem.
Furthermore, python3 -m torch.distributed.launch works, and this torch.distributed is also at the system location (same as torchrun) and not in my virtual env. So there must be something that torchrun does differently from torch.distributed.launch.
Have you been able to replicate this on your side?
@c0redumb if you install torch in your virtual env does it solve the problem?
Also can you try running which torchrun to identify where it's coming from?
If it's not installed in the local venv it doesn't use the venv's python3 binary
Ex: on my system with it installed at the user level the torchrun dummy file explicitly specifies /usr/bin/python
$ cat /home/tristanr/.local/bin/torchrun
#!/usr/bin/python
...
We don't do anything special here; AFAIK this is standard Python behavior.
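The same check can be scripted. The sketch below reads the first line of whichever torchrun is on PATH and prints it next to the active interpreter. The helper name `read_shebang` is hypothetical, and it assumes the launcher is a plain text script with a `#!` line, as in the example above.

```python
import shutil
import sys
from typing import Optional


def read_shebang(script_path: str) -> Optional[str]:
    """Return the interpreter named on the script's first line, or None."""
    with open(script_path, "rb") as handle:
        first_line = handle.readline()
    if not first_line.startswith(b"#!"):
        return None
    return first_line[2:].decode("utf-8", errors="replace").strip()


if __name__ == "__main__":
    launcher = shutil.which("torchrun")
    if launcher is None:
        print("torchrun is not on PATH")
    else:
        print("torchrun:           ", launcher)
        print("shebang interpreter:", read_shebang(launcher))
        print("active interpreter: ", sys.executable)
```

If the shebang interpreter is not the venv's python, the workers would be started outside the venv even though it is activated in the shell.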
On my system they (both torchrun and python) point to the same system level installation.
which torchrun points to /public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/bin/torchrun.
Python in virtual env is linked to /public/apps/python/3.10.6/gcc.7.3.1/base/bin/python.
I am still confused as to why torchrun and python3 -m torch.distributed.launch have different behavior. They are both from the same installation. I am not convinced that this is, as you said, standard Python behavior, because python3 -m torch.distributed.launch test.py and python test.py both behave differently from torchrun. The latter two both import modules from the virtual env as expected.
What I am confused about is the "standard Python behavior" you are referring to. With Python installed at the system level, a personal venv created, and a module installed in that venv, I would expect "python test.py" to import that module when the venv is activated, which is the whole idea of a venv. python test.py works but torchrun test.py does not. That looks wrong, right? Maybe I am missing something?
I'm having the same issue... Were you able to figure out a solution?
I now run most of my workload in Docker containers to avoid virtualenv entirely. I haven't checked lately whether it runs properly in a virtualenv. But there seem to be other issues with torch.distributed from time to time that I have to work around. Overall it is still not quite stable.
Oh I see, thanks for the quick reply!
Changing the default python path in the torchrun file to the virtualenv's python path works fine
#!/path/to/your/env/bin/python
# -*- coding: utf-8 -*-
import re
import sys
from torch.distributed.run import main
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(main())
Can it be fixed in a future version?
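A possible alternative that avoids editing the installed file is to run the same entry module, torch.distributed.run, through the interpreter that is currently active. This is a sketch rather than a confirmed fix: the helper name `build_launch_command` is hypothetical, and it assumes torch is importable from the venv's python, as the working python3 -m torch.distributed.launch case suggests.

```python
import shlex
import sys
from typing import List


def build_launch_command(script: str, nproc_per_node: int = 8) -> List[str]:
    """Build a launch command that uses the active interpreter, not a shebang."""
    return [
        sys.executable,  # the venv's python when the venv is activated
        "-m",
        "torch.distributed.run",  # the module the torchrun script imports main from
        "--standalone",
        "--nnodes=1",
        f"--nproc_per_node={nproc_per_node}",
        script,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_launch_command("test.py")))
```

The printed command can be pasted into the shell, or passed to subprocess.run once it looks right.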
This is similar to this issue, which also offers some workarounds. It seems we will have to live with the problem for a while, but that is acceptable 😂