pytorch Torchrun seems to have problem with virtual environment

Issue description

Use torchrun (inside a virtual environment) to launch a Python script. The script cannot import modules installed in that virtual environment. Launching with torch.distributed.launch instead works well, but that method has been deprecated.

Code example

Create a venv, activate the venv
Install a module, say "pip install datasets"
Write a python script (test.py) to load datasets

# This is test.py
import datasets

Do the following

torchrun \
    --nnodes=1 \
    --nproc_per_node=8 \
    --max_restarts=0 \
    --rdzv_id=123456 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=localhost \
    test.py

You will get "ModuleNotFoundError: No module named 'datasets'"

After changing from 'torchrun' to 'python3 -m torch.distributed.launch', it all works.

python3 -m torch.distributed.launch \
    --standalone \
    --nnodes=1 \
    --nproc_per_node=8 \
    test.py

System Info

Python 3.10.6 PyTorch 1.12.0 Linux (CentOS 7)

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu

Jan 13 '23 04:01 c0redumb

@c0redumb can you check where torch/torchrun was installed? If it's not installed in the virtual env but instead installed at the user or system level it can end up searching in the wrong location.

Jan 13 '23 23:01 d4l3k

@d4l3k, thanks for your response. You are correct that on my system torch/torchrun is loaded through "module load". They are not in the virtual env. When the virtual env is activated, I would expect that Python searches in virtual env first, then at the system locations. It seems to work that way as expected when I run python3 test.py directly without a problem.

Furthermore, python3 -m torch.distributed.launch works, and torch.distributed is also at the system location (same as torchrun), not in my virtual env. So there must be something that torchrun does differently from torch.distributed.launch.

Have you been able to replicate this on your side?

Jan 13 '23 23:01 c0redumb

@c0redumb if you install torch in your virtual env does it solve the problem?

Also can you try running which torchrun to identify where it's coming from?

If it's not installed in the local venv it doesn't use the venv's python3 binary

Ex: on my system with it installed at the user level the torchrun dummy file explicitly specifies /usr/bin/python

$ cat /home/tristanr/.local/bin/torchrun
#!/usr/bin/python
...

We don't do anything special here; AFAIK this is standard Python behavior.
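To check this on a given system, one can read the first line of the installed torchrun script and compare it with the interpreter that is currently active. This is only a minimal sketch using the standard library; the helper names `parse_shebang` and `describe_launcher` are made up for illustration and are not part of PyTorch.

```python
import shutil
import sys


def parse_shebang(first_line):
    """Return the interpreter named on a '#!' line, or None if there is none."""
    first_line = first_line.strip()
    if not first_line.startswith("#!"):
        return None
    return first_line[2:].strip()


def describe_launcher(name="torchrun"):
    """Summarize which interpreter a console script found on PATH would start."""
    script = shutil.which(name)
    if script is None:
        return f"{name}: not found on PATH"
    # Read only the first line; decode leniently in case the file is binary.
    with open(script, "rb") as f:
        first_line = f.readline().decode("utf-8", errors="replace")
    shebang = parse_shebang(first_line)
    return (
        f"{name}: {script}\n"
        f"  shebang interpreter: {shebang}\n"
        f"  current interpreter: {sys.executable}"
    )


if __name__ == "__main__":
    print(describe_launcher())
```

If the shebang interpreter is not the venv's python, the script is started by that other interpreter regardless of which venv is activated.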

Jan 13 '23 23:01 d4l3k

On my system they (both torchrun and python) point to the same system level installation.

which torchrun points to /public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/bin/torchrun. Python in virtual env is linked to /public/apps/python/3.10.6/gcc.7.3.1/base/bin/python.

I am still confused as to why torchrun and python3 -m torch.distributed.launch behave differently. They are both from the same installation. I am not convinced that this is, as you said, standard Python behavior, because python3 -m torch.distributed.launch test.py and python test.py both behave differently from torchrun. The latter two both import modules from the virtual env as expected.

I am confused about what "standard python behavior" you are referring to. If I install Python at system level, create a personal venv, and install a module in the venv, I would expect "python test.py" to import that module (when the venv is activated), which is the idea of a venv. python test.py works but torchrun test.py does not. That looks wrong, right? Maybe I am missing something?
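One way to narrow this down is to have test.py report which interpreter is actually executing it, then run it once with python, once with python3 -m torch.distributed.launch, and once with torchrun. A minimal sketch, using only the standard library; the function name `interpreter_report` is an illustrative choice, not anything from PyTorch:

```python
import sys


def interpreter_report():
    """Collect the facts that decide which site-packages are searched."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "base_prefix": sys.base_prefix,
        # When the running interpreter belongs to a venv,
        # sys.prefix differs from sys.base_prefix.
        "in_venv": sys.prefix != sys.base_prefix,
        "site_packages": [p for p in sys.path if p.endswith("site-packages")],
    }


if __name__ == "__main__":
    for key, value in interpreter_report().items():
        print(f"{key}: {value}")
```

If the torchrun run reports in_venv: False while the other two report True, the launcher is starting the system interpreter, which would explain the missing module.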

Jan 14 '23 00:01 c0redumb

I'm having the same issue... Were you able to figure out a solution?

Apr 23 '23 20:04 lilakk

Now I run most of my workload in Docker containers to avoid virtualenv entirely. I haven't checked lately whether it runs properly in a virtualenv. But there seem to be other issues with torch.distributed from time to time that I have to work around. Overall it is still not quite stable yet.

Apr 23 '23 20:04 c0redumb

Oh I see, thanks for the quick reply!

Apr 23 '23 20:04 lilakk

Changing the interpreter path on the first line of the torchrun file to the virtualenv's Python path works fine:

#!/path/to/your/env/bin/python
# -*- coding: utf-8 -*-
import re
import sys
from torch.distributed.run import main
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(main())
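
An alternative that avoids editing the installed file is to start the same entry point through the venv's interpreter. The script above shows that torchrun only calls torch.distributed.run.main, so running that module with `python -m` should be equivalent. A minimal sketch, assuming torch is importable from the active interpreter (as the thread reports for python3 -m torch.distributed.launch); `build_torchrun_command` is a hypothetical helper, and the flag values are taken from the original report:

```python
import shlex
import sys


def build_torchrun_command(script, nproc_per_node=8, extra_args=()):
    """Build a launch command that uses the current interpreter, not a shebang."""
    return [
        sys.executable, "-m", "torch.distributed.run",
        "--standalone",
        "--nnodes=1",
        f"--nproc_per_node={nproc_per_node}",
        script,
        *extra_args,  # arguments passed through to the training script
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_torchrun_command("test.py")))
```

The resulting list can be passed to subprocess.run, or the printed command can be pasted into a shell with the venv activated.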

May 06 '23 07:05 usaradman

Can this be fixed in a future version?

Oct 06 '23 14:10 MrPeterJin

This is similar to this issue, which also offers some workarounds. It seems we will have to live with the problem for a while, but that's acceptable 😂

Mar 18 '24 07:03 DeclK
