evaluation encounters mpirun: not found

Open luoyuchenmlcv opened this issue 2 years ago • 38 comments

Dear Haotian, Thanks for your great work!

I am trying to evaluate the detection model on a single-GPU machine by running the following in the terminal:

torchpack dist-run -np 1 python tools/test.py configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml pretrained/bevfusion-det.pth --eval bbox

it gives me

/bin/sh: 1: mpirun: not found

when I am running

torchpack dist-run -np

it gives me

torchpack dist-run: error: argument -np/--nproc: expected one argument

when I am running

torchpack dist-run -np 1

it gives me

/bin/sh: 1: mpirun: not found

I am not sure what the problem could be. Could it be due to the single-GPU machine, or something else?

Many thanks in advance for your kind reply :)

luoyuchenmlcv avatar Jun 13 '22 13:06 luoyuchenmlcv

Hi @luoyuchenmlcv,

I think torchpack assumes OpenMPI exists on your machine, so you will have to install it (it provides mpirun) together with the mpi4py Python package, even if you only have 1 GPU. @zhijian-liu Zhijian can probably provide more details about the distributed environment.
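On a typical Ubuntu machine, something like the following should be enough (just a sketch; package names may differ on other distros):

```
# install OpenMPI (provides mpirun) plus the Python bindings torchpack relies on
sudo apt-get install openmpi-bin libopenmpi-dev
pip install mpi4py
```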

Best, Haotian

kentang-mit avatar Jun 13 '22 14:06 kentang-mit

Thanks a lot Haotian, the mpirun not found issue is solved. However, after successfully installing mpich and mpi4py, running the evaluation command gives another error:

```
[mpiexec@container-c6d211be3c-365c285e] match_arg (utils/args/args.c:159): unrecognized argument allow-run-as-root
[mpiexec@container-c6d211be3c-365c285e] HYDU_parse_array (utils/args/args.c:174): argument matching returned error
[mpiexec@container-c6d211be3c-365c285e] parse_args (ui/mpich/utils.c:1597): error parsing input array
[mpiexec@container-c6d211be3c-365c285e] HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1649): unable to parse user arguments
[mpiexec@container-c6d211be3c-365c285e] main (ui/mpich/mpiexec.c:149): error parsing parameters
```

And the mpirun version looks like this: [screenshot]

luoyuchenmlcv avatar Jun 14 '22 03:06 luoyuchenmlcv

I remember that I ran into a similar problem before with openmpi 3.x. I just checked our machine: it has openmpi 4.1.1, and my mpi4py version is 3.0.3. I'm not sure whether upgrading openmpi to 4.x will help solve this issue. Maybe @zhijian-liu can provide more insights if that does not work.
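One quick way to tell which MPI flavor is actually being picked up (the allow-run-as-root argument in your log is an OpenMPI option that MPICH's hydra launcher does not understand):

```
# OpenMPI prints "mpirun (Open MPI) x.y.z"; MPICH prints a HYDRA build banner
mpirun --version
```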

kentang-mit avatar Jun 14 '22 04:06 kentang-mit

Hi @luoyuchenmlcv,

Are you in the correct folder? It seems to me that you are currently in the openmpi folder instead of the project folder.

Best, Haotian

kentang-mit avatar Jun 15 '22 13:06 kentang-mit

Thanks for pointing out the folder issue; I think I have now changed to the correct folder.

The folder structure is like this: [screenshot]

but this error still exists: [screenshot]

luoyuchenmlcv avatar Jun 15 '22 13:06 luoyuchenmlcv

When I run inside the project folder, the error persists: [screenshot]

When I give them absolute paths: [screenshot]

luoyuchenmlcv avatar Jun 15 '22 14:06 luoyuchenmlcv

Is it possible to run torchpack not as the root user and not inside a container?

kentang-mit avatar Jun 15 '22 15:06 kentang-mit

Installing openmpi instead of mpich can solve the problem.

songlilucky avatar Jun 15 '22 15:06 songlilucky

If your machine does not have openmpi, you can also use another distributed launch method to solve this problem.

Installing openmpi instead of mpich can solve the problem.

songlilucky avatar Jun 15 '22 16:06 songlilucky

Thanks for the comment @songlilucky!

kentang-mit avatar Jun 15 '22 16:06 kentang-mit

If your machine does not have openmpi, you can also use another distributed launch method to solve this problem.

Installing openmpi instead of mpich can solve the problem.

Thanks for your suggestion; as Haotian suggested, openmpi 4.1.1 has already been installed, so the error seems to be somewhere else.

luoyuchenmlcv avatar Jun 16 '22 04:06 luoyuchenmlcv

Is it possible to run torchpack not as the root user and not inside a container?

I ran the following command in the root folder:

root@container-c6d211be3c-365c285e:/# python /root/autodl-tmp/bevfusion/tools/test.py /root/autodl-tmp/bevfusion/configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml pretrained/bevfusion-det.pth --eval bbox

It gives me an error like this, but I suspect this might be due to not running torchpack: [screenshot]

But mpirun seems to work when running a hello_c.c file as root: [screenshot]

luoyuchenmlcv avatar Jun 16 '22 04:06 luoyuchenmlcv

Hi @luoyuchenmlcv, I mean you should still use torchpack to launch the script, but instead of running in a Docker container as the root user, I would recommend running the script directly in host mode and not as root. Hope that is clear enough.

kentang-mit avatar Jun 16 '22 04:06 kentang-mit

Hi @luoyuchenmlcv, I mean you should still use torchpack to launch the script, but instead of running in a Docker container as the root user, I would recommend running the script directly in host mode and not as root. Hope that is clear enough.

Thanks for your suggestion. Right now I can only run code on a cloud service, and adding a new Linux user so I can run in host mode seems difficult, since I would have to reinstall Python, the conda env, and everything else.

I guess the problem might be torchpack. Is there any standard test to see whether torchpack works?
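For example, would a minimal check like this be a reasonable smoke test? (Just a sketch on my side; I am assuming the torchpack distributed helpers that this repo's train script imports.)

```
# should print "0 1" if torchpack and MPI are wired up correctly on a single GPU
torchpack dist-run -np 1 python -c "from torchpack import distributed as dist; dist.init(); print(dist.rank(), dist.size())"
```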

In addition, when running the evaluation command, there is a suspicious warning: Warning: could not find environment variable "-x". Where does this come from?

luoyuchenmlcv avatar Jun 16 '22 05:06 luoyuchenmlcv

Hello Haotian, finally solved! The issue is that in the virtual machine there is an environment variable with an empty name, which breaks mpirun's argument parsing.
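For anyone hitting the same thing: an environment variable with an empty name shows up in the env output as a line starting with "=", so a rough check is:

```
# list environment entries whose name is empty
env | grep '^='
```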

After finishing the evaluation, the detection results are as follows; they seem a little bit lower than the test results: [screenshot]

[screenshot]

The eval log: RTX A5000 24GB.txt

My env: [screenshot]

luoyuchenmlcv avatar Jun 16 '22 09:06 luoyuchenmlcv

@luoyuchenmlcv I have the same problem. Could you tell me how to solve this? Thanks a lot!

cps80 avatar Jun 16 '22 11:06 cps80

@luoyuchenmlcv I have the same problem. Could you tell me how to solve this? Thanks a lot!

What exactly is your issue?

luoyuchenmlcv avatar Jun 16 '22 11:06 luoyuchenmlcv

[screenshot]

cps80 avatar Jun 16 '22 11:06 cps80

@luoyuchenmlcv this is my issue.

cps80 avatar Jun 16 '22 11:06 cps80

@luoyuchenmlcv this is my issue.

I guess you are using mpich instead of openmpi. Please follow this link to install openmpi: https://sites.google.com/site/rangsiman1993/comp-env/program-install/install-openmpi

After finishing the installation of openmpi, run

sudo ldconfig

then

pip install mpi4py

then try running the eval command to see if the problem is solved.
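In case that link becomes unavailable, the installation roughly follows the usual configure/make flow (a sketch based on my setup; the version and install prefix are assumptions, adjust them to yours):

```
# build OpenMPI 4.1.1 from source and register its shared libraries
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.gz
tar xzf openmpi-4.1.1.tar.gz && cd openmpi-4.1.1
./configure --prefix=/usr/local
make -j"$(nproc)" && sudo make install
sudo ldconfig
pip install mpi4py
```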

luoyuchenmlcv avatar Jun 16 '22 11:06 luoyuchenmlcv

Hello Haotian, finally solved! The issue is that in the virtual machine there is an environment variable with an empty name, which breaks mpirun's argument parsing.

After finishing the evaluation, the detection results are as follows; they seem a little bit lower than the test results: [screenshot]

[screenshot]

The eval log: RTX A5000 24GB.txt

My env: [screenshot]

Hi @luoyuchenmlcv, your evaluation results are actually a bit higher than what I get on the validation set (68.39 mAP and 71.32 NDS). The reported results in your screenshot are on the test split.

kentang-mit avatar Jun 16 '22 12:06 kentang-mit

[screenshot]

I would suggest that you follow the advice of @luoyuchenmlcv to install openmpi 4.1.1 and mpi4py 3.0.3. After that you can try out the same command again.

kentang-mit avatar Jun 16 '22 12:06 kentang-mit

@luoyuchenmlcv @kentang-mit Thanks a lot!

cps80 avatar Jun 17 '22 08:06 cps80

You are welcome @PeisongCheng. I'll leave this issue open for discussions on MPI.

kentang-mit avatar Jun 17 '22 16:06 kentang-mit

Is it possible to run torchpack not as the root user and not inside a container?

I ran the following command in the root folder:

root@container-c6d211be3c-365c285e:/# python /root/autodl-tmp/bevfusion/tools/test.py /root/autodl-tmp/bevfusion/configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml pretrained/bevfusion-det.pth --eval bbox

It gives me an error like this, but I suspect this might be due to not running torchpack: [screenshot]

But mpirun seems to work when running a hello_c.c file as root: [screenshot]

Hi, I have the same problem. Could you tell me how to solve it? Thanks a lot! [screenshot]

tangtaogo avatar Jun 18 '22 05:06 tangtaogo

Is it possible to run torchpack not as the root user and not inside a container?

I ran the following command in the root folder:

root@container-c6d211be3c-365c285e:/# python /root/autodl-tmp/bevfusion/tools/test.py /root/autodl-tmp/bevfusion/configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml pretrained/bevfusion-det.pth --eval bbox

It gives me an error like this, but I suspect this might be due to not running torchpack: [screenshot] But mpirun seems to work when running a hello_c.c file as root: [screenshot]

Hi, I have the same problem. Could you tell me how to solve it? Thanks a lot! [screenshot]

Hi, this KeyError is due to not running torchpack. If you read the torchpack source code, you will see that the MASTER_HOST environment variable is set by the torchpack launcher, so you must run the script through torchpack.
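You can see the difference by printing the variable under both launch methods (just a quick sanity check):

```
# bare python: prints None, which is why test.py raises the KeyError
python -c "import os; print(os.environ.get('MASTER_HOST'))"
# through torchpack: the launcher sets MASTER_HOST before spawning workers
torchpack dist-run -np 1 python -c "import os; print(os.environ.get('MASTER_HOST'))"
```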

luoyuchenmlcv avatar Jun 18 '22 11:06 luoyuchenmlcv

Hi @Trent-tangtao, I believe @luoyuchenmlcv is correct; you will need to launch the script using torchpack.

kentang-mit avatar Jun 18 '22 15:06 kentang-mit

Hi @luoyuchenmlcv @kentang-mit, I have a problem about where to place the data generated by create_data.py. When the data was in the mmdet3d directory, I tried to test: [screenshot] and got this error: [screenshot] So I changed the data directory and placed the data in the bevfusion directory: [screenshot] and got this error: [screenshot]

I think you may have just placed the data and the previous .pkl files in the bevfusion directory. You need to regenerate new .pkl files or try "ln -s".
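For example, something along these lines (the paths and the create_data.py arguments are assumptions based on the usual mmdet3d-style layout; adjust them to your setup):

```
# symlink the dataset into the bevfusion working directory
ln -s /path/to/nuscenes data/nuscenes
# regenerate the info .pkl files so the stored paths match this project
python tools/create_data.py nuscenes --root-path ./data/nuscenes --out-dir ./data/nuscenes --extra-tag nuscenes
```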

tangtaogo avatar Jun 29 '22 09:06 tangtaogo

Hi @luoyuchenmlcv @kentang-mit, I have a problem about where to place the data generated by create_data.py. When the data was in the mmdet3d directory, I tried to test: [screenshot] and got this error: [screenshot] So I changed the data directory and placed the data in the bevfusion directory: [screenshot] and got this error: [screenshot]

I think you may have just placed the data and the previous .pkl files in the bevfusion directory. You need to regenerate new .pkl files or try "ln -s".

Thanks for your suggestions @Trent-tangtao, I solved this issue.

hongjiacheng1014 avatar Jun 29 '22 12:06 hongjiacheng1014

Hi @kentang-mit, I am facing the same issue ("there are not enough slots..."). My env: I am running the command on a cluster using slurm. When I specify the command as torchpack dist-run -np 1 python tools/test.py and use only one GPU on the cluster, it works fine. But when I want to utilize more GPUs, i.e. 2 or more, running torchpack dist-run -np 4 python tools/test.py with 4 GPUs allocated through slurm results in the same error.

Since it works with 1 GPU, I think there is no problem with mpi or openmpi (also, I am on the cluster). Could you please help me with how I can utilize more GPUs to speed up the process?
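Would telling OpenMPI to oversubscribe the detected slots be the right direction? Something like this, perhaps (just a guess on my side, based on OpenMPI's generic MCA parameters, not on anything torchpack-specific):

```
# hypothetical workaround sketch: allow more ranks than the slots OpenMPI detects
export OMPI_MCA_rmaps_base_oversubscribe=1
torchpack dist-run -np 4 python tools/test.py
```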

IAMShashankk avatar Jul 14 '22 21:07 IAMShashankk