evaluation encounters mpirun: not found

Open luoyuchenmlcv opened this issue 2 years ago • 38 comments

Dear Haotian, Thanks for your great work!

I am trying to evaluate the detection model on a single-GPU machine by running the following in the terminal:

torchpack dist-run -np 1 python tools/test.py configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml pretrained/bevfusion-det.pth --eval bbox

it gives me

/bin/sh: 1: mpirun: not found

when I am running

torchpack dist-run -np

it gives me

torchpack dist-run: error: argument -np/--nproc: expected one argument

when I am running

torchpack dist-run -np 1

it gives me

/bin/sh: 1: mpirun: not found

I am not sure what the problem could be. Could it be due to the single-GPU machine, or something else?

Many thanks in advance for your kind reply :)

luoyuchenmlcv avatar Jun 13 '22 13:06 luoyuchenmlcv

Hi @luoyuchenmlcv,

I think torchpack assumes OpenMPI exists on your machine, so you will have to install it (it provides mpirun) together with the mpi4py Python package, even if you only have 1 GPU. @zhijian-liu Zhijian can probably provide more details about the distributed environment.
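On a typical Ubuntu machine, something like the following should be enough (just a sketch; package names may differ on other distros):

```
# install OpenMPI (provides mpirun) plus the Python bindings torchpack relies on
sudo apt-get install openmpi-bin libopenmpi-dev
pip install mpi4py
```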

Best, Haotian

kentang-mit avatar Jun 13 '22 14:06 kentang-mit

Thanks a lot Haotian, the mpirun not found issue is solved. However, after successfully installing mpich and mpi4py, running the evaluation command gives another error:

```
[mpiexec@container-c6d211be3c-365c285e] match_arg (utils/args/args.c:159): unrecognized argument allow-run-as-root
[mpiexec@container-c6d211be3c-365c285e] HYDU_parse_array (utils/args/args.c:174): argument matching returned error
[mpiexec@container-c6d211be3c-365c285e] parse_args (ui/mpich/utils.c:1597): error parsing input array
[mpiexec@container-c6d211be3c-365c285e] HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1649): unable to parse user arguments
[mpiexec@container-c6d211be3c-365c285e] main (ui/mpich/mpiexec.c:149): error parsing parameters
```

And the mpirun version looks like this: [screenshot]

luoyuchenmlcv avatar Jun 14 '22 03:06 luoyuchenmlcv

I remember that I ran into a similar problem before with openmpi 3.x. I just checked our machine: it has openmpi 4.1.1, and my mpi4py version is 3.0.3. I'm not sure whether upgrading openmpi to 4.x will help solve this issue. Maybe @zhijian-liu can provide more insights if that does not work.
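One quick way to tell which MPI flavor is actually being picked up (the allow-run-as-root argument in your log is an OpenMPI option that MPICH's hydra launcher does not understand):

```
# OpenMPI prints "mpirun (Open MPI) x.y.z"; MPICH prints a HYDRA build banner
mpirun --version
```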

kentang-mit avatar Jun 14 '22 04:06 kentang-mit

Hi @luoyuchenmlcv,

Are you in the correct folder? It seems to me that you are currently in the openmpi folder instead of the project folder.

Best, Haotian

kentang-mit avatar Jun 15 '22 13:06 kentang-mit

Thanks for pointing out the folder issue; I think I have now changed to the correct folder.

The folder structure is like this: [screenshot]

but this error still exists: [screenshot]

luoyuchenmlcv avatar Jun 15 '22 13:06 luoyuchenmlcv

When I run inside the project folder, the error persists: [screenshot]

When I give them absolute paths: [screenshot]

luoyuchenmlcv avatar Jun 15 '22 14:06 luoyuchenmlcv

Is it possible to run torchpack not as the root user and not inside a container?

kentang-mit avatar Jun 15 '22 15:06 kentang-mit

Installing openmpi instead of mpich can solve the problem.

songlilucky avatar Jun 15 '22 15:06 songlilucky

If your machine does not have openmpi, you can also use another distributed launch method to solve this problem.

Installing openmpi instead of mpich can solve the problem.

songlilucky avatar Jun 15 '22 16:06 songlilucky

Thanks for the comment @songlilucky!

kentang-mit avatar Jun 15 '22 16:06 kentang-mit

If your machine does not have openmpi, you can also use another distributed launch method to solve this problem.

Installing openmpi instead of mpich can solve the problem.

Thanks for your suggestion; as Haotian suggested, openmpi 4.1.1 has already been installed, so the error seems to be somewhere else.

luoyuchenmlcv avatar Jun 16 '22 04:06 luoyuchenmlcv

Is it possible to run torchpack not as the root user and not inside a container?

I ran the following command in the root folder:

root@container-c6d211be3c-365c285e:/# python /root/autodl-tmp/bevfusion/tools/test.py /root/autodl-tmp/bevfusion/configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml pretrained/bevfusion-det.pth --eval bbox

It gives me an error like this, but I suspect this might be due to not running torchpack: [screenshot]

But mpirun seems to work when running a hello_c.c file as root: [screenshot]

luoyuchenmlcv avatar Jun 16 '22 04:06 luoyuchenmlcv

Hi @luoyuchenmlcv, I mean you should still use torchpack to launch the script, but instead of running in a Docker container as the root user, I would recommend running the script directly in host mode and not as root. Hope that is clear enough.

kentang-mit avatar Jun 16 '22 04:06 kentang-mit

Hi @luoyuchenmlcv, I mean you should still use torchpack to launch the script, but instead of running in a Docker container as the root user, I would recommend running the script directly in host mode and not as root. Hope that is clear enough.

Thanks for your suggestion. Right now I can only run code on a cloud service, and adding a new Linux user so I can run in host mode seems difficult, since I would have to reinstall Python, the conda env, and everything else.

I guess the problem might be torchpack. Is there any standard test to see whether torchpack works?
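For example, would a minimal check like this be a reasonable smoke test? (Just a sketch on my side; I am assuming the torchpack distributed helpers that this repo's train script imports.)

```
# should print "0 1" if torchpack and MPI are wired up correctly on a single GPU
torchpack dist-run -np 1 python -c "from torchpack import distributed as dist; dist.init(); print(dist.rank(), dist.size())"
```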

In addition, when running the evaluation command, there is a suspicious warning: Warning: could not find environment variable "-x". Where does this come from?

luoyuchenmlcv avatar Jun 16 '22 05:06 luoyuchenmlcv

Hello Haotian, finally solved! The issue is that in the virtual machine there is an environment variable with an empty name, which breaks mpirun's argument parsing.
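For anyone hitting the same thing: an environment variable with an empty name shows up in the env output as a line starting with "=", so a rough check is:

```
# list environment entries whose name is empty
env | grep '^='
```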

After finishing the evaluation, the detection results are as follows; they seem a little bit lower than the test results: [screenshot]

[screenshot]

The eval log: RTX A5000 24GB.txt

My env: [screenshot]

luoyuchenmlcv avatar Jun 16 '22 09:06 luoyuchenmlcv

@luoyuchenmlcv I have the same problem. Could you tell me how to solve this? Thanks a lot!

cps80 avatar Jun 16 '22 11:06 cps80

@luoyuchenmlcv I have the same problem. Could you tell me how to solve this? Thanks a lot!

What exactly is your issue?

luoyuchenmlcv avatar Jun 16 '22 11:06 luoyuchenmlcv

[screenshot]

cps80 avatar Jun 16 '22 11:06 cps80

@luoyuchenmlcv this is my issue.

cps80 avatar Jun 16 '22 11:06 cps80

@luoyuchenmlcv this is my issue.

I guess you are using mpich instead of openmpi. Please follow this link to install openmpi: https://sites.google.com/site/rangsiman1993/comp-env/program-install/install-openmpi

After finishing the installation of openmpi, run

sudo ldconfig

then

pip install mpi4py

then try running the eval command to see if the problem is solved.
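In case that link becomes unavailable, the installation roughly follows the usual configure/make flow (a sketch based on my setup; the version and install prefix are assumptions, adjust them to yours):

```
# build OpenMPI 4.1.1 from source and register its shared libraries
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.gz
tar xzf openmpi-4.1.1.tar.gz && cd openmpi-4.1.1
./configure --prefix=/usr/local
make -j"$(nproc)" && sudo make install
sudo ldconfig
pip install mpi4py
```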

luoyuchenmlcv avatar Jun 16 '22 11:06 luoyuchenmlcv

Hello Haotian, finally solved! The issue is that in the virtual machine there is an environment variable with an empty name, which breaks mpirun's argument parsing.

After finishing the evaluation, the detection results are as follows; they seem a little bit lower than the test results: [screenshot]

[screenshot]

The eval log: RTX A5000 24GB.txt

My env: [screenshot]

Hi @luoyuchenmlcv, your evaluation results are actually a bit higher than what I get on the validation set (68.39 mAP and 71.32 NDS). The reported results in your screenshot are on the test split.

kentang-mit avatar Jun 16 '22 12:06 kentang-mit

[screenshot]

I would suggest that you follow the advice of @luoyuchenmlcv to install openmpi 4.1.1 and mpi4py 3.0.3. After that you can try out the same command again.

kentang-mit avatar Jun 16 '22 12:06 kentang-mit

@luoyuchenmlcv @kentang-mit Thanks a lot!

cps80 avatar Jun 17 '22 08:06 cps80

You are welcome @PeisongCheng. I'll leave this issue open for discussions on MPI.

kentang-mit avatar Jun 17 '22 16:06 kentang-mit

Is it possible to run torchpack not as the root user and not inside a container?

I ran the following command in the root folder:

root@container-c6d211be3c-365c285e:/# python /root/autodl-tmp/bevfusion/tools/test.py /root/autodl-tmp/bevfusion/configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml pretrained/bevfusion-det.pth --eval bbox

It gives me an error like this, but I suspect this might be due to not running torchpack: [screenshot]

But mpirun seems to work when running a hello_c.c file as root: [screenshot]

Hi, I have the same problem. Could you tell me how to solve it? Thanks a lot! [screenshot]

tangtaogo avatar Jun 18 '22 05:06 tangtaogo

Is it possible to run torchpack not as the root user and not inside a container?

I ran the following command in the root folder:

root@container-c6d211be3c-365c285e:/# python /root/autodl-tmp/bevfusion/tools/test.py /root/autodl-tmp/bevfusion/configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml pretrained/bevfusion-det.pth --eval bbox

It gives me an error like this, but I suspect this might be due to not running torchpack: [screenshot] But mpirun seems to work when running a hello_c.c file as root: [screenshot]

Hi, I have the same problem. Could you tell me how to solve it? Thanks a lot! [screenshot]

Hi, this KeyError is due to not running torchpack. If you read the torchpack source code, you will see that the MASTER_HOST environment variable is set by the torchpack launcher, so you must run the script through torchpack.
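You can see the difference by printing the variable under both launch methods (just a quick sanity check):

```
# bare python: prints None, which is why test.py raises the KeyError
python -c "import os; print(os.environ.get('MASTER_HOST'))"
# through torchpack: the launcher sets MASTER_HOST before spawning workers
torchpack dist-run -np 1 python -c "import os; print(os.environ.get('MASTER_HOST'))"
```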

luoyuchenmlcv avatar Jun 18 '22 11:06 luoyuchenmlcv

Hi @Trent-tangtao, I believe @luoyuchenmlcv is correct; you will need to launch the script using torchpack.

kentang-mit avatar Jun 18 '22 15:06 kentang-mit

Hi @luoyuchenmlcv @kentang-mit, I have a problem about where to place the data generated by create_data.py. When the data was in the mmdet3d directory, I tried to test: [screenshot] and got this error: [screenshot] So I changed the data directory and placed the data in the bevfusion directory: [screenshot] and got this error: [screenshot]

I think you may have just placed the data and the previous .pkl files in the bevfusion directory. You need to regenerate new .pkl files or try "ln -s".
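For example, something along these lines (the paths and the create_data.py arguments are assumptions based on the usual mmdet3d-style layout; adjust them to your setup):

```
# symlink the dataset into the bevfusion working directory
ln -s /path/to/nuscenes data/nuscenes
# regenerate the info .pkl files so the stored paths match this project
python tools/create_data.py nuscenes --root-path ./data/nuscenes --out-dir ./data/nuscenes --extra-tag nuscenes
```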

tangtaogo avatar Jun 29 '22 09:06 tangtaogo

Hi @luoyuchenmlcv @kentang-mit, I have a problem about where to place the data generated by create_data.py. When the data was in the mmdet3d directory, I tried to test: [screenshot] and got this error: [screenshot] So I changed the data directory and placed the data in the bevfusion directory: [screenshot] and got this error: [screenshot]

I think you may have just placed the data and the previous .pkl files in the bevfusion directory. You need to regenerate new .pkl files or try "ln -s".

Thanks for your suggestions @Trent-tangtao, I solved this issue.

hongjiacheng1014 avatar Jun 29 '22 12:06 hongjiacheng1014

Hi @kentang-mit, I am facing the same issue ("there are not enough slots..."). My env: I am running the command on a cluster using slurm. When I specify the command as torchpack dist-run -np 1 python tools/test.py and use only one GPU on the cluster, it works fine. But when I want to utilize more GPUs, i.e. 2 or more, running torchpack dist-run -np 4 python tools/test.py with 4 GPUs allocated through slurm results in the same error.

Since it works with 1 GPU, I think there is no problem with mpi or openmpi (also, I am on the cluster). Could you please help me with how I can utilize more GPUs to speed up the process?
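Would telling OpenMPI to oversubscribe the detected slots be the right direction? Something like this, perhaps (just a guess on my side, based on OpenMPI's generic MCA parameters, not on anything torchpack-specific):

```
# hypothetical workaround sketch: allow more ranks than the slots OpenMPI detects
export OMPI_MCA_rmaps_base_oversubscribe=1
torchpack dist-run -np 4 python tools/test.py
```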

IAMShashankk avatar Jul 14 '22 21:07 IAMShashankk