evaluation encounters mpirun: not found
Dear Haotian, Thanks for your great work!
I am trying to evaluate the detection model on a single-GPU machine by running the following in the terminal:
torchpack dist-run -np 1 python tools/test.py configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml pretrained/bevfusion-det.pth --eval bbox
it gives me
/bin/sh: 1: mpirun: not found
When I run
torchpack dist-run -np
it gives me
torchpack dist-run: error: argument -np/--nproc: expected one argument
When I run
torchpack dist-run -np 1
it gives me
/bin/sh: 1: mpirun: not found
I am not sure what the problem could be. Could it be due to the single-GPU machine, or something else?
Many thanks for your kind reply :)
Hi @luoyuchenmlcv,
I think torchpack assumes OpenMPI is present on your machine, and you will have to install mpirun and the mpi4py Python package even if you only have 1 GPU. @zhijian-liu Zhijian can probably provide more details about the distributed environment.
Best, Haotian
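For reference, a minimal way to get an MPI runtime plus the Python bindings in place on an Ubuntu-style machine; the apt package names below are the usual ones and an assumption here, so adjust them for your distro:
# OpenMPI launcher, compiler wrappers, and headers
sudo apt-get install -y openmpi-bin libopenmpi-dev
# Python MPI bindings
pip install mpi4py
# sanity check: mpirun should now resolve and report an Open MPI version
mpirun --version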
Thanks a lot Haotian, the mpirun-not-found issue is solved. However, after successfully installing mpich and mpi4py, running the evaluation command gives another error:
[mpiexec@container-c6d211be3c-365c285e] match_arg (utils/args/args.c:159): unrecognized argument allow-run-as-root
[mpiexec@container-c6d211be3c-365c285e] HYDU_parse_array (utils/args/args.c:174): argument matching returned error
[mpiexec@container-c6d211be3c-365c285e] parse_args (ui/mpich/utils.c:1597): error parsing input array
[mpiexec@container-c6d211be3c-365c285e] HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1649): unable to parse user arguments
[mpiexec@container-c6d211be3c-365c285e] main (ui/mpich/mpiexec.c:149): error parsing parameters
And my mpirun version looks like this:
I remember running into a similar problem before with openmpi 3.x, and I just checked our machine: it has openmpi 4.1.1, and my mpi4py version is 3.0.3. I'm not sure whether upgrading openmpi to 4.x will solve this issue. Maybe @zhijian-liu can provide more insights if that does not work.
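A quick way to tell which MPI implementation mpirun actually resolves to (the rejected allow-run-as-root argument is an Open MPI flag that MPICH's hydra launcher does not recognize, which is what the error above points at):
which mpirun        # confirm the binary on PATH is the one you expect
mpirun --version    # Open MPI prints "mpirun (Open MPI) x.y.z"; MPICH prints HYDRA build details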
Thanks for pointing out the folder issue; I think I have now changed to the correct folder.
The folder structure is like this:
but this error still exists:
Hi @luoyuchenmlcv,
Are you in the correct folder? It seems to me that you are currently in the openmpi folder instead of the project folder.
Best, Haotian
When I run inside the project folder, the error is still there:
When I give them absolute paths:
Is it possible not to run torchpack in root mode or in a container?
Installing openmpi rather than mpich can solve the problem. If your machine does not have openmpi, you can also use another dist-run method to solve this problem.
Thanks for the comment @songlilucky!
Thanks for your suggestion. As suggested by Haotian, openmpi 4.1.1 has already been installed; the error seems to be somewhere else.
Is it possible not to run torchpack in root mode or in a container?
I run the following command in the root folder:
root@container-c6d211be3c-365c285e:/# python /root/autodl-tmp/bevfusion/tools/test.py /root/autodl-tmp/bevfusion/configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml pretrained/bevfusion-det.pth --eval bbox
It gives me an error like this, but I suspect this might be due to not running torchpack:
But mpirun seems to work when running a hello_c.c file under root:
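For anyone who wants to reproduce that sanity check: hello_c.c ships in Open MPI's examples/ directory, and a typical way to build and run it is the sketch below (paths assumed):
mpicc hello_c.c -o hello_c     # compile with the MPI compiler wrapper
mpirun -np 1 ./hello_c         # launch a single rank
# when running as root, Open MPI additionally needs:
# mpirun --allow-run-as-root -np 1 ./hello_c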
Hi @luoyuchenmlcv, I mean you should still use torchpack to launch the script, but instead of running it in a docker container as the root user, I would recommend running the script directly on the host and not as root. Hope that is clear enough.
Thanks for your suggestion. Right now I can only run code on a cloud service, and adding a new user to the Linux system to run in host mode seems difficult, since I would have to install Python, the conda env, and everything else again.
I guess the problem might be torchpack. Is there any standard test to check whether torchpack works?
In addition, when running the evaluation command there is a suspicious warning: Warning: could not find environment variable "-x". Where does this come from?
Hello Haotian, finally solved. The issue is that in the virtual machine there is an environment variable with an empty name, which breaks mpirun's argument parsing.
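For anyone debugging the same symptom: an environment entry with an empty name shows up in env output as a line starting with '=', and if the launcher forwards each variable to mpirun as a -x <name> flag (which the "could not find environment variable "-x"" warning suggests), an empty name leaves a dangling -x for mpirun to misparse. A quick check, as a sketch:
# any output here is an environment entry whose variable name is empty
env | grep '^='
# if it prints anything, start the job from a clean shell (or remove the offending
# entry) and rerun the eval command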
After finishing the evaluation, the detection results are as follows; they seem a little bit lower than the test results:
The eval log: RTX A5000 24GB.txt
My env
@luoyuchenmlcv I have the same problem. Could you tell me how to solve this? Thanks a lot!
What exactly is your issue?
@luoyuchenmlcv this is my issue.
I guess you are using mpich instead of openmpi.
Please follow this link to install openmpi: https://sites.google.com/site/rangsiman1993/comp-env/program-install/install-openmpi
After finishing the installation of openmpi, run
sudo ldconfig
then
pip install mpi4py
and then try running the eval command to see if the problem is solved.
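For reference, a from-source Open MPI install along the lines of that guide looks roughly like the sketch below; the version number, download URL, and install prefix are assumptions, so adapt them to your setup:
# download and unpack an Open MPI release (4.1.1 shown as an example)
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.gz
tar -xzf openmpi-4.1.1.tar.gz && cd openmpi-4.1.1
# build and install under /usr/local
./configure --prefix=/usr/local
make -j"$(nproc)"
sudo make install
# refresh the shared-library cache, then install the Python bindings
sudo ldconfig
pip install mpi4py
# verify
mpirun --version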
Hi @luoyuchenmlcv, your evaluation results are actually a bit higher than what I get on the validation set (68.39 mAP and 71.32 NDS). The reported results in your screenshot are on the test split.
I would suggest that you follow the advice of @luoyuchenmlcv to install openmpi 4.1.1 and mpi4py 3.0.3. After that you can try out the same command again.
@luoyuchenmlcv @kentang-mit Thanks a lot!
You are welcome @PeisongCheng. I'll leave this issue open for discussions on MPI.
Is it possible not to run torchpack in root mode or in a container?
I run the following command in the root folder:
root@container-c6d211be3c-365c285e:/# python /root/autodl-tmp/bevfusion/tools/test.py /root/autodl-tmp/bevfusion/configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml pretrained/bevfusion-det.pth --eval bbox
It gives me an error like this, but I suspect this might be due to not running torchpack:
But mpirun seems to work when running a hello_c.c file under root:
Hi, I have the same problem. Could you tell me how to solve this? Thanks a lot!
Hi, this KeyError is due to not running torchpack. If you read the torchpack source code, the MASTER_HOST system env variable is added there, so you must run the script through torchpack.
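In other words, launching tools/test.py with bare python skips the environment that torchpack sets up (MASTER_HOST among it), hence the KeyError; the launch that works is the one from earlier in this thread:
torchpack dist-run -np 1 python tools/test.py configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml pretrained/bevfusion-det.pth --eval bbox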
Hi @Trent-tangtao, I believe @luoyuchenmlcv is correct, you will need to launch the script using torchpack.
Hi @luoyuchenmlcv @kentang-mit, I have a question about where to place the data generated by the 'create_data.py' file. When the data was in the mmdet3d directory, I tried to test.
I got this error.
So I changed the data directory and placed the data in the bevfusion directory.
I got this error.
I think you may have just placed the data and the previous pkl files in the bevfusion directory. You need to regenerate new pkl files or try "ln -s ".
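A hedged sketch of both options; the dataset location is assumed, and the create_data.py flags below follow the usual mmdetection3d convention, so double-check them against tools/create_data.py --help in this repo:
# option 1: symlink an already-prepared nuScenes folder into the bevfusion repo
cd /path/to/bevfusion
ln -s /path/to/nuscenes data/nuscenes
# option 2: regenerate the info .pkl files from inside the bevfusion repo so that
# the paths cached in them match this project's layout
python tools/create_data.py nuscenes --root-path ./data/nuscenes --out-dir ./data/nuscenes --extra-tag nuscenes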
Thanks for your suggestions @Trent-tangtao, I solved this issue.
Hi @kentang-mit
I am facing the same issue "there are not enough slots....".
My env:
I am running the command on the cluster using slurm.
When I run the command as torchpack dist-run -np 1 python tools/test.py and use only one GPU on the cluster, it works fine.
But when I want to use more GPUs, i.e. 2 or more, with torchpack dist-run -np 4 python tools/test.py and allocate 4 GPUs via the slurm command, it results in the same error.
As it works with 1 GPU, I think there is no problem with mpi or openmpi (also, I am on the cluster). Could you please help me with how I can utilize more GPUs to speed up the process?
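"There are not enough slots" is Open MPI's mpirun refusing to start more ranks than the slots it sees in the allocation, so the first thing to check is that the slurm job requests at least as many tasks as -np asks for. A rough sketch with assumed values:
# request an allocation whose task/GPU count matches -np (example values)
srun --ntasks=4 --gres=gpu:4 --pty bash
# or, as a blunt workaround, let Open MPI oversubscribe the slots it detected;
# setting the MCA parameter via the environment avoids editing whatever mpirun
# command torchpack builds internally
export OMPI_MCA_rmaps_base_oversubscribe=1
torchpack dist-run -np 4 python tools/test.py ...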