AlphaPose

inference- loading model stuck (inference pics)

Open • hsauod opened this issue 4 years ago • 10 comments

Hi team, thank you for your great pose project!

When I ran inference on Amazon AWS EC2 (Ubuntu 16), everything went well. Inference on Google Colab also worked fine.

But yesterday, when I tried to run inference on CentOS + CUDA 11.1, it got stuck at this point:

Loading YOLO model..
Loading pose model from pretrained_models/fast_res50_256x192.pth...
0%| | 0/695 [00:00<?, ?it/s]/

My test folder contains only 3 pics, no other file types.

Thank you very much.

hsauod, Feb 25 '21 01:02

Hi, can you try adding --sp? Some error info may appear in this mode.

Fang-Haoshu, Feb 28 '21 14:02

Hi Fang, thank you for your comment. Do you mean adding --sp at the end of my command? I tried that, and it still does not work. Thank you.

hsauod, Mar 01 '21 15:03

Hi @hsauod, did you try adding --vis_fast? It sometimes seems to get stuck at rendering.

NguyenVanThanhHust, Mar 04 '21 04:03

Has this problem been solved? I have run into the same problem.

zjj-2015, May 11 '21 11:05

Same problem. It gets stuck with the "--gpu -1" flag as well, FYI.

atodniAr, May 12 '21 07:05

I tried building with the PyTorch 1.4 / CUDA 10.1 devel image, following a suggestion in another issue (#677), with no luck.

Then I finally figured it out by testing it in JupyterLab. It has something to do with tqdm: tqdm gets stuck when run from a shell, but not when I run the same shell command in JupyterLab. Try removing the tqdm part of demo_inference.py or rewriting it.

@zjj-2015 @Fang-Haoshu @hsauod

Update: pip install tqdm==4.60 fixes this as well.
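A minimal sketch of the "rewrite the tqdm part" idea above, not the actual AlphaPose code. It falls back to a plain iterator when stderr is not an interactive terminal or when tqdm is not installed. The helper name `progress` is hypothetical, and whether a non-interactive terminal is really what triggers the hang is an assumption.

```python
import sys


def progress(iterable, enabled=None):
    """Wrap `iterable` in a tqdm bar only when that looks safe.

    `enabled=None` means: decide automatically from whether stderr
    (where tqdm draws by default) is an interactive terminal.
    """
    if enabled is None:
        enabled = sys.stderr.isatty()
    if not enabled:
        # No progress bar at all: just hand back the original iterable.
        return iterable
    try:
        from tqdm import tqdm  # imported lazily so the fallback works without tqdm
    except ImportError:
        return iterable
    return tqdm(iterable)


if __name__ == "__main__":
    # Hypothetical stand-in for the loop over input images.
    for i in progress(range(3), enabled=False):
        print("processing item", i)
```

If a loop in demo_inference.py is wrapped in tqdm, swapping that call for a helper like this is one way to test whether the progress bar is what blocks.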

atodniAr, May 12 '21 09:05

Same problem.

pip install tqdm==4.60 still no luck.

korin-lf, May 13 '21 01:05

@korin-lf wrote: "same problem. pip install tqdm==4.60 still no luck"

There is actually another bug, caused by the .ipynb_checkpoints folder that JupyterLab creates, that may cause a similar hang. So check whether there is a .ipynb_checkpoints folder in your examples/demo directory.

I'll sum up things I did to make it work here:

  1. I started with an Ubuntu 16.04 distribution with the CUDA 10.1 driver and a V100 GPU.
  2. apt update && apt install cuda-toolkit-10-1 (do this and step 3 if nvcc/cublas_v2.h/cublas_api.h related problems occur when you build)
  3. ln -s /usr/local/cuda-10.2/targets/x86_64-linux/include/cublas_v2.h /usr/local/cuda-10.1/targets/x86_64-linux/include/cublas_v2.h && ln -s /usr/local/cuda-10.2/targets/x86_64-linux/include/cublas_api.h /usr/local/cuda-10.1/targets/x86_64-linux/include/cublas_api.h
  4. install gcc-7
  5. install torch and torchvision, check compatibility here.
  6. python setup.py build develop --user
  7. update tqdm with pip install tqdm==4.60
  8. check whether there are unwanted hidden folders in your examples/demo folder, and delete them if found (I may open a merge request for this later)
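For step 8, a small sketch that lists what is in an input folder and separates regular image files from everything else (hidden folders such as .ipynb_checkpoints, stray files). The helper name `split_input_dir` and the extension list are assumptions for illustration; this is not how AlphaPose itself scans the folder.

```python
import os

# Assumed set of image extensions; adjust to whatever your inputs use.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}


def split_input_dir(path):
    """Return (images, unexpected) as two sorted lists of entry names."""
    images, unexpected = [], []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        ext = os.path.splitext(name)[1].lower()
        is_image = (
            os.path.isfile(full)
            and not name.startswith(".")  # hidden entries are never inputs
            and ext in IMAGE_EXTS
        )
        (images if is_image else unexpected).append(name)
    return images, unexpected


if __name__ == "__main__":
    imgs, other = split_input_dir("examples/demo")
    print("images:", imgs)
    print("unexpected entries (consider removing):", other)
```

The sketch only reports unexpected entries and leaves deletion to you, since a hidden folder may be something you want to keep elsewhere.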

@korin-lf

atodniAr, May 13 '21 03:05

Thank you. I followed steps 4-7 and got it working on Ubuntu 20.04. Now I am on Ubuntu 18.04 with no Nvidia card. I followed steps 2-8 exactly, with Python 3.6: I removed all CUDA versions, then installed CUDA 10.1 and pytorch==1.7.1 torchvision==0.8.2 cpuonly -c pytorch.

I removed ext_modules=get_ext_modules(), from setup.py, as I am on a CPU-only machine.

Now I am getting:

File "/home/user/gesture_detect/gesture_detect/models/layers/dcn/deform_conv.py", line 11, in <module>
    from . import deform_conv_cuda
ImportError: libcudart.so.10.0: cannot open shared object file: No such file or directory

What is a possible way to resolve this? I tried symlinks, but no luck. Thank you.
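One way to narrow this down is to check directly whether the dynamic loader can find the library named in the error. The sketch below uses only the standard library; `can_load` is a hypothetical helper name. Note that the error asks for libcudart.so.10.0 although CUDA 10.1 was installed, which may mean a previously compiled deform_conv_cuda extension is still being imported; that reading is an assumption, not something the thread confirms.

```python
import ctypes


def can_load(libname):
    """Return True if the dynamic loader can open `libname`, else False."""
    try:
        ctypes.CDLL(libname)
        return True
    except OSError:
        # Raised when the shared object (or one of its dependencies) is not found.
        return False


if __name__ == "__main__":
    for lib in ("libcudart.so.10.0", "libcudart.so.10.1"):
        print(lib, "loadable:", can_load(lib))
```

If neither library loads on a CPU-only machine, the compiled extension cannot be imported there regardless of symlinks to headers.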

korin-lf, May 19 '21 12:05

I found that torch.cuda.is_available() returned False, and I fixed this problem by reinstalling CUDA.

Here is my environment: Ubuntu 16, Python==3.6, PyTorch==1.1.0, CUDA==10.0.130
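The check described above can be run as a short script before starting inference. This is a sketch, with `cuda_report` as a hypothetical helper name; it also reports the CUDA version PyTorch was built with, which is None for CPU-only builds.

```python
def cuda_report():
    """Collect basic PyTorch/CUDA facts without failing if torch is missing."""
    try:
        import torch
    except ImportError:
        return {"torch": None, "cuda_available": False, "built_with_cuda": None}
    return {
        "torch": torch.__version__,
        # False means PyTorch cannot use a GPU in this environment.
        "cuda_available": torch.cuda.is_available(),
        # CUDA version this PyTorch build targets; None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```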

WeijianZhang123, Aug 04 '21 04:08