AlphaPose

High CPU Usage, not fully use GPU

Open zhongyi-zhou opened this issue 5 years ago • 35 comments

Running uses 100% of my 10-core CPU but only about 50% of my 2080 Ti, with GPU utilization fluctuating between roughly 20% and 80%. I guess there is some heavy CPU-side computation that dramatically slows everything down.

Any idea on how to fix this?

zhongyi-zhou avatar Oct 15 '19 13:10 zhongyi-zhou

The result from --profile is:

det time: 0.058 | pose time: 0.08 | post processing: 0.0616

zhongyi-zhou avatar Oct 15 '19 13:10 zhongyi-zhou

Hi, cropping every single person from the image uses a lot of CPU. How many people are there per image in your case?

Fang-Haoshu avatar Oct 17 '19 14:10 Fang-Haoshu

3 people. By cropping, do you mean part of the person bbox detection step?

Also, I noticed that the first several iterations report a det time of 0, which is a bit strange.

zhongyi-zhou avatar Oct 17 '19 16:10 zhongyi-zhou

It's a bit weird. Are you running with --sp?

Fang-Haoshu avatar Oct 18 '19 05:10 Fang-Haoshu

@Fang-Haoshu Yes. Otherwise, errors occur.

zhongyi-zhou avatar Oct 18 '19 20:10 zhongyi-zhou

Are you running under Windows? If under Linux, what error would occur?

Fang-Haoshu avatar Oct 19 '19 15:10 Fang-Haoshu

No, I am on Linux. The error is the same as discussed in this issue: I notice that another person in that issue, @GuoHaiYang123, has the same problem as me.

zhongyi-zhou avatar Oct 20 '19 10:10 zhongyi-zhou

I guess the latest pytorch branch has fixed it? Are you running with the latest code?

Fang-Haoshu avatar Oct 26 '19 04:10 Fang-Haoshu

I can't read Chinese unfortunately, but I am experiencing the same issue. A 9700K (8 cores running at 100%) is bottlenecking a GTX 1070 (70% GPU utilization). I also have another setup where a weaker mobile CPU (4 cores running at 100%) is bottlenecking a 2070 eGPU (30% utilization). Windows in both cases.

This is the observed behaviour whether I use the webcam or the video processing script. With one human in the frame I get 13 fps on the first setup and 6 fps on the second.

trekze avatar Oct 27 '19 14:10 trekze

Here is a visualization of the CPU usage, using snakeviz:

python -m cProfile -o temp.dat --conf 0.5 --nms 0.45 --inp_dim 480 --sp --video p1.mp4
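For anyone who wants the same kind of profile as plain text rather than a snakeviz chart, here is a minimal sketch using only the standard-library cProfile and pstats modules. `busy_step` and `profile_call` are hypothetical names for illustration; `busy_step` is a stand-in for the real per-frame work, not AlphaPose code.

```python
import cProfile
import io
import pstats


def busy_step(n):
    """Hypothetical stand-in for per-frame CPU work (not AlphaPose code)."""
    return sum(i * i for i in range(n))


def profile_call(func, *args, top=5):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()

    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return buf.getvalue()


report = profile_call(busy_step, 200_000)
print(report)
```

A file written with `python -m cProfile -o temp.dat ...` can be loaded the same way by passing its name to `pstats.Stats("temp.dat")`.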


trekze avatar Oct 27 '19 15:10 trekze

and the profile figures:

det time: 0.018 | pose time: 0.06 | post processing: 0.0029
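As a rough reading of these figures, the sketch below adds up the three stages and computes each stage's share. It assumes the stages run strictly back to back for every frame, which may not hold if the pipeline overlaps them; `stage_breakdown` is an illustrative helper, not part of AlphaPose.

```python
def stage_breakdown(det, pose, post):
    """Return (total seconds per frame, implied fps, share of each stage)."""
    stages = {"det": det, "pose": pose, "post": post}
    total = sum(stages.values())
    shares = {name: t / total for name, t in stages.items()}
    return total, 1.0 / total, shares


# Figures from the --profile line above.
total, fps, shares = stage_breakdown(0.018, 0.06, 0.0029)
print(f"total per frame: {total:.4f} s, implied rate: {fps:.1f} fps")
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

Under that assumption the figures imply roughly 12 frames per second, in the same ballpark as the 13 fps reported above for the first setup, with pose time as the largest share.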

trekze avatar Oct 27 '19 15:10 trekze

Hi @hmexx, if this problem happens for the video, I guess it's related to video decoding. Perhaps you need to install a video decoder. Decoding can consume a lot of CPU.

Fang-Haoshu avatar Oct 28 '19 07:10 Fang-Haoshu

Hi @Fang-Haoshu

This cannot be a video decoding issue. It happens with both video files and the webcam, and the video is an uncompressed 480p video that can be decoded with less than 1% of the available CPU power. Also, the profile figures show that most of the time is spent in pose time:

Any other ideas?


trekze avatar Oct 28 '19 08:10 trekze

@Fang-Haoshu Could it be that AlphaPose is CPU-bound where there are few people in each frame (e.g. 1) ? Are all your test runs with many people (e.g. 4 mentioned in

All our videos have 1 person.

trekze avatar Oct 28 '19 10:10 trekze

Hi, both video and webcam use cv2.VideoCapture: I think that is the point they share. AlphaPose should use less CPU with fewer people. It only uses more CPU when there are many people, because each person has to be cropped from the image.

Fang-Haoshu avatar Oct 30 '19 11:10 Fang-Haoshu

On my laptop, it consumes little CPU when running the webcam demo. So I still guess it's related to cv2.VideoCapture.

Fang-Haoshu avatar Oct 30 '19 11:10 Fang-Haoshu

It's not VideoCapture. I've measured it and it takes very little CPU. Is your laptop running Linux or Windows? How many people are in each frame you are testing with?
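One way to check a claim like this is to compare CPU time with wall-clock time for a single step. A minimal sketch with the standard library follows; `measure` and `spin` are hypothetical helpers, and `time.process_time` only counts the current process, so work done in child processes would need to be measured separately.

```python
import time


def measure(func, *args):
    """Return (wall_seconds, cpu_seconds) for one call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def spin(n):
    """CPU-bound placeholder loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# A step that mostly waits uses little CPU time; a busy loop uses a lot.
wall_sleep, cpu_sleep = measure(time.sleep, 0.2)
wall_spin, cpu_spin = measure(spin, 500_000)
print(f"sleep: wall={wall_sleep:.3f}s cpu={cpu_sleep:.3f}s")
print(f"spin:  wall={wall_spin:.3f}s cpu={cpu_spin:.3f}s")
```

Wrapping a frame-read call in `measure` would show whether that step is waiting on I/O or actually burning CPU.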

We've managed to get a 3x fps increase using the following strategy:

  1. Concatenate 4 image frames, and YOLO/pose them simultaneously (to increase the number of poses per frame from 1 to 4). This increases the GPU utilization.
  2. Optimize the cropBox method, where most of the CPU time is spent. I'm including it below. I've removed the warpAffine call. Feel free to reuse, and do let me know if you think this alternative implementation will break pose estimation in certain cases. It seems to work ok for us.


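import cv2
import numpy as np
import torch

# torch_to_im and im_to_torch are AlphaPose's existing image helpers,
# assumed to be in scope here as they are for the original cropBox.
def cropBox_fast(img, ul, br, resH, resW):
    ul = ul.int()
    br = (br - 1).int()
    # Target crop size that keeps the resH:resW aspect ratio
    lenH = max((br[1] - ul[1]).item(), (br[0] - ul[0]).item() * resH / resW)
    lenW = lenH * resW / resH
    if img.dim() == 2:
        img = img[np.newaxis, :]

    # Slice the box out directly, then zero-pad to lenH x lenW (replaces warpAffine)
    crop_img = img[:, ul[1]:br[1], ul[0]:br[0]]
    pad_h = lenH - crop_img.shape[1]
    pad_w = lenW - crop_img.shape[2]
    pad_img = cv2.copyMakeBorder(torch_to_im(crop_img), int(pad_h/2), int(pad_h/2), int(pad_w/2), int(pad_w/2), cv2.BORDER_CONSTANT, value=(0, 0, 0))
    sized_img = cv2.resize(pad_img, (resW, resH), interpolation=cv2.INTER_NEAREST)
    return im_to_torch(torch.Tensor(sized_img))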
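
Step 1 above (tiling four frames into one image before detection) also needs the detected boxes mapped back to their source frames afterwards. A minimal sketch of that bookkeeping, assuming a 2x2 grid of equally sized frames: `tile_offsets` and `assign_box` are hypothetical helpers, not AlphaPose functions, and a box that straddles a tile border is simply assigned by its center.

```python
def tile_offsets(frame_w, frame_h, cols=2, rows=2):
    """Top-left (x, y) offset of each frame inside the tiled image, row-major."""
    return [(c * frame_w, r * frame_h) for r in range(rows) for c in range(cols)]


def assign_box(box, frame_w, frame_h, cols=2, rows=2):
    """Map a box (x1, y1, x2, y2) in tiled coordinates to (frame_index, local_box)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Pick the tile that contains the box center, clamped to the grid.
    col = min(max(int(cx // frame_w), 0), cols - 1)
    row = min(max(int(cy // frame_h), 0), rows - 1)
    ox, oy = col * frame_w, row * frame_h
    # Shift into the source frame's coordinates and clamp to its bounds.
    local = (
        max(0, x1 - ox),
        max(0, y1 - oy),
        min(frame_w, x2 - ox),
        min(frame_h, y2 - oy),
    )
    return row * cols + col, local


offsets = tile_offsets(640, 480)
idx, local_box = assign_box((700, 500, 900, 900), 640, 480)
print(offsets)
print(idx, local_box)
```

Assigning by center is the simplest choice; a person cut by a tile border would be truncated, so leaving a margin between tiles may be worth considering.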

trekze avatar Oct 30 '19 17:10 trekze

@hmexx @Fang-Haoshu How about we pick a random video from YouTube, all test on that same video, and share the results here?

zhongyi-zhou avatar Oct 31 '19 00:10 zhongyi-zhou

Oh I see. Thanks @hmexx! Good idea Joey. How about trying this video?

Fang-Haoshu avatar Nov 02 '19 04:11 Fang-Haoshu

Oh wait, I found that it also consumes a lot of CPU on my side 😂 Sorry for my wrong statements earlier 😂 I did not notice because I was using a server with 56 CPU cores. Now it consumes about 1000% CPU usage.

I guess we will need to optimize CPU utilization after the CVPR deadline.

Fang-Haoshu avatar Nov 02 '19 05:11 Fang-Haoshu

@Fang-Haoshu Good luck with CVPR! I am glad to help with this if you need it. Let me know when you start this optimization.

zhongyi-zhou avatar Nov 05 '19 01:11 zhongyi-zhou

Here's how to further speed up cropping. Another 2x speed-up:

import numpy as np
import torch.nn.functional as F

def cropBox_fast(img, ul, br, resH, resW):
    ul = ul.int()
    br = (br - 1).int()
    # Target crop size that keeps the resH:resW aspect ratio
    lenH = max((br[1] - ul[1]).item(), (br[0] - ul[0]).item() * resH / resW)
    lenW = lenH * resW / resH
    if img.dim() == 2:
        img = img[np.newaxis, :]
    crop_img = img[:, ul[1]:br[1], ul[0]:br[0]]
    pad_h = lenH - crop_img.shape[1]
    pad_w = lenW - crop_img.shape[2]
    # Pad and resize with torch ops so the crop stays on the tensor's device
    pad_img = F.pad(crop_img, (int(pad_w/2), int(pad_w/2), int(pad_h/2), int(pad_h/2)), 'constant', 0)
    sized_img = F.interpolate(pad_img.unsqueeze(0), size=(resH, resW)).squeeze(0)
    return sized_img

import torch

# opt is AlphaPose's global options object, assumed to be in scope.
def crop_from_dets(img, boxes, inps, pt1, pt2):
    '''
    Crop human from original image according to Detection Results
    '''
    imght = img.size(1)
    imgwidth = img.size(2)
    # Move the frame to the GPU once so every crop below runs there
    tmp_img = img
    tmp_img = tmp_img.cuda()
    for i, box in enumerate(boxes):
        upLeft = torch.Tensor(
            (float(box[0]), float(box[1])))
        bottomRight = torch.Tensor(
            (float(box[2]), float(box[3])))

        ht = bottomRight[1] - upLeft[1]
        width = bottomRight[0] - upLeft[0]

        # Grow the box by 30% in total, clamped to the image, at least 5 px per side
        scaleRate = 0.3
        upLeft[0] = max(0, upLeft[0] - width * scaleRate / 2)
        upLeft[1] = max(0, upLeft[1] - ht * scaleRate / 2)

        bottomRight[0] = max(
            min(imgwidth - 1, bottomRight[0] + width * scaleRate / 2), upLeft[0] + 5)
        bottomRight[1] = max(
            min(imght - 1, bottomRight[1] + ht * scaleRate / 2), upLeft[1] + 5)
        try:
            c = tmp_img.clone()
            inps[i] = cropBox_fast(c, upLeft, bottomRight, opt.inputResH, opt.inputResW)
        except IndexError:
            # The handler body is missing from the post; logging the box is an assumption.
            print(tmp_img.shape, upLeft, bottomRight)
        pt1[i] = upLeft
        pt2[i] = bottomRight

    return inps, pt1, pt2
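
To make the box arithmetic in crop_from_dets and cropBox_fast concrete, here is the same calculation on plain numbers, as a sketch. Each side of the box grows by scaleRate / 2 of the box size, the result is clamped to the image and kept at least 5 px wide and tall, and the crop is then sized to match the target aspect ratio. `expand_box` and `padded_size` are illustrative names, and the 320x256 target is only an example value, not something confirmed in this thread.

```python
def expand_box(box, img_w, img_h, scale_rate=0.3, min_size=5):
    """Grow (x1, y1, x2, y2) by scale_rate in total, clamped to the image."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    nx1 = max(0, x1 - w * scale_rate / 2)
    ny1 = max(0, y1 - h * scale_rate / 2)
    nx2 = max(min(img_w - 1, x2 + w * scale_rate / 2), nx1 + min_size)
    ny2 = max(min(img_h - 1, y2 + h * scale_rate / 2), ny1 + min_size)
    return nx1, ny1, nx2, ny2


def padded_size(box_w, box_h, res_h, res_w):
    """Smallest (width, height) with aspect ratio res_w:res_h that covers the box."""
    len_h = max(box_h, box_w * res_h / res_w)
    len_w = len_h * res_w / res_h
    return len_w, len_h


# A 200x400 box in a 640x480 frame; the top and bottom edges hit the image border.
expanded = expand_box((100, 50, 300, 450), 640, 480)
len_w, len_h = padded_size(expanded[2] - expanded[0], expanded[3] - expanded[1], 320, 256)
print(expanded)
print(len_w, len_h)
```

The difference between the padded size and the actual crop is what the padding call fills with zeros before the resize.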

trekze avatar Nov 08 '19 20:11 trekze

There's also some code that speeds up the YOLO NMS process slightly if anyone is interested.

trekze avatar Nov 09 '19 11:11 trekze

@hmexx Please post the speed-up code.

arvindixonos avatar Nov 23 '19 13:11 arvindixonos

@hmexx Hi! Many thanks! We are now actively working on a new version of AlphaPose and would like to include the speed-up. Would you mind sharing the code with us? Many thanks!

Fang-Haoshu avatar Dec 12 '19 14:12 Fang-Haoshu

Hi there. Sorry, just saw this. Most of the speed-up is the code above. We changed a tiny bit more in the NMS. Do you want that bit?

trekze avatar Jan 12 '20 12:01 trekze

@hmexx Hi, can you post the NMS change code? It would be very useful. Thanks.

Ndron avatar Jan 16 '20 13:01 Ndron

Is there any update on moving the CPU load to the GPU? @Fang-Haoshu

zhongyi-zhou avatar Mar 11 '20 08:03 zhongyi-zhou

I'm getting this issue too with --video (which I assume was --detbatch, --posebatch and --sp doesn't help. --sp helps but my OS eventually grinds to a halt on longer videos. I'm running 12 GB RAM and an 8-core CPU with an RTX 2070 Super.

Happy to help get it resolved in any way I can. I'm newish to Python, but not to coding, so I'm sure I can contribute something of use. Great project.

jmwill86 avatar Mar 28 '20 22:03 jmwill86

Any update on this issue? When I run it uses about 12% of the GPU (RTX 3090) and 100% of the CPU.

samymdihi avatar May 18 '21 09:05 samymdihi