
inference time slower than pytorch model

Open wyukai opened this issue 5 years ago • 10 comments

Thank you so much for sharing. I find that inference slows down a lot after using this library, and I don't know if it is an environment problem. The inference results are correct, which makes me very confused. I hope you can help analyze this problem.

img_size (880, 660), batch_size = 8, total_img_num = 3538

pytorch model:  batch_infer_time = 7 ms     total_infer_time = 4460 ms
modeltrt:       batch_infer_time = 60 ms    total_infer_time = 29 s
C++ modeltrt:   total_infer_time = 267 ms
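
(As a sanity check, these figures are internally consistent: 3538 images at batch size 8 is about 443 batches, and 443 × 60 ms ≈ 26.5 s, close to the reported 29 s total for modeltrt.)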

wyukai avatar Aug 28 '20 09:08 wyukai

@wyukai Can you share the code of your test inference? The difference may be caused by copying the image to the device. You might have noticed that one must provide an ndarray image as input to the apply method. This implies that under the hood a copy operation is executed (to the device, and back from the device). So, to make a fair comparison, you have to include this copy operation in the pytorch pipeline.
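
For illustration, a minimal sketch of a copy-inclusive pytorch timing (assuming a CUDA device; timed_forward is a hypothetical helper, not part of this library). The explicit torch.cuda.synchronize() also matters, because CUDA kernels launch asynchronously, so time.time() alone can under-report GPU time:

import time
import torch

def timed_forward(model, imgs):
    # count the host-to-device copy, since modeltrt.apply() pays this cost internally
    start = time.time()
    batch = imgs.cuda(non_blocking=True)
    with torch.no_grad():
        output = model(batch)
    preds = output.cpu()          # device-to-host copy back
    torch.cuda.synchronize()      # wait for all queued GPU work to finish
    print('copy + infer time ::', time.time() - start)
    return preds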

KorovkoAlexander avatar Sep 03 '20 10:09 KorovkoAlexander

RT_Test

# imports assumed by both snippets below
import time

import numpy as np
import torch

pure_infer_time = 0.0  # accumulated pure inference time across batches

def predict_by_batch_pip(batch_size, imgs, modeltrt):
    try:
        imgs = torch.stack(imgs)
        start = time.time()
        # apply() copies the batch to the device and the output back internally
        output = modeltrt.apply(imgs)
        end = time.time()
        print('batch infer time ::', end - start)
        global pure_infer_time
        pure_infer_time = pure_infer_time + (end - start)
        preds = output[0]
        # split the flat output into per-image rows of 15 class scores
        preds = [preds[i:i + 15] for i in range(0, len(preds), 15)]
        # index of the highest score per image
        pred_list = np.argsort(preds)[:, -1].tolist()
        return pred_list
    except Exception as e:
        print('Exception:', e)
        print('predict by batch pip error')

pytorch_test

def predict_by_batch_pip(batch_size, imgs, model):
    try:
        pred_list = []
        imgs = torch.stack(imgs)
        with torch.no_grad():
            img_per_batch = imgs
            if torch.cuda.is_available():
                # non_blocking= replaces the deprecated async= keyword argument
                input_var = torch.autograd.Variable(img_per_batch.cuda(non_blocking=True))
            else:
                input_var = torch.autograd.Variable(torch.from_numpy(np.array(img_per_batch)).float().contiguous())
            s = time.time()
            output = model(input_var)
            print("infer_time :: ", time.time() - s)
            global pure_infer_time
            pure_infer_time = pure_infer_time + (time.time() - s)
            # copy results back to host and take the top-1 class per image
            _, pred = output.data.cpu().topk(1, dim=1)
            for i in pred.numpy():
                pred_list.append(i[0])
        return pred_list
    except Exception as e:
        print('Exception:', e)
        print('predict by batch pip error')

The above is my inference code; can you analyze it for me?

wyukai avatar Sep 03 '20 10:09 wyukai

test data (3558 images), img_size: 880*660, batch_size = 8, threads = 1, gpu = 0:

C++ / TensorRT
FP32    total_infer_time = 267 ms    total_time = 69001 ms
FP16    total_infer_time = 246 ms    total_time = 62728 ms

python / pytorch
total_infer_time = 4.4 s    total_time = 94 s

python / TensorRT
FP32    total_infer_time = 29 s    total_time = 98 s

test data (3558 images), img_size: 880*660, batch_size = 16, threads = 1, gpu = 0:

C++ / TensorRT
FP32    total_infer_time = 136 ms    total_time = 66004 ms
FP16    total_infer_time = 126 ms    total_time = 60617 ms

python / pytorch
total_infer_time = 3.1 s    total_time = 90 s

python / TensorRT
FP16    total_infer_time = 15 s    total_time = 81 s

Summary (batch_size = 16, FP16):
    C++ TensorRT     total_time = 60 s
    python TensorRT  total_time = 81 s
    pytorch          total_time = 90 s

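Worth noting: in both Python pipelines most of the wall time falls outside pure inference (15 s of 81 s for TensorRT FP16, 3.1 s of 90 s for pytorch at batch size 16), so data loading and preprocessing dominate the totals.
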
wyukai avatar Sep 03 '20 11:09 wyukai

@wyukai Try to run something like this for your pytorch code:

def predict_by_batch_pip(batch_size, imgs, model):
    try:
        pred_list = []
        imgs = torch.stack(imgs)

        s = time.time()

        with torch.no_grad():
            img_per_batch = imgs
            if torch.cuda.is_available():
                input_var = torch.autograd.Variable(img_per_batch.cuda(non_blocking=True))
            else:
                input_var = torch.autograd.Variable(torch.from_numpy(np.array(img_per_batch)).float().contiguous())
            output = model(input_var)

            _, pred = output.data.cpu().topk(1, dim=1)
            # convert your preds to numpy here

            print("infer_time :: ", time.time() - s)
            global pure_infer_time
            pure_infer_time = pure_infer_time + (time.time() - s)

            # for i in pred.numpy():
            #     pred_list.append(i[0])
        return pred_list
    except Exception as e:
        print('Exception:', e)
        print('predict by batch pip error')

The key idea is that for a fair comparison you must take memory copying into account.
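
As a side note, a more robust way to time just the GPU work in pytorch is with CUDA events, which account for asynchronous kernel launches; a minimal sketch, independent of this library (model and input_var as in the snippet above):

import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
with torch.no_grad():
    output = model(input_var)
end.record()
torch.cuda.synchronize()          # block until both events have completed
print('gpu time (ms) ::', start.elapsed_time(end))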

KorovkoAlexander avatar Sep 03 '20 12:09 KorovkoAlexander

I understand what you mean, but the inference time of Python TensorRT is still far behind that of C++ TensorRT. Is this normal?

wyukai avatar Sep 03 '20 13:09 wyukai

Perhaps this gap between C++ and Python TRT can be closed by setting the proper optimization level in CMakeLists.txt. I have just pushed some modifications to CMakeLists.txt on the master branch. I'll run my own tests as soon as I can, but if you can't wait, you can build the lib and check yourself.

KorovkoAlexander avatar Sep 03 '20 14:09 KorovkoAlexander

Can I compile this library on Windows 10 with Python 3.6? I have tried to compile it many times, but all attempts failed. The above test results use the prebuilt model that you provided.

wyukai avatar Sep 04 '20 02:09 wyukai

Could you please tell me your email address so that we can communicate more conveniently?

wyukai avatar Sep 04 '20 02:09 wyukai

Hello. Have you continued to test this speed problem?

> Perhaps this gap between C++ and Python TRT can be closed by setting the proper optimization level in CMakeLists.txt. I have just pushed some modifications to CMakeLists.txt on the master branch. I'll run my own tests as soon as I can, but if you can't wait, you can build the lib and check yourself.

wyukai avatar Sep 23 '20 09:09 wyukai

@wyukai Yes, I have. I tested a few different compiler options that specify GPU architectures. Unfortunately, I haven't noticed any notable boost from them. I also spent some time profiling the code, though I likewise failed to improve the performance. I assume these performance limitations are caused by Python itself.
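
One way to probe the "Python itself" hypothesis is to separate the fixed per-call cost (binding dispatch plus copies) from the compute that scales with the data. A rough sketch, assuming modeltrt.apply as used earlier in this thread; tiny_batch and full_batch are hypothetical placeholder inputs of batch size 1 and 8:

import time

def avg_call_time(modeltrt, batch, n=50):
    modeltrt.apply(batch)             # warm-up call, excludes lazy initialization
    start = time.time()
    for _ in range(n):
        modeltrt.apply(batch)
    return (time.time() - start) / n

# t_small approximates the fixed Python/binding overhead per call;
# t_full - t_small approximates the copy + compute cost that grows with the data
t_small = avg_call_time(modeltrt, tiny_batch)
t_full = avg_call_time(modeltrt, full_batch)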

Concerning your compilation problems: I faced that kind of problem when the installed Visual Studio version was incompatible with the CUDA version.

KorovkoAlexander avatar Sep 23 '20 11:09 KorovkoAlexander