Inference time slower than PyTorch model
Thank you so much for sharing. I find that the inference speed slows down a lot after using this library, and I don't know if it is an environment problem. The inference result is correct, which makes me very confused. I hope you can analyze this problem.
img_size = (880, 660), batch_size = 8, total_img_num = 3538
- pytorch model: batch_infer_time = 7 ms, total_infer_time = 4460 ms
- modeltrt (Python): batch_infer_time = 60 ms, total_infer_time = 29 s
- modeltrt (C++): total_infer_time = 267 ms
@wyukai Can you share the code of your test inference? The difference may be caused by copying the image to the device. You might have noticed that one must provide an ndarray image as input to the apply method. This implies that under the hood a copy operation is executed (to the device, and back from the device). So, to make a fair comparison, you have to include this copy operation in the PyTorch pipeline as well; a sketch of what that could look like follows below.
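For illustration only (this is not code from the library, just a hypothetical numpy-in / numpy-out PyTorch wrapper), so that PyTorch pays for the same copies that apply performs internally:

```python
import numpy as np
import torch

def pytorch_apply(model, imgs_np):
    """Hypothetical wrapper: accepts a NumPy batch and returns a NumPy result,
    so the host->device and device->host copies are part of the measured path."""
    batch = torch.from_numpy(np.ascontiguousarray(imgs_np)).float()
    if torch.cuda.is_available():
        batch = batch.cuda()          # host -> device copy
    with torch.no_grad():
        out = model(batch)
    return out.cpu().numpy()          # device -> host copy
```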
RT_Test
```python
# (time, numpy as np, and torch are imported elsewhere; pure_infer_time is a module-level counter)
def predict_by_batch_pip(batch_size, imgs, modeltrt):
    try:
        pred_list = []
        imgs = torch.stack(imgs)
        start = time.time()
        output = modeltrt.apply(imgs)
        end = time.time()
        print('batch infer time ::', end - start)
        global pure_infer_time
        pure_infer_time = pure_infer_time + (end - start)
        preds = output[0]
        # split the flat output into per-image score vectors of length 15
        preds = [preds[i:i + 15] for i in range(0, len(preds), 15)]
        pred_list = np.argsort(preds)[:, -1].tolist()
        return pred_list
    except Exception as e:
        print('Exception:', e)
        print('predict by batch pip error')
```
pytorch_test
```python
# (same imports and module-level pure_infer_time counter as above)
def predict_by_batch_pip(batch_size, imgs, model):
    try:
        pred_list = []
        imgs = torch.stack(imgs)
        with torch.no_grad():
            img_per_bath = imgs
            if torch.cuda.is_available():
                # `async=True` only works on Python <= 3.6 / older PyTorch; newer versions use `non_blocking=True`
                input_var = torch.autograd.Variable(img_per_bath.cuda(async=True))
            else:
                input_var = torch.autograd.Variable(torch.from_numpy(np.array(img_per_bath)).float().contiguous())
            s = time.time()
            output = model(input_var)
            print("infer_time :: ", time.time() - s)
            global pure_infer_time
            pure_infer_time = pure_infer_time + (time.time() - s)
            input_var = torch.autograd.Variable(img_per_bath)
            _, pred = output.data.cpu().topk(1, dim=1)
            for i in pred.numpy():
                pred_list.append(i[0])
        return pred_list
    except Exception as e:
        print('Exception:', e)
        print('predict by batch pip error')
```
The above is my inference code; can you analyze it for me?
Test data (3558 images), img_size: 880*660, batch_size = 8, thread = 1, gpu = 0:
- C++ TensorRT FP32: total_infer_time = 267 ms, total_time = 69001 ms
- C++ TensorRT FP16: total_infer_time = 246 ms, total_time = 62728 ms
- Python / PyTorch: total_infer_time = 4.4 s, total_time = 94 s
- Python / TensorRT FP32: total_infer_time = 29 s, total_time = 98 s
Test data (3558 images), img_size: 880*660, batch_size = 16, thread = 1, gpu = 0:
- C++ TensorRT FP32: total_infer_time = 136 ms, total_time = 66004 ms
- C++ TensorRT FP16: total_infer_time = 126 ms, total_time = 60617 ms
- Python / PyTorch: total_infer_time = 3.1 s, total_time = 90 s
- Python / TensorRT FP16: total_infer_time = 15 s, total_time = 81 s
Summary (batch_size = 16, FP16):
- C++ TensorRT: total_time = 60 s
- Python TensorRT: total_time = 81 s
- PyTorch: total_time = 90 s
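For reference, the totals above can be converted into a rough per-image figure (this snippet is just illustrative arithmetic over the numbers reported in this thread):

```python
# Approximate end-to-end time per image for the batch_size = 16, FP16 runs above.
num_images = 3558

totals_s = {
    "C++ TensorRT": 60.0,
    "Python TensorRT": 81.0,
    "PyTorch": 90.0,
}

for name, total in totals_s.items():
    print(f"{name}: {total / num_images * 1000:.1f} ms per image")
```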
@wyukai Try running something like this for your PyTorch code:
```python
def predict_by_batch_pip(batch_size, imgs, model):
    try:
        pred_list = []
        imgs = torch.stack(imgs)
        s = time.time()
        with torch.no_grad():
            img_per_bath = imgs
            if torch.cuda.is_available():
                input_var = torch.autograd.Variable(img_per_bath.cuda(async=True))
            else:
                input_var = torch.autograd.Variable(torch.from_numpy(np.array(img_per_bath)).float().contiguous())
            output = model(input_var)
            _, pred = output.data.cpu().topk(1, dim=1)
            # convert your preds to numpy here
        print("infer_time :: ", time.time() - s)
        global pure_infer_time
        pure_infer_time = pure_infer_time + (time.time() - s)
        # for i in pred.numpy():
        #     pred_list.append(i[0])
        return pred_list
    except Exception as e:
        print('Exception:', e)
        print('predict by batch pip error')
```
The key idea is that for a fair comparison you must take memory copying into account.
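A related caveat (an addition of mine, not something stated above): CUDA kernels launch asynchronously, so a pair of time.time() calls around model(input_var) alone mostly measures the launch cost; the device must be synchronized, or the result copied back to the host, before stopping the clock. A minimal sketch of a synchronized timing helper:

```python
import time
import torch

def timed_forward(model, batch_gpu):
    """Time one forward pass with explicit device synchronization."""
    torch.cuda.synchronize()          # make sure previously queued work is done
    start = time.time()
    with torch.no_grad():
        output = model(batch_gpu)
    torch.cuda.synchronize()          # wait until the forward pass actually finishes
    return output, time.time() - start
```

Without the final synchronize (or an equivalent `.cpu()` copy inside the timed region), the 7 ms PyTorch batch time reported above may understate the real GPU execution time.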
I understand what you mean, but the inference time of Python TensorRT is still behind that of C++ TensorRT. Is this normal?
Perhaps this gap between C++ and Python TRT can be closed by providing the proper optimization level in CMakeLists.txt. I have just pushed some modifications to CMakeLists.txt in the master branch. I'll run my own tests as soon as I can, but if you can't wait, you can build the lib and check for yourself.
Can I compile this library on Windows 10 with Python 3.6? I have tried to compile it many times, but all attempts failed. The test results above use the prebuilt model that you provided.
Could you please tell me your email address so that we can communicate more conveniently?
Hello. Have you continued to test this speed problem?
> Perhaps this gap between C++ and Python TRT can be closed by providing the proper optimization level in CMakeLists.txt. I have just pushed some modifications to CMakeLists.txt in the master branch. I'll run my own tests as soon as I can, but if you can't wait, you can build the lib and check for yourself.
@wyukai Yes, I have. I have tested a few different compiler options that specify GPU architectures. Unfortunately, I haven't noticed any notable boost from them. I also spent some time profiling the code, though I failed to improve the performance there as well. I assume that these performance limitations are caused by Python itself.
Concerning your compilation problems: I faced that kind of problem when the installed Visual Studio version was incompatible with the version of CUDA.