How to correctly format input for Fp16 inference using torch-tensorrt C++
❓ Question
What you have already tried
Hi, I am using the following code to export a TorchScript model to FP16 TensorRT, which will then be used in a C++ environment.
network.load_state_dict(torch.load(path_weights, map_location="cuda:0"))
network.eval().cuda()
dummy_input = torch.rand(1, 6, 320, 224).cuda()
network_traced = torch.jit.trace(network, dummy_input) # converting to plain torchscript
# convert/ compile to trt
compile_settings = {
"inputs": [torchtrt.Input([1, 6, 320, 224])],
"enabled_precisions": {torch.half},
"workspace_size": 6 << 22
}
trt_ts_module = torchtrt.compile(network_traced, inputs=[torchtrt.Input((1, 6, 320, 224), dtype=torch.half)],
enabled_precisions={torch.half},
workspace_size=6<<22)
torch.jit.save(trt_ts_module, trt_ts_save_path)
Is this correct?
If yes, then what is the correct way to cast the input tensor in C++? Do I need to convert it to torch::kHalf explicitly, or can the input stay as FP32?
Please let me know.
Here is my code for loading the CNN for inference:
try {
    // Deserialize the ScriptModule from a file using torch::jit::load().
    trt_ts_mod_cnn = torch::jit::load(trt_ts_module_path);
    trt_ts_mod_cnn.to(torch::kCUDA);
    cout << trt_ts_mod_cnn.type() << endl;
    cout << trt_ts_mod_cnn.dump_to_str(true, true, false) << endl;
} catch (const c10::Error& e) {
    std::cerr << "error loading the model from : " << trt_ts_module_path << std::endl;
    // return -1;
}
auto inBEVInference = torch::rand({1, bevSettings.N_CHANNELS_BEV, bevSettings.N_ROWS_BEV, bevSettings.N_COLS_BEV},
                                  {at::kCUDA}).to(torch::kFloat32);
// auto inBEVInference = torch::rand({1, bevSettings.N_CHANNELS_BEV, bevSettings.N_ROWS_BEV, bevSettings.N_COLS_BEV},
//                                   {at::kCUDA}).to(torch::kFloat16);
std::vector<torch::jit::IValue> trt_inputs_ivalues;
trt_inputs_ivalues.push_back(inBEVInference);
auto outputs = trt_ts_mod_cnn.forward(trt_inputs_ivalues).toTuple();
auto kp = outputs->elements()[0].toTensor();
auto hwl = outputs->elements()[1].toTensor();
auto rot = outputs->elements()[2].toTensor();
auto dxdy = outputs->elements()[3].toTensor();
cout << "Size KP out -> " << kp.sizes() << endl;
cout << "Size HWL out -> " << hwl.sizes() << endl;
cout << "Size ROT out -> " << rot.sizes() << endl;
cout << "Size DXDY out -> " << dxdy.sizes() << endl;
Environment
Build information about Torch-TensorRT can be found by turning on debug messages
- PyTorch Version (e.g., 1.0): 1.11.0+cu113
- CPU Architecture: x86_64
- OS (e.g., Linux): Linux, Ubuntu 20.04, docker container
- How you installed PyTorch (conda, pip, libtorch, source): pip
- Build command you used (if compiling from source):
- Are you using local sources or building from archives: local
- Python version: 3.8.10
- CUDA version: Cuda compilation tools, release 11.4, V11.4.152 (on the linux system)
- GPU models and configuration: RTX2080 MaxQ
- Any other relevant information:
The way you build your model seems fine, but it looks like you explicitly set the input to expect FP16 and then provide an FP32 tensor during inference. You can either set it to expect an FP32 input during compilation, or use the commented-out line in your C++ code to provide an FP16 input tensor.
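For illustration, here is a minimal sketch of the second option. It assumes the module was compiled with `torchtrt.Input(..., dtype=torch.half)` and reuses the variable names from your inference snippet; the input tensor is created directly as FP16 on the GPU, so no extra cast is needed at inference time.

```cpp
// Sketch: feed an FP16 (half) input to a module compiled with an FP16 input spec.
// Shapes and variable names follow the inference snippet above.
auto inBEVInference = torch::rand(
    {1, bevSettings.N_CHANNELS_BEV, bevSettings.N_ROWS_BEV, bevSettings.N_COLS_BEV},
    torch::TensorOptions().dtype(torch::kHalf).device(torch::kCUDA));

std::vector<torch::jit::IValue> trt_inputs_ivalues;
trt_inputs_ivalues.push_back(inBEVInference);
auto outputs = trt_ts_mod_cnn.forward(trt_inputs_ivalues).toTuple();
```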
Thank you for your quick response. Here is what I did:
1) Converting to torch-tensorrt via Python: FP16
- I removed the explicit dtype specification
def convert_trained_model_to_tensorrt_script_fp16(network, path_weights, trt_ts_save_path):
    """ Convert the model instance and weights to tensorrt optimized script """
    network.load_state_dict(torch.load(path_weights, map_location="cuda:0"))
    network.eval()
    dummy_input = torch.rand(1, 6, 320, 224).cuda()
    network_traced = torch.jit.trace(network, dummy_input)  # converting to plain torchscript
    # convert/ compile to trt
    compile_settings = {
        "inputs": [torchtrt.Input([1, 6, 320, 224])],
        "enabled_precisions": {torch.float, torch.half},
        "workspace_size": 5 << 22
    }
    # trt_ts_module = torchtrt.compile(network_traced, dtype=torch.float16, **compile_settings)
    # trt_ts_module = torchtrt.compile(network_traced, inputs=[torchtrt.Input((1, 6, 320, 224), dtype=torch.half)],
    #                                  enabled_precisions={torch.float, torch.half},
    #                                  workspace_size=5 << 22)
    trt_ts_module = torchtrt.compile(network_traced, inputs=[torchtrt.Input((1, 6, 320, 224))],
                                     enabled_precisions={torch.half},
                                     workspace_size=5 << 22)
    res = trt_ts_module.forward(dummy_input)
    torch.jit.save(trt_ts_module, trt_ts_save_path)
2) Inference part
void cnnRunInference(pointCloudData& pcl3d, const char* trt_ts_module_path, s_bevSettings& bevSettings,
                     predictedObjects& predObjects, timeTraceRecord& tRecord)
{
static auto RUN_COUNT = 0;
static auto CNN_INIT_DONE = 0;
static torch::jit::Module trt_ts_mod_cnn;
// static torch::Tensor bevInferenceInput = torch::zeros({1, bevSettings.N_CHANNELS_BEV, bevSettings.N_ROWS_BEV, bevSettings.N_COLS_BEV}, torch::dtype(torch::kHalf));
// 1. Check and init CNN if not done already
if(!CNN_INIT_DONE)
{
try {
// Deserialize the ScriptModule from a file using torch::jit::load().
trt_ts_mod_cnn = torch::jit::load(trt_ts_module_path);
trt_ts_mod_cnn.to(torch::kCUDA);
cout << trt_ts_mod_cnn.type() << endl;
cout << trt_ts_mod_cnn.dump_to_str(true, true, false) << endl;
} catch (const c10::Error& e) {
std::cerr << "error loading the model from : " << trt_ts_module_path << std::endl;
// return -1;
}
auto inBEVInference = torch::rand({1, bevSettings.N_CHANNELS_BEV, bevSettings.N_ROWS_BEV, bevSettings.N_COLS_BEV},
                                  {at::kCUDA}).to(torch::kFloat32);
// auto inBEVInference = torch::rand({1, bevSettings.N_CHANNELS_BEV, bevSettings.N_ROWS_BEV, bevSettings.N_COLS_BEV},
//                                   {at::kCUDA}).to(torch::kHalf);
std::vector<torch::jit::IValue> trt_inputs_ivalues;
trt_inputs_ivalues.push_back(inBEVInference);
auto outputs = trt_ts_mod_cnn.forward(trt_inputs_ivalues).toTuple();
auto kp = outputs->elements()[0].toTensor();
auto hwl = outputs->elements()[1].toTensor();
auto rot = outputs->elements()[2].toTensor();
auto dxdy = outputs->elements()[3].toTensor();
cout << "Size KP out -> " << kp.sizes() << endl;
cout << "Size HWL out -> " << hwl.sizes() << endl;
cout << "Size ROT out -> " << rot.sizes() << endl;
cout << "Size DXDY out -> " << dxdy.sizes() << endl;
torch::cuda::synchronize();
CNN_INIT_DONE = 1;
printf("=========== CNN load completed successfully ==================\n");
}
// 2. get BEV array
// auto t1 = high_resolution_clock::now();
array4DFloat bevInference(1, bevSettings.N_CHANNELS_BEV, bevSettings.N_ROWS_BEV, bevSettings.N_COLS_BEV, 0.0f);
getBevFromPointCloud(pcl3d, bevSettings, bevInference, 0, true); // returns normalized BEV
// auto t2 = high_resolution_clock::now();
// auto ms_int = duration_cast<milliseconds>(t2 - t1);
// tRecord.addFunctionTimes("bevCreation", ms_int.count());
// printf("BEV creation time : %ld ms\n", ms_int.count());
// std::vector<int> chans = {0, 1, 2};
// dbg_dumpBEVToNpy(bevInference, chans);
// 3. Convert to tensor CUDA
// t1 = high_resolution_clock::now();
auto tensorOptions = torch::TensorOptions().dtype(torch::kFloat32).pinned_memory(true);
// auto tensorOptions = torch::TensorOptions().dtype(torch::kHalf).pinned_memory(true);
auto bevInferenceCUDATensor = torch::from_blob(bevInference.array,
{1, bevSettings.N_CHANNELS_BEV, bevSettings.N_ROWS_BEV, bevSettings.N_COLS_BEV},
tensorOptions).to(torch::kCUDA, true).to(torch::kFloat32);
torch::cuda::synchronize();
// t2 = high_resolution_clock::now();
// ms_int = duration_cast<milliseconds>(t2 - t1);
// tRecord.addFunctionTimes("bev->GPU", ms_int.count());
// printf("BEV to GPU time : %ld ms\n", ms_int.count());
// cout << bevInferenceCUDATensor.sizes() << endl;
// cout << bevInferenceCUDATensor.get_device() << endl;
// cout << "Is tensor on GPU? -> " << bevInferenceCUDATensor.is_cuda() << endl;
// 4. run inference and get base predictions
// auto t1 = high_resolution_clock::now();
std::vector<torch::jit::IValue> trt_inputs_ivalues;
trt_inputs_ivalues.push_back(bevInferenceCUDATensor);
auto outputs = trt_ts_mod_cnn.forward(trt_inputs_ivalues).toTuple(); // let the data stay on GPU
- Observation
Build works with this warning message:
The code finally runs and produces sensible outputs. However, accuracy in FP16 mode is much worse than in FP32. Also, I see almost no speed gain: both FP32 and FP16 run at 9-13 ms/frame.
- Previously, with Python + torch2trt (now outdated), I saw consistent accuracy and a significant speed gain (about 2x).
- Questions
- What could be the reason behind this large drop in performance?
- Is the performance affected by GPU architecture? My ultimate target is Xavier AGX
- Is performance linked to compatibility between torch-tensorrt, CUDA and CUDNN?
- Does the warning message shown above have anything to do with performance?
Best Regards,
Sambit
What could be the reason behind this large drop in performance?
It could be a number of reasons; it is hard to say for sure. It could be that casting inside the TRT network is especially slow, and precasting upstream might help. You can compare the built engines in depth using tools like trex: https://developer.nvidia.com/blog/exploring-tensorrt-engines-with-trex/
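As a sketch of the precasting idea (assuming the module was compiled with an FP16 input spec, i.e. `torchtrt.Input(..., dtype=torch.half)`, and reusing the buffer and settings names from the inference code above), the BEV blob could be cast to half on the host before the host-to-device copy, so the engine receives FP16 directly and the transfer moves half as many bytes:

```cpp
// Sketch: cast the BEV blob to FP16 on the host, then copy it to the GPU.
// Assumes the engine was built to expect an FP16 input (dtype=torch.half).
auto bevCpuFp32 = torch::from_blob(
    bevInference.array,
    {1, bevSettings.N_CHANNELS_BEV, bevSettings.N_ROWS_BEV, bevSettings.N_COLS_BEV},
    torch::TensorOptions().dtype(torch::kFloat32));

auto bevCudaHalf = bevCpuFp32.to(torch::kHalf)   // one cast on the host
                             .to(torch::kCUDA);  // copy half-precision data to the device

std::vector<torch::jit::IValue> trt_inputs_ivalues;
trt_inputs_ivalues.push_back(bevCudaHalf);
auto outputs = trt_ts_mod_cnn.forward(trt_inputs_ivalues).toTuple();
```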
Is the performance affected by GPU architecture? My ultimate target is Xavier AGX
Yes. Engines built on different hardware will perform differently. You must also build your engine on your target deployment hardware, since TensorRT tunes specifically for the GPU it is built on.
Is performance linked to compatibility between torch-tensorrt, CUDA and CUDNN?
Using library versions other than the specified ones may affect performance based on which kernels get selected, but this is usually negligible compared to other factors.
Does the warning message shown above have anything to do with performance?
This would not be the first thing I would look at for performance tuning, but if you have exhausted everything else, it is worth trying.