How to correctly format input for Fp16 inference using torch-tensorrt C++
❓ Question
What you have already tried
Hi, I am using the following code to export a TorchScript model to FP16 TensorRT, which will then be used in a C++ environment.
network.load_state_dict(torch.load(path_weights, map_location="cuda:0"))
network.eval().cuda()
dummy_input = torch.rand(1, 6, 320, 224).cuda()
network_traced = torch.jit.trace(network, dummy_input) # converting to plain torchscript
# convert/ compile to trt
compile_settings = {
"inputs": [torchtrt.Input([1, 6, 320, 224])],
"enabled_precisions": {torch.half},
"workspace_size": 6 << 22
}
trt_ts_module = torchtrt.compile(network_traced, inputs=[torchtrt.Input((1, 6, 320, 224), dtype=torch.half)],
enabled_precisions={torch.half},
workspace_size=6<<22)
torch.jit.save(trt_ts_module, trt_ts_save_path)
Is this correct?
If yes, then what is the correct way to cast the input tensor in C++? Do I need to convert it to torch::kHalf explicitly, or can the input stay as FP32?
Please let me know.
Here is my code for loading the CNN for inference:
try {
    // Deserialize the ScriptModule from a file using torch::jit::load().
    trt_ts_mod_cnn = torch::jit::load(trt_ts_module_path);
    trt_ts_mod_cnn.to(torch::kCUDA);
    cout << trt_ts_mod_cnn.type() << endl;
    cout << trt_ts_mod_cnn.dump_to_str(true, true, false) << endl;
} catch (const c10::Error& e) {
    std::cerr << "error loading the model from : " << trt_ts_module_path << std::endl;
    // return -1;
}
auto inBEVInference = torch::rand({1, bevSettings.N_CHANNELS_BEV, bevSettings.N_ROWS_BEV, bevSettings.N_COLS_BEV},
                                  {at::kCUDA}).to(torch::kFloat32);
// auto inBEVInference = torch::rand({1, bevSettings.N_CHANNELS_BEV, bevSettings.N_ROWS_BEV, bevSettings.N_COLS_BEV},
//                                   {at::kCUDA}).to(torch::kFloat16);
std::vector<torch::jit::IValue> trt_inputs_ivalues;
trt_inputs_ivalues.push_back(inBEVInference);
auto outputs = trt_ts_mod_cnn.forward(trt_inputs_ivalues).toTuple();
auto kp = outputs->elements()[0].toTensor();
auto hwl = outputs->elements()[1].toTensor();
auto rot = outputs->elements()[2].toTensor();
auto dxdy = outputs->elements()[3].toTensor();
cout << "Size KP out -> " << kp.sizes() << endl;
cout << "Size HWL out -> " << hwl.sizes() << endl;
cout << "Size ROT out -> " << rot.sizes() << endl;
cout << "Size DXDY out -> " << dxdy.sizes() << endl;
Environment
Build information about Torch-TensorRT can be found by turning on debug messages
- PyTorch Version (e.g., 1.0): 1.11.0+cu113
- CPU Architecture: x86_64
- OS (e.g., Linux): Linux, Ubuntu 20.04, docker container
- How you installed PyTorch (conda, pip, libtorch, source): pip
- Build command you used (if compiling from source):
- Are you using local sources or building from archives: local
- Python version: 3.8.10
- CUDA version: Cuda compilation tools, release 11.4, V11.4.152 (on the linux system)
- GPU models and configuration: RTX2080 MaxQ
- Any other relevant information:
The way you build your model seems fine, but it looks like you explicitly set the input to expect FP16 and then provide an FP32 tensor during inference. You can either set it to expect an FP32 input during compilation, or use the commented-out line in your C++ code to provide an FP16 input tensor.
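For illustration, here is a minimal sketch of the second option. It assumes the module was compiled with `torchtrt.Input(..., dtype=torch.half)` and reuses the variable names from your inference snippet; the input tensor is created directly as FP16 on the GPU, so no extra cast is needed at inference time.

```cpp
// Sketch: feed an FP16 (half) input to a module compiled with an FP16 input spec.
// Shapes and variable names follow the inference snippet above.
auto inBEVInference = torch::rand(
    {1, bevSettings.N_CHANNELS_BEV, bevSettings.N_ROWS_BEV, bevSettings.N_COLS_BEV},
    torch::TensorOptions().dtype(torch::kHalf).device(torch::kCUDA));

std::vector<torch::jit::IValue> trt_inputs_ivalues;
trt_inputs_ivalues.push_back(inBEVInference);
auto outputs = trt_ts_mod_cnn.forward(trt_inputs_ivalues).toTuple();
```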
Thank you for your quick response. Here is what I did:
1) Converting to torch-tensorrt via Python: FP16
- I removed the explicit dtype specification
def convert_trained_model_to_tensorrt_script_fp16(network, path_weights, trt_ts_save_path):
    """ Convert the model instance and weights to tensorrt optimized script """
    network.load_state_dict(torch.load(path_weights, map_location="cuda:0"))
    network.eval()
    dummy_input = torch.rand(1, 6, 320, 224).cuda()
    network_traced = torch.jit.trace(network, dummy_input)  # converting to plain torchscript
    # convert/ compile to trt
    compile_settings = {
        "inputs": [torchtrt.Input([1, 6, 320, 224])],
        "enabled_precisions": {torch.float, torch.half},
        "workspace_size": 5 << 22
    }
    # trt_ts_module = torchtrt.compile(network_traced, dtype=torch.float16, **compile_settings)
    # trt_ts_module = torchtrt.compile(network_traced, inputs=[torchtrt.Input((1, 6, 320, 224), dtype=torch.half)],
    #                                  enabled_precisions={torch.float, torch.half},
    #                                  workspace_size=5 << 22)
    trt_ts_module = torchtrt.compile(network_traced, inputs=[torchtrt.Input((1, 6, 320, 224))],
                                     enabled_precisions={torch.half},
                                     workspace_size=5 << 22)
    res = trt_ts_module.forward(dummy_input)
    torch.jit.save(trt_ts_module, trt_ts_save_path)
2) Inference part
void cnnRunInference(pointCloudData& pcl3d, const char* trt_ts_module_path, s_bevSettings& bevSettings,
                     predictedObjects& predObjects, timeTraceRecord& tRecord)
{
static auto RUN_COUNT = 0;
static auto CNN_INIT_DONE = 0;
static torch::jit::Module trt_ts_mod_cnn;
// static torch::Tensor bevInferenceInput = torch::zeros({1, bevSettings.N_CHANNELS_BEV, bevSettings.N_ROWS_BEV, bevSettings.N_COLS_BEV}, torch::dtype(torch::kHalf));
// 1. Check and init CNN if not done already
if(!CNN_INIT_DONE)
{
try {
// Deserialize the ScriptModule from a file using torch::jit::load().
trt_ts_mod_cnn = torch::jit::load(trt_ts_module_path);
trt_ts_mod_cnn.to(torch::kCUDA);
cout << trt_ts_mod_cnn.type() << endl;
cout << trt_ts_mod_cnn.dump_to_str(true, true, false) << endl;
} catch (const c10::Error& e) {
std::cerr << "error loading the model from : " << trt_ts_module_path << std::endl;
// return -1;
}
auto inBEVInference = torch::rand({1, bevSettings.N_CHANNELS_BEV, bevSettings.N_ROWS_BEV, bevSettings.N_COLS_BEV},
                                  {at::kCUDA}).to(torch::kFloat32);
// auto inBEVInference = torch::rand({1, bevSettings.N_CHANNELS_BEV, bevSettings.N_ROWS_BEV, bevSettings.N_COLS_BEV},
//                                   {at::kCUDA}).to(torch::kHalf);
std::vector<torch::jit::IValue> trt_inputs_ivalues;
trt_inputs_ivalues.push_back(inBEVInference);
auto outputs = trt_ts_mod_cnn.forward(trt_inputs_ivalues).toTuple();
auto kp = outputs->elements()[0].toTensor();
auto hwl = outputs->elements()[1].toTensor();
auto rot = outputs->elements()[2].toTensor();
auto dxdy = outputs->elements()[3].toTensor();
cout << "Size KP out -> " << kp.sizes() << endl;
cout << "Size HWL out -> " << hwl.sizes() << endl;
cout << "Size ROT out -> " << rot.sizes() << endl;
cout << "Size DXDY out -> " << dxdy.sizes() << endl;
torch::cuda::synchronize();
CNN_INIT_DONE = 1;
printf("=========== CNN load completed successfully ==================\n");
}
// 2. get BEV array
// auto t1 = high_resolution_clock::now();
array4DFloat bevInference(1, bevSettings.N_CHANNELS_BEV, bevSettings.N_ROWS_BEV, bevSettings.N_COLS_BEV, 0.0f);
getBevFromPointCloud(pcl3d, bevSettings, bevInference, 0, true); // returns normalized BEV
// auto t2 = high_resolution_clock::now();
// auto ms_int = duration_cast<milliseconds>(t2 - t1);
// tRecord.addFunctionTimes("bevCreation", ms_int.count());
// printf("BEV creation time : %ld ms\n", ms_int.count());
// std::vector<int> chans = {0, 1, 2};
// dbg_dumpBEVToNpy(bevInference, chans);
// 3. Convert to tensor CUDA
// t1 = high_resolution_clock::now();
auto tensorOptions = torch::TensorOptions().dtype(torch::kFloat32).pinned_memory(true);
// auto tensorOptions = torch::TensorOptions().dtype(torch::kHalf).pinned_memory(true);
auto bevInferenceCUDATensor = torch::from_blob(bevInference.array,
{1, bevSettings.N_CHANNELS_BEV, bevSettings.N_ROWS_BEV, bevSettings.N_COLS_BEV},
tensorOptions).to(torch::kCUDA, true).to(torch::kFloat32);
torch::cuda::synchronize();
// t2 = high_resolution_clock::now();
// ms_int = duration_cast<milliseconds>(t2 - t1);
// tRecord.addFunctionTimes("bev->GPU", ms_int.count());
// printf("BEV to GPU time : %ld ms\n", ms_int.count());
// cout << bevInferenceCUDATensor.sizes() << endl;
// cout << bevInferenceCUDATensor.get_device() << endl;
// cout << "Is tensor on GPU? -> " << bevInferenceCUDATensor.is_cuda() << endl;
// 4. run inference and get base predictions
// auto t1 = high_resolution_clock::now();
std::vector<torch::jit::IValue> trt_inputs_ivalues;
trt_inputs_ivalues.push_back(bevInferenceCUDATensor);
auto outputs = trt_ts_mod_cnn.forward(trt_inputs_ivalues).toTuple(); // let the data stay on GPU
- Observation
Build works with this warning message:
The code finally runs and produces sensible outputs. However, accuracy in FP16 mode is much worse than in FP32. Also, I see almost no speed gain: both FP32 and FP16 run at 9-13 ms/frame.
- Previously, with Python + torch2trt (now outdated), I saw consistent accuracy and a significant speed gain (about 2x).
- Questions
- What could be the reason behind this large drop in performance?
- Is the performance affected by GPU architecture? My ultimate target is Xavier AGX
- Is performance linked to compatibility between torch-tensorrt, CUDA and CUDNN?
- Does the warning message shown above have anything to do with performance?
Best Regards,
Sambit
What could be the reason behind this large drop in performance?
It could be a number of reasons; it is hard to say for sure. It could be that casting inside the TRT network is especially slow, and precasting upstream might help. You can compare the built engines in depth using tools like trex: https://developer.nvidia.com/blog/exploring-tensorrt-engines-with-trex/
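As a sketch of the precasting idea (assuming the module was compiled with an FP16 input spec, i.e. `torchtrt.Input(..., dtype=torch.half)`, and reusing the buffer and settings names from the inference code above), the BEV blob could be cast to half on the host before the host-to-device copy, so the engine receives FP16 directly and the transfer moves half as many bytes:

```cpp
// Sketch: cast the BEV blob to FP16 on the host, then copy it to the GPU.
// Assumes the engine was built to expect an FP16 input (dtype=torch.half).
auto bevCpuFp32 = torch::from_blob(
    bevInference.array,
    {1, bevSettings.N_CHANNELS_BEV, bevSettings.N_ROWS_BEV, bevSettings.N_COLS_BEV},
    torch::TensorOptions().dtype(torch::kFloat32));

auto bevCudaHalf = bevCpuFp32.to(torch::kHalf)   // one cast on the host
                             .to(torch::kCUDA);  // copy half-precision data to the device

std::vector<torch::jit::IValue> trt_inputs_ivalues;
trt_inputs_ivalues.push_back(bevCudaHalf);
auto outputs = trt_ts_mod_cnn.forward(trt_inputs_ivalues).toTuple();
```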
Is the performance affected by GPU architecture? My ultimate target is Xavier AGX
Yes. Engines built on different hardware will perform differently. You must also build your engine on your target deployment hardware, since TensorRT tunes specifically for the GPU it is built on.
Is performance linked to compatibility between torch-tensorrt, CUDA and CUDNN?
Using library versions other than the specified ones may affect performance based on which kernels get selected, but this is usually negligible compared to other factors.
Does the warning message shown above have anything to do with performance?
This would not be the first thing I would look at for performance tuning, but if you have exhausted everything else, it is worth trying.