Can't run network with INonZeroLayer on TensorRT 8.6.1.6 and GPU NVIDIA GeForce RTX 3060
Description
- Create a TensorRT network with an INonZeroLayer from scratch and save the serialized TensorRT engine.
- Deserialize the CUDA engine and try to create an execution context.
- createExecutionContext() returns nullptr.
Code:
#include <NvInfer.h>
#include "cuda_runtime_api.h"
#include <iostream>
#include <fstream>
#include <assert.h>

using namespace nvinfer1;

// Forward all TensorRT log messages to stdout.
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        std::cout << msg << std::endl;
    }
} logger;

int main()
{
    // Build phase: network with a single INonZeroLayer, serialized to disk.
    {
        IBuilder* builder = createInferBuilder(logger);
        INetworkDefinition* network = builder->createNetworkV2(
            1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));

        Dims dims;
        dims.nbDims = 1;
        dims.d[0] = 32;
        ITensor& input = *network->addInput("input", DataType::kINT32, dims);

        auto nzLayer = network->addNonZero(input);
        ITensor& output = *nzLayer->getOutput(0);
        output.setName("output");
        network->markOutput(output);

        IBuilderConfig* config = builder->createBuilderConfig();
        config->setMaxWorkspaceSize(1 << 20);
        ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);

        std::cout << "Serializing model" << std::endl;
        IHostMemory* serializedModel = engine->serialize();
        std::cout << "Model serialized" << std::endl;

        std::ofstream p("../dynamic.engine", std::ios::binary);
        if (!p)
        {
            std::cerr << "could not create engine file" << std::endl;
            return -1;
        }
        p.write(reinterpret_cast<const char*>(serializedModel->data()), serializedModel->size());
        std::cout << "Engine file written" << std::endl;

        delete network;
        delete config;
        delete builder;
        delete serializedModel;
        delete engine;
    }

    // Runtime phase: deserialize the engine and create an execution context.
    std::ifstream file("../dynamic.engine", std::ios::binary);
    if (!file.good())
    {
        std::cout << "Engine is not good" << std::endl;
    }

    IRuntime* runtime = createInferRuntime(logger);
    assert(runtime);

    file.seekg(0, file.end);
    auto size = file.tellg();
    file.seekg(0, file.beg);
    auto trt_model_stream = new char[size];
    file.read(trt_model_stream, size);
    file.close();

    ICudaEngine* engine = runtime->deserializeCudaEngine(trt_model_stream, size);
    assert(engine);
    delete[] trt_model_stream;

    // Fails here: createExecutionContext() returns nullptr, so the assert fires.
    IExecutionContext* context = engine->createExecutionContext();
    assert(context);

    return 0;
}
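For context, the plan was to run the engine with enqueueV3 once the context exists. Because NonZero produces a data-dependent output shape, the output buffer would have to be supplied through an IOutputAllocator. The snippet below is only a rough sketch of that runtime side under the TensorRT 8.5+/8.6 API, continuing from the program above (same includes, same "context" variable); the class name NonZeroOutputAllocator and the buffer handling are illustrative, not part of the failing program, and none of it is reached because the context is never created.

// Illustrative output allocator for the data-dependent NonZero output.
// (Sketch only; not part of the repro above.)
class NonZeroOutputAllocator : public IOutputAllocator
{
public:
    void* reallocateOutput(char const* /*tensorName*/, void* currentMemory,
                           uint64_t size, uint64_t /*alignment*/) noexcept override
    {
        // TensorRT asks for at least `size` bytes for the output tensor.
        cudaFree(currentMemory);
        void* mem{nullptr};
        cudaMalloc(&mem, size);
        outputPtr = mem;
        return mem;
    }

    void notifyShape(char const* /*tensorName*/, Dims const& dims) noexcept override
    {
        // Final output shape, e.g. [1, numNonZeroElements].
        outputDims = dims;
    }

    void* outputPtr{nullptr};
    Dims outputDims{};
};

// ... after createExecutionContext() succeeds:
int32_t hostInput[32] = {0, 3, 0, 7};  // example data; remaining elements are zero
void* deviceInput{nullptr};
cudaMalloc(&deviceInput, sizeof(hostInput));
cudaMemcpy(deviceInput, hostInput, sizeof(hostInput), cudaMemcpyHostToDevice);

NonZeroOutputAllocator allocator;
context->setInputTensorAddress("input", deviceInput);
context->setOutputAllocator("output", &allocator);

cudaStream_t stream;
cudaStreamCreate(&stream);
context->enqueueV3(stream);
cudaStreamSynchronize(stream);
// allocator.outputDims / allocator.outputPtr now describe the NonZero result.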
Trace:
[MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 15, GPU 641 (MiB)
Trying to load shared library libnvinfer_builder_resource.so.8.6.1
Loaded shared library libnvinfer_builder_resource.so.8.6.1
[MemUsageChange] Init builder kernel library: CPU +1449, GPU +252, now: CPU 1541, GPU 879 (MiB)
CUDA lazy loading is enabled.
Original: 2 layers
After dead-layer removal: 2 layers
Graph construction completed in 0.000297353 seconds.
After Myelin optimization: 2 layers
Applying ScaleNodes fusions.
After scale fusion: 2 layers
After dupe layer removal: 2 layers
After final dead-layer removal: 2 layers
After tensor merging: 2 layers
After vertical fusions: 2 layers
After dupe layer removal: 2 layers
After final dead-layer removal: 2 layers
After tensor merging: 2 layers
After slice removal: 2 layers
After concat removal: 2 layers
Trying to split Reshape and strided tensor
Graph optimization time: 0.000145561 seconds.
Building graph using backend strategy 2
Local timing cache in use. Profiling results in this builder pass will not be stored.
Constructing optimization profile number 0 [1/1].
Applying generic optimizations to the graph for inference.
Reserving memory for host IO tensors. Host: 0 bytes
=============== Computing costs for (Unnamed Layer* 0) [NonZero]
*************** Autotuning format combination: Int32(1) -> Int32((# 0 (VALUE (Unnamed Layer* 0) [NonZero][size])),1), Int32() ***************
--------------- Timing Runner: (Unnamed Layer* 0) [NonZero] (NonZero[0x80000033])
Tactic: 0x0000000000000000 Time: 0.0134912
(Unnamed Layer* 0) [NonZero] (NonZero[0x80000033]) profiling completed in 0.00521287 seconds. Fastest Tactic: 0x0000000000000000 Time: 0.0134912
>>>>>>>>>>>>>>> Chose Runner Type: NonZero Tactic: 0x0000000000000000
=============== Computing costs for (Unnamed Layer* 0) [NonZero][size][DevicetoShapeHostCopy]
*************** Autotuning format combination: Int32() -> ***************
=============== Computing reformatting costs
=============== Computing reformatting costs
=============== Computing reformatting costs
=============== Computing reformatting costs
Formats and tactics selection completed in 0.00554525 seconds.
After reformat layers: 2 layers
Total number of blocks in pre-optimized block assignment: 2
Detected 1 inputs and 1 output network tensors.
Layer: (Unnamed Layer* 0) [NonZero] Host Persistent: 0 Device Persistent: 0 Scratch Memory: 771
Skipped printing memory information for 1 layers with 0 memory size i.e. Host Persistent + Device Persistent + Scratch Memory == 0.
Total Host Persistent Memory: 0
Total Device Persistent Memory: 0
Total Scratch Memory: 771
[MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 4 MiB
[BlockAssignment] Started assigning block shifts. This will take 2 steps to complete.
[BlockAssignment] Algorithm ShiftNTopDown took 0.007542ms to assign 2 blocks to 2 nodes requiring 1536 bytes.
Total number of blocks in optimized block assignment: 2
Total Activation Memory: 1536
Total number of generated kernels selected for the engine: 0
Disabling unused tactic source: EDGE_MASK_CONVOLUTIONS
Disabling unused tactic source: JIT_CONVOLUTIONS
Engine generation completed in 0.00749379 seconds.
Engine Layer Information:
Layer(NonZero): (Unnamed Layer* 0) [NonZero], Tactic: 0x0000000000000000, input (Int32[32]) -> output (Int32[1,-1]), (Unnamed Layer* 0) [NonZero][size] (Int32[])
Layer(DeviceToShapeHost): (Unnamed Layer* 0) [NonZero][size][DevicetoShapeHostCopy], Tactic: 0x0000000000000000, (Unnamed Layer* 0) [NonZero][size] (Int32[]) ->
[MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
Serializing model
Adding 1 engine(s) to plan file.
Model serialized
Engine file written
Loaded engine size: 0 MiB
Deserialization required 300 microseconds.
[MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
Total per-runner device persistent memory is 0
Total per-runner host persistent memory is 0
Allocated activation device memory of size 1536
1: Unexpected exception vector<bool>::_M_range_check: __n (which is 0) >= this->size() (which is 0)
spconv_deploy: /home/[email protected]/Projects/spconv_deploy/spconv_deploy.cpp:87: int main(): Assertion `context' failed.
Aborted (core dumped)
Environment
TensorRT Version: 8.6.1.6
GPU: NVIDIA GeForce RTX 3060
CUDA Version: 11.1
cuDNN Version: 8.9.0.131
OS: Ubuntu 20.04
Checking
Closing this as a duplicate of https://github.com/NVIDIA/TensorRT/issues/3335, thanks!