
standalone mode bug for dla

ItIsFriday opened this issue 3 years ago · 3 comments

Description

Serializing a network for NVDLA standalone mode fails with the following crash log:

DLA Cores: 2
4: [standardEngineBuilder.cpp::engineValidationForSafeDLAMode::1519] Error Code 4: Internal Error (Safe DLA is enabled but not all layers are running on DLA.)
Segmentation fault (core dumped)

Environment

TensorRT Version: 8.2
NVIDIA GPU: Drive Orin
NVIDIA Driver Version:
CUDA Version: 11.4
CUDNN Version: 8.3.2
Operating System: Ubuntu
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):

Relevant Files

Steps To Reproduce

#include <iostream>
#include <cassert>
#include <fstream>

#include "NvInfer.h"

using namespace nvinfer1;

class Logger : public ILogger {
  void log(Severity severity, const char* msg) noexcept override {
    // suppress info-level messages
    if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
  }
} logger;

int main() {
  IRuntime* rt = createInferRuntime(logger);
  std::cout << "DLA Cores: " << rt->getNbDLACores() << std::endl;

  IBuilder* builder = createInferBuilder(logger);
  builder->setMaxBatchSize(1);
  IBuilderConfig* config = builder->createBuilderConfig();
  config->setFlag(BuilderFlag::kFP16);
  config->setMaxWorkspaceSize(1024 * 1024 * 1024);
  config->setDefaultDeviceType(DeviceType::kDLA);
  config->setDLACore(0);
  // config->setFlag(BuilderFlag::kGPU_FALLBACK);
  config->setEngineCapability(EngineCapability::kDLA_STANDALONE);
  INetworkDefinition* network = builder->createNetworkV2(0);
  Dims32 dim32{4, {1, 32, 32, 32}};
  ITensor *input = network->addInput("input", DataType::kHALF, dim32);
  input->setAllowedFormats(TensorFormats(TensorFormat::kCHW16));

  ILayer *relu = network->addActivation(*input, ActivationType::kRELU);
  // ILayer *relu = network->addUnary(*input, UnaryOperation::kABS);
  relu->setName("relu");
  // relu->setOutputType(0, DataType::kHALF);
  assert(config->canRunOnDLA(relu));

  ITensor *output = relu->getOutput(0);
  output->setName("output");
  network->markOutput(*output);
  output->setType(DataType::kHALF);
  output->setAllowedFormats(TensorFormats(TensorFormat::kCHW16));

  ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
  // NOTE: buildEngineWithConfig() returns nullptr on failure, so the
  // serialize() call below is the likely source of the segfault.
  IHostMemory *serialized = engine->serialize();
  assert(serialized);
  std::ofstream p("relu.trt", std::ios::binary);
  p.write((const char *)serialized->data(), serialized->size());

  // IHostMemory *plan = builder->buildSerializedNetwork(*network, *config);

  return 0;
}

ItIsFriday avatar Jul 29 '22 10:07 ItIsFriday

If you enable verbose logging:

class Logger : public ILogger {
  void log(Severity severity, const char* msg) noexcept override {
    std::cout << msg << std::endl;
  }
} logger;

you should see log lines like:

>>>>>>>>>>>>>>> Chose Runner Type: DLA Tactic: 0x0000000000000003
Adding reformat layer: Reformatted Input Tensor 0 to {ForeignNode[relu]} (input) from Half(4096,4096,1:8,128,4) to Half(32768,32768,1024,32,1)
Adding reformat layer: Reformatted Output Tensor 0 to {ForeignNode[relu]} (output) from Half(2048,2048,1024:16,32,1) to Half(4096,4096,1:8,128,4)
Formats and tactics selection completed in 0.0722914 seconds.
After reformat layers: 3 layers
Pre-optimized block assignment.
Block size 1073741824
Total Activation Memory: 1073741824
Detected 2 NvMedia tensors.
Layer: Reformatting CopyNode for Input Tensor 0 to {ForeignNode[relu]} Host Persistent: 0 Device Persistent: 0 Scratch Memory: 0
Layer: {ForeignNode[relu]} Host Persistent: 848 Device Persistent: 0 Scratch Memory: 0
Layer: Reformatting CopyNode for Output Tensor 0 to {ForeignNode[relu]} Host Persistent: 0 Device Persistent: 0 Scratch Memory: 0

This means TRT still inserts two reformat layers into the engine (which run on GPU, not DLA), and that's why you see the error.

zerollzeng avatar Jul 30 '22 03:07 zerollzeng

To solve this, you need to enable all formats allowed for DLA FP16 on the input and the output tensors. Refer to https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#restrictions-with-dla

Modify your code as follows:

...
input->setAllowedFormats(TensorFormats(
    1U << static_cast<int>(TensorFormat::kCHW16) |
    1U << static_cast<int>(TensorFormat::kDLA_HWC4) |
    1U << static_cast<int>(TensorFormat::kDLA_LINEAR)));
...
output->setAllowedFormats(TensorFormats(
    1U << static_cast<int>(TensorFormat::kCHW16) |
    1U << static_cast<int>(TensorFormat::kDLA_LINEAR)));
...

zerollzeng avatar Jul 30 '22 03:07 zerollzeng

@zerollzeng thanks for your answer, it works for me!

ItIsFriday avatar Jul 30 '22 07:07 ItIsFriday

Closing since there has been no activity for more than 14 days. Please reopen if you still have questions, thanks!

ttyio avatar Dec 12 '22 07:12 ttyio