Chi Lo
Can you rebase it onto main? main now has all the CI settings to run TRT 10.
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX...
/azp run Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows ARM64 QNN CI Pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed, ONNX Runtime React Native CI Pipeline, Windows...
/azp run Linux MIGraphX CI Pipeline, orttraining-amd-gpu-ci-pipeline
> Could you please post all the CMake options that were given to CUTLASS? Without the list of CMake options, it will be a lot harder for us to try...
The onnx-tensorrt parser [filters out](https://github.com/onnx/onnx-tensorrt/blob/main/ModelImporter.cpp#L377) `NonMaxSuppression`, `NonZero`, and `RoiAlign`, which is why you saw those nodes placed on the CUDA/CPU EP. I also think the many memcpys between CPU/GPU cause this...
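For reference, a rough way to count how many Memcpy nodes ORT inserts when nodes fall back to CUDA/CPU is to dump the optimized graph. Untested sketch; the file paths are placeholders:

```python
# Dump the graph ORT actually runs, then count the MemcpyToHost/MemcpyFromHost
# nodes that get inserted at EP boundaries.
import onnx
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.optimized_model_filepath = "model_optimized.onnx"  # placeholder path

ort.InferenceSession(
    "model.onnx",  # placeholder path
    sess_options,
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)

m = onnx.load("model_optimized.onnx")
memcpy_nodes = [n for n in m.graph.node if n.op_type.startswith("Memcpy")]
print(f"{len(memcpy_nodes)} Memcpy nodes in the optimized graph")
```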
TensorRT provides an [nmsPlugin](https://github.com/NVIDIA/TensorRT/tree/release/8.6/plugin/nmsPlugin) and a [roiAlignPlugin](https://github.com/NVIDIA/TensorRT/tree/release/8.6/plugin/roiAlignPlugin). Perhaps we can replace the ONNX NonMaxSuppression and RoiAlign nodes with those two TRT plugins and see what the latency is?
Here are the steps to build the OSS onnx-tensorrt parser so that it does not filter out those operators:
1. Add `--use_tensorrt_oss_parser` as one of the ORT build arguments and start building.
2. At...
I think we can try the TRT plugins. Please see the doc [here](https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#tensorrt-plugins-support). You need to modify the graph and replace `RoiAlign` and `NonMaxSuppression` with the custom ops that will...
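Something like this untested sketch shows the mechanics of that graph edit. The plugin op names `EfficientNMS_TRT` / `ROIAlign_TRT` and the `trt.plugins` domain are assumptions based on the linked doc, and since the plugins take different inputs/attributes than the ONNX ops, the surrounding graph would also need adjusting:

```python
# Swap the ONNX ops for TRT plugin custom ops under the "trt.plugins" domain.
import onnx

model = onnx.load("model.onnx")  # placeholder path

# Assumed plugin op names; a 1:1 op_type swap is only the starting point.
replacements = {
    "NonMaxSuppression": "EfficientNMS_TRT",
    "RoiAlign": "ROIAlign_TRT",
}

for node in model.graph.node:
    if node.op_type in replacements:
        node.op_type = replacements[node.op_type]
        node.domain = "trt.plugins"  # custom-op domain from the TRT EP plugin doc

# Declare the custom domain so the model stays loadable.
model.opset_import.append(onnx.helper.make_opsetid("trt.plugins", 1))
onnx.save(model, "model_trt_plugins.onnx")
```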
@datinje Last time, Nvidia mentioned that NMS and NonZero are natively supported only via `enqueueV3` (TRT EP currently uses `enqueueV2`). I am currently working on a [dev branch](https://github.com/microsoft/onnxruntime/tree/chi/trt_enqueue_v3) to...