
Error while inferencing a TensorRT model on Triton Server

Open ManoharMarri opened this issue 3 years ago • 18 comments

I am getting the error below while running inference on a TensorRT model on a Triton Server hosted on Google Kubernetes Engine.

what():  Assertion mUsedAllocators.find(alloc) != mUsedAllocators.end() && "Myelin free callback called with invalid MyelinAllocator" failed.

This seems to be a TensorRT-related error.

Can you please help resolve it?

ManoharMarri avatar Jul 07 '22 12:07 ManoharMarri

Do you have the model that can reproduce this error?

zerollzeng avatar Jul 07 '22 13:07 zerollzeng

Hello, I am hitting the same error:

terminate called after throwing an instance of 'nvinfer1::InternalError'
  what():  Assertion mUsedAllocators.find(alloc) != mUsedAllocators.end() && "Myelin free callback called with invalid MyelinAllocator" failed. 
Aborted (core dumped)

when running the trtexec tool (from https://github.com/NVIDIA/TensorRT/tree/main/samples/trtexec) in the following environment: Ubuntu 20.04, with CUDA as in https://gitlab.com/nvidia/container-images/cuda/-/tree/master/dist/11.6.2/ubuntu2004 and TensorRT as in https://github.com/NVIDIA/TensorRT/blob/release/8.4/docker/ubuntu-20.04.Dockerfile

NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6

Running on GPUs:

  • NVIDIA GeForce GTX 1650 Ti
  • NVIDIA GeForce RTX 3090
  • NVIDIA RTX A5000

Command: trtexec --onnx=model.onnx --verbose --device=0 --iterations=4096 --streams=4 --threads

So, 4 streams (4 CUDA execution contexts) run in 4 separate threads; see https://github.com/NVIDIA/TensorRT/blob/main/samples/common/sampleInference.cpp#L904-L906

// When multiple streams are used, trtexec can run inference in two modes:
// (1) if inference.threads is true, then run each stream on each thread.
// (2) if inference.threads is false, then run all streams on the same thread.

Interestingly, the error appears only in --threads mode; running trtexec without --threads produces no error. With N streams and no --threads, all streams run on a single thread and there is no error; with --threads enabled, each stream gets its own thread and the error appears.
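
To make the two layouts concrete, here is a minimal, purely illustrative Python sketch of the difference (enqueue_async below is a hypothetical stub standing in for one asynchronous execute_async_v2 call on that stream's own IExecutionContext and CUDA stream; this is not trtexec's actual code):

import threading
import time

def enqueue_async(stream_id):
    # Hypothetical stand-in for an asynchronous enqueue (execute_async_v2)
    # on a dedicated IExecutionContext + CUDA stream.
    time.sleep(0.001)

def run_with_threads(num_streams, iterations):
    # --threads: one worker thread per stream; each thread loops over its own stream.
    def worker(stream_id):
        for _ in range(iterations):
            enqueue_async(stream_id)
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_streams)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def run_single_thread(num_streams, iterations):
    # no --threads: a single thread interleaves the enqueues over all streams.
    for _ in range(iterations):
        for stream_id in range(num_streams):
            enqueue_async(stream_id)

run_with_threads(4, 10)     # the layout in which the crash is observed
run_single_thread(4, 10)    # the layout reported to run without the error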

The trtexec tool was built in Debug mode (gcc-9); gdb backtrace:

Thread 10 "trtexec" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff827fc000 (LWP 537320)]
0x00007fffdf47a5c0 in ?? () from /lib/x86_64-linux-gnu/libnvinfer.so.8
(gdb) bt
#0  0x00007fffdf47a5c0 in ?? () from /lib/x86_64-linux-gnu/libnvinfer.so.8
#1  0x00007fffdfe48208 in ?? () from /lib/x86_64-linux-gnu/libnvinfer.so.8
#2  0x00007fffdfd7310b in ?? () from /lib/x86_64-linux-gnu/libnvinfer.so.8
#3  0x00007fffdfd383bd in ?? () from /lib/x86_64-linux-gnu/libnvinfer.so.8
#4  0x00007fffdfd88042 in ?? () from /lib/x86_64-linux-gnu/libnvinfer.so.8
#5  0x00007fffdf479f2d in ?? () from /lib/x86_64-linux-gnu/libnvinfer.so.8
#6  0x00007fffdf47e6b9 in ?? () from /lib/x86_64-linux-gnu/libnvinfer.so.8
#7  0x00007fffdfb64bc1 in ?? () from /lib/x86_64-linux-gnu/libnvinfer.so.8
#8  0x00007fffdfb6538c in ?? () from /lib/x86_64-linux-gnu/libnvinfer.so.8
#9  0x000055555569eb30 in nvinfer1::IExecutionContext::enqueueV2 (this=0x5555745e2c00, bindings=0x555573e56b20, stream=0x7fff0c000e40, inputConsumed=0x0)
    at /usr/include/x86_64-linux-gnu/NvInferRuntime.h:2358
#10 0x000055555568e9ed in sample::(anonymous namespace)::EnqueueExplicit::operator() (this=0x7fff0c000c78, stream=...)
    at /source/trtexec/common/sampleInference.cpp:428
#11 0x0000555555697c37 in std::_Function_handler<bool(sample::TrtCudaStream&), sample::(anonymous namespace)::EnqueueExplicit>::_M_invoke(const std::_Any_data &, sample::TrtCudaStream &) (
    __functor=..., __args#0=...) at /usr/include/c++/9/bits/std_function.h:285
#12 0x00005555556b88a1 in std::function<bool (sample::TrtCudaStream&)>::operator()(sample::TrtCudaStream&) const (this=0x7fff0c000c78, __args#0=...) at /usr/include/c++/9/bits/std_function.h:688
#13 0x0000555555694daa in sample::(anonymous namespace)::Iteration<nvinfer1::IExecutionContext>::query (this=0x7fff0c000c60, skipTransfers=false)
    at /source/trtexec/common/sampleInference.cpp:604
#14 0x00005555556932f4 in sample::(anonymous namespace)::inferenceLoop<nvinfer1::IExecutionContext> (iStreams=std::vector of length 1, capacity 1 = {...}, cpuStart=..., gpuStart=..., iterations=4096, 
    maxDurationMs=3200, warmupMs=200, trace=std::vector of length 44, capacity 64 = {...}, skipTransfers=false, idleMs=0)
    at /source/trtexec/common/sampleInference.cpp:820
#15 0x000055555569190c in sample::(anonymous namespace)::inferenceExecution<nvinfer1::IExecutionContext> (inference=..., iEnv=..., sync=..., threadIdx=3, streamsPerThread=1, device=0, 
    trace=std::vector of length 0, capacity 0, log=...) at /source/trtexec/common/sampleInference.cpp:878
#16 0x000055555569df1a in std::__invoke_impl<void, void (*)(sample::InferenceOptions const&, sample::InferenceEnvironment&, sample::(anonymous namespace)::SyncStruct&, int, int, int, std::vector<sample::InferenceTrace>&, sdk::log_t&), std::reference_wrapper<sample::InferenceOptions const>, std::reference_wrapper<sample::InferenceEnvironment>, std::reference_wrapper<sample::(anonymous namespace)::SyncStruct>, int, int, int, std::reference_wrapper<std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> > >, std::reference_wrapper<sdk::log_t> > (
    __f=@0x55556a0d8160: 0x555555691653 <sample::(anonymous namespace)::inferenceExecution<nvinfer1::IExecutionContext>(sample::InferenceOptions const&, sample::InferenceEnvironment&, sample::(anonymous namespace)::SyncStruct&, int32_t, int32_t, int32_t, std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> >&, sdk::log_t&)>) at /usr/include/c++/9/bits/invoke.h:60
#17 0x000055555569dd0f in std::__invoke<void (*)(sample::InferenceOptions const&, sample::InferenceEnvironment&, sample::(anonymous namespace)::SyncStruct&, int, int, int, std::vector<sample::InferenceTrace>&, sdk::log_t&), std::reference_wrapper<sample::InferenceOptions const>, std::reference_wrapper<sample::InferenceEnvironment>, std::reference_wrapper<sample::(anonymous namespace)::SyncStruct>, int, int, int, std::reference_wrapper<std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> > >, std::reference_wrapper<sdk::log_t> > (
    __fn=@0x55556a0d8160: 0x555555691653 <sample::(anonymous namespace)::inferenceExecution<nvinfer1::IExecutionContext>(sample::InferenceOptions const&, sample::InferenceEnvironment&, sample::(anonymous namespace)::SyncStruct&, int32_t, int32_t, int32_t, std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> >&, sdk::log_t&)>) at /usr/include/c++/9/bits/invoke.h:95
#18 0x000055555569daeb in std::thread::_Invoker<std::tuple<void (*)(sample::InferenceOptions const&, sample::InferenceEnvironment&, sample::(anonymous namespace)::SyncStruct&, int, int, int, std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> >&, sdk::log_t&), std::reference_wrapper<sample::InferenceOptions const>, std::reference_wrapper<sample::InferenceEnvironment>, std::reference_wrapper<sample::(anonymous namespace)::SyncStruct>, int, int, int, std::reference_wrapper<std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> > >, std::reference_wrapper<sdk::log_t> > >::_M_invoke<0, 1, 2, 3, 4, 5, 6, 7, 8> (this=0x55556a0d8128) at /usr/include/c++/9/thread:244
#19 0x000055555569d9e2 in std::thread::_Invoker<std::tuple<void (*)(sample::InferenceOptions const&, sample::InferenceEnvironment&, sample::(anonymous namespace)::SyncStruct&, int, int, int, std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> >&, sdk::log_t&), std::reference_wrapper<sample::InferenceOptions const>, std::reference_wrapper<sample::InferenceEnvironment>, std::reference_wrapper<sample::(anonymous namespace)::SyncStruct>, int, int, int, std::reference_wrapper<std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> > >, std::reference_wrapper<sdk::log_t> > >::operator() (this=0x55556a0d8128) at /usr/include/c++/9/thread:251
#20 0x000055555569d9c6 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(sample::InferenceOptions const&, sample::InferenceEnvironment&, sample::(anonymous namespace)::SyncStruct&, int, int, int, std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> >&, sdk::log_t&), std::reference_wrapper<sample::InferenceOptions const>, std::reference_wrapper<sample::InferenceEnvironment>, std::reference_wrapper<sample::(anonymous namespace)::SyncStruct>, int, int, int, std::reference_wrapper<std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> > >, std::reference_wrapper<sdk::log_t> > > >::_M_run (this=0x55556a0d8120) at /usr/include/c++/9/thread:195
#21 0x00007fffdb425de4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#22 0x00007ffff7f83609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#23 0x00007fffdb112133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Our development team has two similar models with the Swin architecture, and both of them hit this error at runtime.

GremSnoort avatar Jul 12 '22 17:07 GremSnoort

@nvpohanh Looks like a real bug here.

zerollzeng avatar Jul 13 '22 12:07 zerollzeng

Hi,

I am experiencing similar issues in Python when using TensorRT (8.4.1.5) in a multi-threaded fashion. Running inference in a single thread works just fine; however, multiple threads fail with:

terminate called after throwing an instance of 'nvinfer1::InternalError'
  what():  Assertion mUsedAllocators.find(alloc) != mUsedAllocators.end() && "Myelin free callback called with invalid MyelinAllocator" failed.

Interestingly, the first couple of inference steps work; e.g. with 4 threads I usually get through ~200 steps before the abort happens.

nkiehne avatar Jul 17 '22 15:07 nkiehne

Same error, same behaviour as described before. It's a blocker for our project; is it possible to fix this?

napest avatar Jul 19 '22 18:07 napest

Can you provide an ONNX model or sample that can reproduce this error? Then I can file an internal bug to track it.

zerollzeng avatar Jul 21 '22 07:07 zerollzeng

Can you provide an ONNX model or sample that can reproduce this error? Then I can file an internal bug to track it.

https://drive.google.com/drive/u/0/folders/1GyO9rtMkafBahOz3zn9yEOJ2fPooxd6i - link to onnx model

Dantes4u avatar Jul 21 '22 12:07 Dantes4u

Here is some more insight into the error from the Python side of things. Question: should I rather open a new issue?

Steps to reproduce: Download ONNX model (distilgpt2) here: https://drive.google.com/file/d/1hrBClBCCRg_UR4bgbPa7lKXfmrxf7HG3/view?usp=sharing

Build engine with trtexec:

~/TensorRT-8.4.1.5/bin/trtexec --onnx=model.onnx --optShapes=input_ids:1x128 --minShapes=input_ids:1x1 --maxShapes=input_ids:1x512 --saveEngine=model_fp32.plan --buildOnly
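
For reference, a roughly equivalent engine build using the TensorRT Python API might look like the sketch below (this is an illustration, not part of the original report; the input tensor name input_ids and the min/opt/max shapes are taken from the trtexec command above):

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the ONNX model into the TensorRT network definition.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse model.onnx")

# Dynamic-shape optimization profile matching the trtexec shape flags.
config = builder.create_builder_config()
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", (1, 1), (1, 128), (1, 512))  # min, opt, max
config.add_optimization_profile(profile)

# Build and save the serialized engine.
serialized = builder.build_serialized_network(network, config)
if serialized is None:
    raise RuntimeError("Engine build failed")
with open("model_fp32.plan", "wb") as f:
    f.write(serialized)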

Here is some minimal code for the Thread object:

import logging
import time
from threading import Thread

import tensorrt as trt
from tensorrt import ICudaEngine
from tensorrt.tensorrt import Logger, Runtime
import numpy as np
from tqdm.auto import tqdm

from pycuda import driver
from pycuda import autoinit

class TrtModel(Thread):
    def __init__(self, engine, device=0, runtime=None):
        super().__init__(daemon=True)
        self.device = device
        self.engine = engine

        if runtime is None:
            self.logger =  trt.Logger(trt.Logger.INFO)
            self.runtime = trt.Runtime(self.logger)
        else:
            self.runtime = runtime
            self.logger = runtime.logger


    def run(self):
        dev = driver.Device(self.device)
        self._ctx = dev.make_context()

        self.stream = driver.Stream()
        with open(self.engine, "rb") as f:
            self.engine = self.runtime.deserialize_cuda_engine(f.read())
        print(self, "Engine read")
        self.context = self.engine.create_execution_context()
        print(self, "Context created")

        # add dummy data
        self.context.set_binding_shape(0, (1,95))

        inputs_cpu = driver.pagelocked_empty(1*95, np.int32)
        inputs_gpu = driver.mem_alloc(inputs_cpu.nbytes)

        outputs_cpu = driver.pagelocked_empty(1*95*50272, np.float32)
        outputs_gpu = driver.mem_alloc(outputs_cpu.nbytes)

        bindings = [int(inputs_gpu), int(outputs_gpu)]

        # run dummy inference
        # Note: disabling memcpy does not do anything
        for i in tqdm(range(10000)):
            self._ctx.push()
            driver.memcpy_htod_async(inputs_gpu, inputs_cpu, self.stream)

            r = self.context.execute_async_v2(bindings, self.stream.handle)

            driver.memcpy_dtoh_async(outputs_cpu, outputs_gpu, self.stream)
            self.stream.synchronize()
            self._ctx.pop()
        
        
        print(self, "cleaning up")
        # we need to manually free every gpu resources before detaching pycuda context
        # Somehow this is an issue...
        inputs_gpu.free()
        outputs_gpu.free()
        del self.context, self.engine
        self._ctx.detach()
        print("done")

Invoked like this:

trt_logger: Logger = trt.Logger(trt.Logger.INFO)
runtime: Runtime = trt.Runtime(trt_logger)

num_gpus = 8
engine_path = "data/models/distilgpt2/model_fp32.plan"
engines = [TrtModel(engine_path, device=i, runtime=runtime) for i in range(num_gpus)]


for x in engines:
    x.start()

Make sure to run on at least 4 GPUs; otherwise it will take some time until the kernel dies. E.g. with 2 GPUs the 10k inference steps ran just fine, whereas 4 GPUs crashed ~8k steps in and 8 GPUs made it to about 1.5k...

nkiehne avatar Jul 21 '22 12:07 nkiehne

@Dantes4u How can I reproduce it? I can't reproduce it using TRT 8.4.1.5

[07/22/2022-08:01:17] [I] === Performance summary ===
[07/22/2022-08:01:17] [I] Throughput: 546.168 qps
[07/22/2022-08:01:17] [I] Latency: min = 2.55664 ms, max = 9.1418 ms, mean = 7.03175 ms, median = 7.40112 ms, percentile(99%) = 7.92065 ms
[07/22/2022-08:01:17] [I] Enqueue Time: min = 2.54883 ms, max = 9.07617 ms, mean = 7.00071 ms, median = 7.36902 ms, percentile(99%) = 7.88477 ms
[07/22/2022-08:01:17] [I] H2D Latency: min = 0.0527344 ms, max = 0.332031 ms, mean = 0.0860807 ms, median = 0.0875244 ms, percentile(99%) = 0.105469 ms
[07/22/2022-08:01:17] [I] GPU Compute Time: min = 2.50195 ms, max = 9.0508 ms, mean = 6.9321 ms, median = 7.29883 ms, percentile(99%) = 7.81812 ms
[07/22/2022-08:01:17] [I] D2H Latency: min = 0.00195312 ms, max = 0.268066 ms, mean = 0.0135467 ms, median = 0.0107422 ms, percentile(99%) = 0.0585938 ms
[07/22/2022-08:01:17] [I] Total Host Walltime: 29.9981 s
[07/22/2022-08:01:17] [I] Total GPU Compute Time: 113.576 s
[07/22/2022-08:01:17] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[07/22/2022-08:01:17] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[07/22/2022-08:01:17] [W] * GPU compute time is unstable, with coefficient of variance = 9.88314%.
[07/22/2022-08:01:17] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/22/2022-08:01:17] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/22/2022-08:01:17] [V]
[07/22/2022-08:01:17] [V] === Explanations of the performance metrics ===
[07/22/2022-08:01:17] [V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
[07/22/2022-08:01:17] [V] GPU Compute Time: the GPU latency to execute the kernels for a query.
[07/22/2022-08:01:17] [V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
[07/22/2022-08:01:17] [V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
[07/22/2022-08:01:17] [V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
[07/22/2022-08:01:17] [V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
[07/22/2022-08:01:17] [V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
[07/22/2022-08:01:17] [V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
[07/22/2022-08:01:17] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # ./trtexec --onnx=swin.onnx --verbose --device=0 --iterations=4096 --streams=4 --threads

zerollzeng avatar Jul 22 '22 15:07 zerollzeng

@GremSnoort Can you share your onnx model?

zerollzeng avatar Jul 22 '22 15:07 zerollzeng

Can you provide an ONNX model or sample that can reproduce this error? Then I can file an internal bug to track it.

https://drive.google.com/drive/u/0/folders/1GyO9rtMkafBahOz3zn9yEOJ2fPooxd6i - link to onnx model

@zerollzeng The swin.onnx model from @Dantes4u reproduces the same error for me.

  • Models from me to reproduce:
    • https://drive.google.com/file/d/1uF6LT99RMeWaXjjgJAiqBPHZyMi7t_rN/view?usp=sharing
    • https://drive.google.com/file/d/1M8dbuTzBmv2_t16fsVuZMLBn9wixZm0z/view?usp=sharing

Steps to reproduce, as they work for me:

  • Here is my own prod image (Ubuntu 20.04 with CUDA 11.6.2 and TensorRT 8.4.1.5): https://hub.docker.com/r/gremsnoort/ubuntu_2004_tensorrt_devel/tags, based on the Dockerfile https://github.com/NVIDIA/TensorRT/blob/release/8.4/docker/ubuntu-20.04.Dockerfile
  • Run the container (with a volume for the models): docker run -it --gpus all --volume /onnx/:/onnx/ gremsnoort/ubuntu_2004_tensorrt_devel:8.4.1.5 bash
  • Build the trtexec tool from the sources at https://github.com/NVIDIA/TensorRT :
    • Check out the release/8.4 branch
  • Run gdb --args ./trtexec --onnx=/onnx/swin.onnx --verbose --device=0 --iterations=4096 --streams=4 --threads
  • Wait some time (it varies between models) until the error appears:
[07/22/2022-18:03:42] [I] Starting inference
[New Thread 0x7f378a3cf000 (LWP 319)]
[New Thread 0x7f378895b000 (LWP 320)]
[New Thread 0x7f378815a000 (LWP 321)]
[New Thread 0x7f3787959000 (LWP 322)]
terminate called after throwing an instance of 'nvinfer1::InternalError'
 what():  Assertion mUsedAllocators.find(alloc) != mUsedAllocators.end() && "Myelin free callback called with invalid MyelinAllocator" failed. 

Thread 6 "trtexec" received signal SIGABRT, Aborted.
[Switching to Thread 0x7f378a3cf000 (LWP 319)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f37d8438859 in __GI_abort () at abort.c:79
#2  0x00007f37d8810911 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007f37d881c38c in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f37d881b369 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f37d881bd21 in __gxx_personality_v0 () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f37d8618bef in ?? () from /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
#7  0x00007f37d8619281 in _Unwind_RaiseException () from /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
#8  0x00007f37d881c69c in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f37dc5f7aa5 in ?? () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
#10 0x00007f37dc60747c in ?? () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
#11 0x00007f37dcf02b6f in ?? () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
#12 0x00007f37dcebdf2d in ?? () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
#13 0x00007f37dcf15472 in ?? () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
#14 0x00007f37dc606fc5 in ?? () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
#15 0x00007f37dc1900e0 in ?? () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
#16 0x00007f37dc60c2a4 in ?? () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
#17 0x00007f37dccf1bc1 in ?? () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
#18 0x00007f37dccf238c in ?? () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
#19 0x000055728bfeaf32 in nvinfer1::IExecutionContext::enqueueV2(void* const*, CUstream_st*, CUevent_st**) ()
#20 0x000055728bfdf84d in sample::(anonymous namespace)::EnqueueExplicit::operator()(sample::TrtCudaStream&) const ()
#21 0x000055728bfe75fa in std::_Function_handler<bool (sample::TrtCudaStream&), sample::(anonymous namespace)::EnqueueExplicit>::_M_invoke(std::_Any_data const&, sample::TrtCudaStream&) ()
#22 0x000055728bff5d73 in std::function<bool (sample::TrtCudaStream&)>::operator()(sample::TrtCudaStream&) const ()
#23 0x000055728bfe527f in sample::(anonymous namespace)::Iteration<nvinfer1::IExecutionContext>::query(bool) ()
#24 0x000055728bfe3d80 in bool sample::(anonymous namespace)::inferenceLoop<nvinfer1::IExecutionContext>(std::vector<std::unique_ptr<sample::(anonymous namespace)::Iteration<nvinfer1::IExecutionContext>, std::default_delete<sample::(anonymous namespace)::Iteration<nvinfer1::IExecutionContext> > >, std::allocator<std::unique_ptr<sample::(anonymous namespace)::Iteration<nvinfer1::IExecutionContext>, std::default_delete<sample::(anonymous namespace)::Iteration<nvinfer1::IExecutionContext> > > > >&, std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&, sample::TrtCudaEvent const&, int, float, float, std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> >&, bool, float) ()
#25 0x000055728bfe2577 in void sample::(anonymous namespace)::inferenceExecution<nvinfer1::IExecutionContext>(sample::InferenceOptions const&, sample::InferenceEnvironment&, sample::(anonymous namespace)::SyncStruct&, int, int, int, std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> >&) ()
#26 0x000055728bfea365 in void std::__invoke_impl<void, void (*)(sample::InferenceOptions const&, sample::InferenceEnvironment&, sample::(anonymous namespace)::SyncStruct&, int, int, int, std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> >&), std::reference_wrapper<sample::InferenceOptions const>, std::reference_wrapper<sample::InferenceEnvironment>, std::reference_wrapper<sample::(anonymous namespace)::SyncStruct>, int, int, int, std::reference_wrapper<std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> > > >(std::__invoke_other, void (*&&)(sample::InferenceOptions const&, sample::InferenceEnvironment&, sample::(anonymous namespace)::SyncStruct&, int, int, int, std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> >&), std::reference_wrapper<sample::InferenceOptions const>&&, std::reference_wrapper<sample::InferenceEnvironment>&&, std::reference_wrapper<sample::(anonymous namespace)::SyncStruct>&&, int&&, int&&, int&&, std::reference_wrapper<std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> > >&&) ()
#27 0x000055728bfea18b in std::__invoke_result<void (*)(sample::InferenceOptions const&, sample::InferenceEnvironment&, sample::(anonymous namespace)::SyncStruct&, int, int, int, std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> >&), std::reference_wrapper<sample::InferenceOptions const>, std::reference_wrapper<sample::InferenceEnvironment>, std::reference_wrapper<sample::(anonymous namespace)::SyncStruct>, int, int, int, std::reference_wrapper<std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> > > >::type std::__invoke<void (*)(sample::InferenceOptions const&, sample::InferenceEnvironment&, sample::(anonymous namespace)::SyncStruct&, int, int, int, std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> >&), std::reference_wrapper<sample::InferenceOptions const>, std::reference_wrapper<sample::InferenceEnvironment>, std::reference_wrapper<sample::(anonymous namespace)::SyncStruct>, int, int, int, std::reference_wrapper<std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> > > >(void (*&&)(sample::InferenceOptions const&, sample::InferenceEnvironment&, sample::(anonymous namespace)::SyncStruct&, int, int, int, std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> >&), std::reference_wrapper<sample::InferenceOptions const>&&, std::reference_wrapper<sample::InferenceEnvironment>&&, std::reference_wrapper<sample::(anonymous namespace)::SyncStruct>&&, int&&, int&&, int&&, std::reference_wrapper<std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> > >&&) ()
#28 0x000055728bfe9fa0 in void std::thread::_Invoker<std::tuple<void (*)(sample::InferenceOptions const&, sample::InferenceEnvironment&, sample::(anonymous namespace)::SyncStruct&, int, int, int, std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> >&), std::reference_wrapper<sample::InferenceOptions const>, std::reference_wrapper<sample::InferenceEnvironment>, std::reference_wrapper<sample::(anonymous namespace)::SyncStruct>, int, int, int, std::reference_wrapper<std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> > > > >::_M_invoke<0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul, 7ul>(std::_Index_tuple<0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul, 7ul>) ()
#29 0x000055728bfe9eb6 in std::thread::_Invoker<std::tuple<void (*)(sample::InferenceOptions const&, sample::InferenceEnvironment&, sample::(anonymous namespace)::SyncStruct&, int, int, int, std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> >&), std::reference_wrapper<sample::InferenceOptions const>, std::reference_wrapper<sample::InferenceEnvironment>, std::reference_wrapper<sample::(anonymous namespace)::SyncStruct>, int, int, int, std::reference_wrapper<std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> > > > >::operator()() ()
#30 0x000055728bfe9e9a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(sample::InferenceOptions const&, sample::InferenceEnvironment&, sample::(anonymous namespace)::SyncStruct&, int, int, int, std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> >&), std::reference_wrapper<sample::InferenceOptions const>, std::reference_wrapper<sample::InferenceEnvironment>, std::reference_wrapper<sample::(anonymous namespace)::SyncStruct>, int, int, int, std::reference_wrapper<std::vector<sample::InferenceTrace, std::allocator<sample::InferenceTrace> > > > > >::_M_run() ()
#31 0x00007f37d8848de4 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#32 0x00007f37d895c609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#33 0x00007f37d8535133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)

IMHO, the problem is in nvinfer1::IExecutionContext::enqueueV2(void* const*, CUstream_st*, CUevent_st**) (asynchronous inference execution) in the multithreaded case: in my code, switching from enqueueV2 to executeV2 (synchronous inference execution) simply fixed the problem. But I strictly need the asynchronous version.
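
In the Python repro posted earlier in this thread, the equivalent swap is execute_async_v2 -> execute_v2; a minimal sketch (reusing the context, bindings, stream, and buffer names from the TrtModel.run() example above, purely as an illustration and not a verified fix):

from pycuda import driver

def infer_sync(context, bindings, stream, inputs_gpu, inputs_cpu, outputs_gpu, outputs_cpu):
    # Copy inputs to the device and wait for the copy to finish.
    driver.memcpy_htod_async(inputs_gpu, inputs_cpu, stream)
    stream.synchronize()
    # Synchronous inference (counterpart of the C++ executeV2); blocks until done.
    ok = context.execute_v2(bindings)
    # Copy outputs back to the host.
    driver.memcpy_dtoh_async(outputs_cpu, outputs_gpu, stream)
    stream.synchronize()
    return ok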

GremSnoort avatar Jul 22 '22 18:07 GremSnoort

@zerollzeng Were you able to reproduce this error with other models following the steps described above?

napest avatar Jul 27 '22 09:07 napest

I suspect this might be an environment setup issue. We just released the official Docker image 22.07, which ships with TRT 8.4.1. Can anyone try to reproduce it in this container? If it still fails, I will file an internal bug to track this.

refer to https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/rel_22-07.html#rel_22-07

zerollzeng avatar Jul 30 '22 02:07 zerollzeng

@zerollzeng The error still reproduces. Following https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/rel_22-07.html#rel_22-07 and https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorrt, I used the image:

docker pull nvcr.io/nvidia/tensorrt:22.07-py3
  • Build the samples by running make in the /workspace/tensorrt/samples directory.

  • Running on several different machines with the following configurations:

    • NVIDIA GeForce GTX 1050 Ti (Driver Version 515.48.07, CUDA Version 11.7)
    • NVIDIA GeForce RTX 3060 (Driver Version 515.48.07, CUDA Version 11.7)
    • NVIDIA GeForce RTX 2060 (Driver Version 515.48.07, CUDA Version 11.7)
  • Run the resulting executables from the /workspace/tensorrt/bin directory:
gdb -args ./trtexec_debug --onnx=/onnx/basemodel3_22.onnx --verbose --device=0 --iterations=4096 --streams=4 --threads

or

gdb --args ./trtexec --onnx=/onnx/basemodel3_22.onnx --verbose --device=0 --iterations=4096 --streams=4 --threads
  • Catch the same error:
[08/01/2022-10:06:12] [I] Starting inference
[New Thread 0x7f6adf7fe000 (LWP 6829)]
[New Thread 0x7f6add187000 (LWP 6830)]
[New Thread 0x7f6adc986000 (LWP 6831)]
[New Thread 0x7f6a8da2d000 (LWP 6832)]
terminate called after throwing an instance of 'nvinfer1::InternalError'
terminate called recursively
  what():  Assertion mUsedAllocators.find(alloc) != mUsedAllocators.end() && "Myelin free callback called with invalid MyelinAllocator" failed. 

Thread 6 "trtexec_debug" received signal SIGABRT, Aborted.
...
#19 0x000055e5d3756021 in nvinfer1::IExecutionContext::enqueueV2 (this=0x55e645464ed0, bindings=0x55e5ef28ea50, stream=0x7fb594000df0, inputConsumed=0x0) at /usr/include/x86_64-linux-gnu/NvInferRuntime.h:2358
#20 0x000055e5d374a1b9 in sample::(anonymous namespace)::EnqueueExplicit::operator() (this=0x7fb594000c78, stream=...) at ../common/sampleInference.cpp:425
#21 0x000055e5d37522bd in std::_Function_handler<bool(sample::TrtCudaStream&), sample::(anonymous namespace)::EnqueueExplicit>::_M_invoke(const std::_Any_data &, sample::TrtCudaStream &) (__functor=..., __args#0=...) at /usr/include/c++/9/bits/std_function.h:285
#22 0x000055e5d37631e3 in std::function<bool (sample::TrtCudaStream&)>::operator()(sample::TrtCudaStream&) const (this=0x7fb594000c78, __args#0=...) at /usr/include/c++/9/bits/std_function.h:688
#23 0x000055e5d374fe77 in sample::(anonymous namespace)::Iteration<nvinfer1::IExecutionContext>::query (this=0x7fb594000c60, skipTransfers=false) at ../common/sampleInference.cpp:589
#24 0x000055e5d374e706 in sample::(anonymous namespace)::inferenceLoop<nvinfer1::IExecutionContext> (iStreams=std::vector of length 1, capacity 1 = {...}, cpuStart=..., gpuStart=..., iterations=4096, maxDurationMs=3200, warmupMs=200, 
    trace=std::vector of length 671, capacity 1024 = {...}, skipTransfers=false, idleMs=0) at ../common/sampleInference.cpp:797

P.S. During engine generation, the SWIN model shows a large latency on this layer:

[08/01/2022-09:53:54] [V] [TRT] --------------- Timing Runner: {ForeignNode[Reshape_8 + Transpose_9...(Unnamed Layer* 1474) [ElementWise]]} (Myelin)

GremSnoort avatar Aug 01 '22 10:08 GremSnoort

I can reproduce it using https://drive.google.com/file/d/1uF6LT99RMeWaXjjgJAiqBPHZyMi7t_rN/view?usp=sharing and will file an internal bug to track this. BTW, if I don't use gdb, the error doesn't show up.

zerollzeng avatar Aug 03 '22 02:08 zerollzeng

@zerollzeng For me this error reproduces the same way with or without gdb, and for both versions of trtexec: trtexec_debug and trtexec. It may not reproduce every time, but it has happened more than twice in a row for me.

GremSnoort avatar Aug 03 '22 13:08 GremSnoort

@zerollzeng Hi, I want to know whether specifying the compute capability is supported when converting with trtexec (from, say, ONNX). It's really painful when I'm trying to migrate a TRT model to another machine (typically one with a lower compute capability), yet I can't do the trtexec conversion locally, since I want batching supported and that often requires a lot of memory.

leemengwei avatar Aug 09 '22 13:08 leemengwei

It's really painful when I'm trying to migrate a TRT model to another machine (typically with lower compute capability) -> TRT doesn't allow it; you won't get the best performance and may even run into errors.
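
For what it's worth, a minimal sketch (assuming pycuda is available, as in the repro code earlier in this thread) for checking the target GPU's compute capability, so you know the engine has to be rebuilt on that machine rather than copied over:

import pycuda.driver as cuda

cuda.init()
dev = cuda.Device(0)
major, minor = dev.compute_capability()
print(f"GPU 0: {dev.name()}, compute capability {major}.{minor}")
# If this differs from the GPU the .plan file was built on, rebuild the engine
# on this machine (e.g. re-run trtexec with --onnx=... --saveEngine=...).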

zerollzeng avatar Aug 10 '22 14:08 zerollzeng

Hi @zerollzeng , Is there any way to avoid this problem?

duduscript avatar Dec 09 '22 03:12 duduscript

@duduscript Did you try the latest release, e.g. 8.5 GA? Is it still reproducible? If yes, can you provide the ONNX model to us? Thanks!

zerollzeng avatar Dec 11 '22 12:12 zerollzeng

@zerollzeng I tested it with 8.5 GA and the problem is gone, thanks.

duduscript avatar Dec 23 '22 06:12 duduscript

This should be fixed. If you still see this error, please reopen this bug. Thanks!

zerollzeng avatar Dec 23 '22 15:12 zerollzeng