Ryan McCormick
Hi @Hassan313, sorry for the delay. A few questions come to mind: 1. Are the inference results correct on fp32/fp16 engines? If yes, then it is probably an int8 calibration issue. If...
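A minimal sketch of the fp16-vs-int8 sanity check described above, assuming both engine variants are already loaded in Triton; the model names (`resnet50_fp16`, `resnet50_int8`) and tensor names (`input`, `output`) are placeholders:

```python
# Rough sketch for sanity-checking fp16 vs. int8 engine outputs through Triton.
# Model and tensor names below are placeholders -- substitute your own.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
data = np.random.rand(1, 3, 224, 224).astype(np.float32)

def run(model_name):
    inp = httpclient.InferInput("input", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    return client.infer(model_name, inputs=[inp]).as_numpy("output")

fp16_out = run("resnet50_fp16")
int8_out = run("resnet50_int8")

# A deviation much larger than normal int8 quantization error points at calibration.
print("max abs diff:", np.abs(fp16_out - int8_out).max())
print("close:", np.allclose(fp16_out, int8_out, atol=1e-1))
```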
CC @dyastremsky since @jbkyang-nvi is on vacation
CC @szalpal
Hi @lminer, Please share the full error output/log you're getting for this issue. Also, please share the version of Triton you're using, GPU type, and other [issue template](https://github.com/triton-inference-server/server/blob/main/.github/ISSUE_TEMPLATE/bug_report.md) information....
Ah, I misread that as CUDA shared memory. Can you try to isolate the error to the specific lines it is failing at and capture the traceback/exception being raised, if any?...
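One lightweight pattern for that, with the step labels and client calls as placeholders for whatever is actually failing:

```python
# Wrap each suspicious call so the failing step and its full traceback are captured.
import traceback

def run_step(label, fn, *args, **kwargs):
    """Run one step, printing which step failed and its full traceback."""
    try:
        return fn(*args, **kwargs)
    except Exception:
        print(f"Failed at step: {label}")
        traceback.print_exc()  # paste this output into the issue
        raise

# Example usage (placeholders for the actual client calls):
# run_step("set_data_from_numpy", inp.set_data_from_numpy, data)
# run_step("infer", client.infer, "my_model", inputs=[inp])
```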
Hi @nrepesh, Re: the TensorFlow backend, it is a known limitation that TensorFlow does not release any memory it allocates until the backend is completely unloaded. There is a FAQ on...
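If the server is started with `--model-control-mode=explicit`, one workaround sketch is to unload the TensorFlow model when it is not needed; note that memory is only returned once the backend itself is fully unloaded, so this assumes no other TF models remain loaded (`my_tf_model` is a placeholder name):

```python
# Sketch: with explicit model control, unload the TF model so the backend can be
# torn down and its memory returned, then reload it later when needed.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
client.unload_model("my_tf_model")  # memory is only freed once the backend unloads
# ... later, when the model is needed again ...
client.load_model("my_tf_model")
```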
Hi @Leelaobai, Is this a memory leak over time as new requests come in? Or do you simply not have enough GPU memory to have both models loaded and...
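One way to tell the two cases apart is to watch Triton's Prometheus metrics while requests come in; this sketch assumes the default metrics port (8002) and the `nv_gpu_memory_used_bytes` gauge:

```python
# Poll the metrics endpoint while traffic flows: steadily climbing GPU memory
# suggests a leak; a flat, too-high value suggests a simple capacity problem.
import time
import urllib.request

def gpu_memory_lines():
    with urllib.request.urlopen("http://localhost:8002/metrics") as resp:
        text = resp.read().decode()
    return [l for l in text.splitlines() if l.startswith("nv_gpu_memory_used_bytes")]

for _ in range(10):
    print(gpu_memory_lines())
    time.sleep(30)
```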
CC @tanmayv25 @Tabrizian
Hi @jhm0104666, Regarding this point: > The MLPerf inference result (v2.0) from NVIDIA shows that a single A100 with Triton, TensorRT gets ~20k resnet50 performance in the "server scenario"....
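For a rough local comparison, throughput can be measured with `perf_analyzer`; in this sketch the model name and concurrency range are placeholders:

```python
# Sketch: sweep concurrency against a running Triton server with perf_analyzer
# and read the reported throughput/latency from its output.
import subprocess

subprocess.run([
    "perf_analyzer",
    "-m", "resnet50",               # placeholder model name
    "-u", "localhost:8001",
    "-i", "grpc",
    "--concurrency-range", "1:64:8",
])
```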
> It will take some time for me to run the model on Polygraphy because I didn't use that earlier. @Vinayaks117 Hopefully something like this will get you started (assuming...
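A minimal Polygraphy sketch along those lines, assuming an ONNX model at a placeholder path `model.onnx` (module paths follow the examples shipped with Polygraphy and may vary between versions):

```python
# Compare ONNX Runtime and TensorRT outputs for the same ONNX model.
from polygraphy.backend.onnxrt import OnnxrtRunner, SessionFromOnnx
from polygraphy.backend.trt import EngineFromNetwork, NetworkFromOnnxPath, TrtRunner
from polygraphy.comparator import Comparator

build_onnxrt_session = SessionFromOnnx("model.onnx")
build_engine = EngineFromNetwork(NetworkFromOnnxPath("model.onnx"))

runners = [
    OnnxrtRunner(build_onnxrt_session),
    TrtRunner(build_engine),
]

run_results = Comparator.run(runners)
print("Outputs match:", bool(Comparator.compare_accuracy(run_results)))
```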