fastertransformer_backend
FT backend crashes Triton server if batch size is too large
Description
Branch: main
Docker version: 22.03
GPU type: 2x NVIDIA RTX A6000
Reproduction Steps
- Load a model with the fastertransformer backend.
- Send a request with a batch size too large to fit in GPU memory.
The server crashes with:
terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: out of memory /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/memory_utils.cu:26
[gv013:3677168] *** Process received signal ***
[gv013:3677168] Signal: Aborted (6)
[gv013:3677168] Signal code: (-6)
[gv013:3677168] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x1472e53a8420]
[gv013:3677168] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x1472e3d9c00b]
[gv013:3677168] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x1472e3d7b859]
[gv013:3677168] [ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x1472e4155911]
[gv013:3677168] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x1472e416138c]
[gv013:3677168] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x1472e41613f7]
[gv013:3677168] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x1472e41616a9]
[gv013:3677168] [ 7] /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so(_ZN17fastertransformer5checkI9cudaErrorEEvT_PKcS4_i+0x219)[0x147265e83ab9]
[gv013:3677168] [ 8] /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so(_ZN17fastertransformer12deviceMallocI6__halfEEvPPT_ib+0x36)[0x147265ff6146]
[gv013:3677168] [ 9] /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so(_ZN17fastertransformer10GptJWeightI6__halfE13mallocWeightsEv+0x60)[0x147265eccc40]
[gv013:3677168] [10] /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so(_ZN17fastertransformer10GptJWeightI6__halfEC2Eiiiiiiiii+0x148)[0x147265ed05e8]
[gv013:3677168] [11] /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so(_ZN15GptJTritonModelI6__halfE19createModelInstanceEiiP11CUstream_stSt4pairISt6vectorIP8ncclCommSaIS7_EES9_ESt10shared_ptrIN17fastertransformer18AbstractCustomCommEE+0x3f7)[0x147265ec23e7]
[gv013:3677168] [12] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x16eb3)[0x1472daa44eb3]
[gv013:3677168] [13] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x173c2)[0x1472daa453c2]
[gv013:3677168] [14] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x2b23d)[0x1472daa5923d]
[gv013:3677168] [15] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x1472e418dde4]
[gv013:3677168] [16] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x1472e539c609]
[gv013:3677168] [17] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x1472e3e78133]
[gv013:3677168] *** End of error message ***
It would be better if the FT backend detected the out-of-memory condition and returned an error for the request, rather than throwing an uncaught std::runtime_error that aborts the whole server.
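To illustrate the behavior I have in mind, here is a minimal sketch of catching the allocation failure at the request boundary and turning it into a per-request error. It does not use the real FT or Triton APIs: `RequestStatus`, `fake_device_malloc`, and `handle_request` are hypothetical names, and GPU memory is simulated with a plain byte count so the example compiles without CUDA.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical status type for this sketch; a real backend would report
// the failure through the server's own error mechanism instead.
struct RequestStatus {
    bool ok;
    std::string message;
};

// Stand-in for a device allocation that throws on failure, similar to the
// check seen in the stack trace. `free_bytes` simulates the available GPU
// memory; no CUDA calls are made here.
void fake_device_malloc(std::size_t bytes, std::size_t free_bytes) {
    if (bytes > free_bytes) {
        throw std::runtime_error("[FT][ERROR] CUDA runtime error: out of memory");
    }
}

// Hypothetical request handler: the exception is caught here, so one
// oversized request fails without terminating the process.
RequestStatus handle_request(std::size_t batch_size,
                             std::size_t bytes_per_item,
                             std::size_t free_bytes) {
    try {
        fake_device_malloc(batch_size * bytes_per_item, free_bytes);
        return {true, ""};
    } catch (const std::runtime_error& e) {
        return {false, e.what()};
    }
}
```

With this structure, a call such as `handle_request(8, 1024, 4096)` returns a failed status carrying the out-of-memory message, while a smaller request on the same server still succeeds. Whether the actual backend can recover cleanly after a failed allocation is an assumption I have not verified.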