FasterTransformer
CUDA memory is not released after inference
Branch/Tag/Commit
Based on v5.3, with https://github.com/NVIDIA/FasterTransformer/commit/e2dd1641880840db76b8902b34106c85b026a0af merged in to fix early_stop.
Docker Image Version
Built the Triton image triton_with_ft:22.12 following fastertransformer_backend.
GPU name
A30
CUDA Driver
510.47.03
MODEL
huggingface bloom-560m, converted from the PyTorch checkpoint to FasterTransformer binary format with huggingface_bloom_convert.py, tp=1, dt=fp32; model repository layout from https://github.com/triton-inference-server/fastertransformer_backend/tree/v1.4/all_models/bloom
Reproduced Steps
- docker run -it --rm --gpus=all --shm-size=1g --ulimit memlock=-1 -v /data:/data --net=host triton_with_ft:22.12 bash
- CUDA_VISIBLE_DEVICES=0 tritonserver --model-repository=/data/bloom --strict-model-config=false --log-verbose=0 --metrics-port=6800 --grpc-port=6801
  After the ensemble model is loaded, GPU memory used is 1854MiB (a readiness-check sketch follows the nvidia-smi output below):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A30          Off  | 00000000:35:00.0 Off |                    0 |
| N/A   38C    P0    34W / 165W |   1854MiB / 24576MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
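Before sending any traffic, server and model readiness can be confirmed with the same tritonclient gRPC API that test.py below uses. A minimal sketch; the port 6801 comes from the --grpc-port flag above:

import tritonclient.grpc as grpcclient

# Port 6801 matches the --grpc-port passed to tritonserver above.
client = grpcclient.InferenceServerClient("localhost:6801")

# All three calls return plain booleans over gRPC.
print("server live :", client.is_server_live())
print("server ready:", client.is_server_ready())
print("model ready :", client.is_model_ready("ensemble"))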
- Run inference with the test.py script below: 'python3 test.py'
import argparse

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype


def prepare_tensor(name, input):
    t = grpcclient.InferInput(
        name, input.shape, np_to_triton_dtype(input.dtype))
    t.set_data_from_numpy(input)
    return t


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--output-len',
                        type=int,
                        default=45,
                        required=False,
                        help='the number of tokens we expect the model to generate')
    FLAGS = parser.parse_args()

    url = "localhost:6801"
    model_name = "ensemble"
    with grpcclient.InferenceServerClient(url) as client:
        input0 = []
        stop_words_list = []
        bad_words_list = []
        # Batch size 128: the same prompt is repeated 128 times per request.
        for i in range(128):
            input0.append(['A triumphal arch is a free-standing monumental structure in the shape of an archway with one or more arched passageways, often designed to span a road. In its simplest form a triumphal arch consists of two massive piers connected by an arch, crowned with a flat entablature or attic on which a statue might be mounted or which bears commemorative inscriptions. The main structure is often decorated with carvings, sculpted reliefs, and dedications. More elaborate triumphal arches may have multiple archways.'])
            bad_words_list.append([""])
            stop_words_list.append([""])

        input0_data = np.array(input0).astype(object)
        output0_len = np.ones_like(input0).astype(np.uint32) * FLAGS.output_len
        is_return_log_probs = True * np.ones([input0_data.shape[0], 1]).astype(bool)
        inputs = [
            prepare_tensor("INPUT_0", input0_data),
            prepare_tensor("INPUT_1", output0_len),
            prepare_tensor("is_return_log_probs", is_return_log_probs),
            prepare_tensor("INPUT_2", np.array(bad_words_list).astype(object)),
            prepare_tensor("INPUT_3", np.array(stop_words_list).astype(object)),
        ]
        # Send the same batched request five times.
        for i in range(5):
            result = client.infer(model_name, inputs)
            output0 = result.as_numpy("OUTPUT_0")
            # print(output0)
            print(np.size(output0))
Batching is done with for i in range(128). After inference, GPU memory used 5054MiB:
|===============================+======================+======================|
|   0  NVIDIA A30          Off  | 00000000:35:00.0 Off |                    0 |
| N/A   38C    P0    33W / 165W |   6110MiB / 24576MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
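To make the growth visible per request, the memory counter can be read after every call. A sketch, assuming the pynvml package is installed in the client container (it reports the same counter as nvidia-smi); the prompt string is abbreviated here and stands for the same text used in test.py above:

import numpy as np
import pynvml
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype


def prepare_tensor(name, input):
    t = grpcclient.InferInput(name, input.shape, np_to_triton_dtype(input.dtype))
    t.set_data_from_numpy(input)
    return t


pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0, as in CUDA_VISIBLE_DEVICES=0

batch = 128
prompt = 'A triumphal arch is a free-standing monumental structure ...'  # same prompt as in test.py
input0_data = np.array([[prompt]] * batch).astype(object)
output0_len = np.full([batch, 1], 45, dtype=np.uint32)
inputs = [
    prepare_tensor("INPUT_0", input0_data),
    prepare_tensor("INPUT_1", output0_len),
    prepare_tensor("is_return_log_probs", np.ones([batch, 1], dtype=bool)),
    prepare_tensor("INPUT_2", np.array([[""]] * batch).astype(object)),
    prepare_tensor("INPUT_3", np.array([[""]] * batch).astype(object)),
]

with grpcclient.InferenceServerClient("localhost:6801") as client:
    for i in range(5):
        client.infer("ensemble", inputs)
        used_mib = pynvml.nvmlDeviceGetMemoryInfo(handle).used // (1024 * 1024)
        # If the backend freed its buffers after each forward pass, this value
        # would fall back toward the post-load baseline (~1854MiB) instead of
        # staying at the high-water mark.
        print(f"after request {i}: {used_mib} MiB used")

pynvml.nvmlShutdown()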
Then I change the batch size to 512 (for i in range(512)); GPU memory used 14686MiB:
|===============================+======================+======================|
|   0  NVIDIA A30          Off  | 00000000:35:00.0 Off |                    0 |
| N/A   39C    P0    34W / 165W |  18846MiB / 24576MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
Then I change the query input to a longer prompt (batch still 512); GPU memory used 24062MiB. The new input is:
input0.append(['A triumphal arch is a free-standing monumental structure in the shape of an archway with one or more arched passageways, often designed to span a road. In its simplest form a triumphal arch consists of two massive piers connected by an arch, crowned with a flat entablature or attic on which a statue might be mounted or which bears commemorative inscriptions. The main structure is often decorated with carvings, sculpted reliefs, and dedications. More elaborate triumphal arches may have multiple archways.Triumphal arches are one of the most influential and distinctive types of architecture associated with ancient Rome. Thought to have been invented by the Romans, the Roman triumphal arch was used to commemorate victorious generals or significant public events such as the founding of new colonies, the construction of a road or bridge, the death of a member of the imperial family or the ascension of a new emperor.The survival of great Roman triumphal arches such as the Arch of Titus or the Arch of Constantine has inspired many post-Roman states and rulers, up to the present day, to erect their own triumphal arches in emulation of the Romans. Triumphal arches in the Roman style have been built in many cities around the world, most notably the Arc de Triomphe in Paris, the Narva Triumphal Arch in Saint Petersburg, or the Wellington Arch in London.'])
|===============================+======================+======================|
|   0  NVIDIA A30          Off  | 00000000:35:00.0 Off |                    0 |
| N/A   39C    P0    34W / 165W |  24062MiB / 24576MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
The client got an error:
tritonclient.utils.InferenceServerException: [StatusCode.UNIMPLEMENTED] in ensemble 'ensemble', [request id: <id_unknown>] failed to split the output tensor 'error_message' in responses: expected batch size of atleast 512 in model output, got 1
The server logged a warning:
W0508 09:47:32.308338 3166 libfastertransformer.cc:1199] response is nullptr
I think GPU memory is not released after each inference. If each query request is larger than the previous one, GPU memory utilization keeps increasing. In concurrent scenarios the usage grows even faster until OOM, at which point the service becomes unavailable.
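One way to watch this from the outside while traffic is running is to poll Triton's Prometheus metrics endpoint (exposed on --metrics-port=6800 in the command above). A minimal sketch, assuming the standard nv_gpu_memory_used_bytes gauge is enabled:

import re
import time
import urllib.request

# --metrics-port=6800 from the tritonserver command above.
METRICS_URL = "http://localhost:6800/metrics"


def gpu_memory_used_mib():
    """Scrape nv_gpu_memory_used_bytes from the Prometheus text output."""
    text = urllib.request.urlopen(METRICS_URL).read().decode()
    m = re.search(r'^nv_gpu_memory_used_bytes\{[^}]*\}\s+([0-9.eE+]+)', text, re.M)
    return float(m.group(1)) / (1024 * 1024) if m else None


# Poll once a second while clients send requests; a value that only climbs and
# never falls back toward the post-load baseline points at buffers being
# retained between forward passes.
while True:
    used = gpu_memory_used_mib()
    print(time.strftime('%H:%M:%S'),
          f"{used:.0f} MiB used" if used is not None else "metric not found")
    time.sleep(1)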
So I modified fastertransformer/triton_backend/multi_gpu_gpt/ParallelGptTritonModel.cc to set is_free_buffer_after_forward=true and re-ran the steps above, but GPU memory is still not released.
ISSUE
How can GPU memory be freed after inference?
The core of the problem is that GPU memory is not released after each inference, which can be reproduced with the steps above: GPU memory usage keeps accumulating. Thank you very much.
I am seeing a similar behavior.
same problem
same problem
+1