FasterTransformer
CUDA memory is not released after inference
Branch/Tag/Commit
Based on v5.3, with https://github.com/NVIDIA/FasterTransformer/commit/e2dd1641880840db76b8902b34106c85b026a0af merged in to fix early_stop.
Docker Image Version
Built the Triton image triton_with_ft:22.12 following fastertransformer_backend.
GPU name
A30
CUDA Driver
510.47.03
MODEL
huggingface bloom-560m, converted from the PyTorch checkpoint to FasterTransformer binary format with huggingface_bloom_convert.py, tp=1, dt=fp32; model repository layout from https://github.com/triton-inference-server/fastertransformer_backend/tree/v1.4/all_models/bloom
Reproduced Steps
- docker run -it --rm --gpus=all --shm-size=1g --ulimit memlock=-1 -v /data:/data --net=host triton_with_ft:22.12 bash
- CUDA_VISIBLE_DEVICES=0 tritonserver --model-repository=/data/bloom --strict-model-config=false --log-verbose=0 --metrics-port=6800 --grpc-port=6801
  After the ensemble model is loaded, GPU memory used is 1854MiB (a readiness-check sketch follows the nvidia-smi output below):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A30          Off  | 00000000:35:00.0 Off |                    0 |
| N/A   38C    P0    34W / 165W |   1854MiB / 24576MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
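Before sending any traffic, server and model readiness can be confirmed with the same tritonclient gRPC API that test.py below uses. A minimal sketch; the port 6801 comes from the --grpc-port flag above:

import tritonclient.grpc as grpcclient

# Port 6801 matches the --grpc-port passed to tritonserver above.
client = grpcclient.InferenceServerClient("localhost:6801")

# All three calls return plain booleans over gRPC.
print("server live :", client.is_server_live())
print("server ready:", client.is_server_ready())
print("model ready :", client.is_model_ready("ensemble"))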
- Run inference with the test.py script below: 'python3 test.py'
import argparse

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype


def prepare_tensor(name, input):
    t = grpcclient.InferInput(
        name, input.shape, np_to_triton_dtype(input.dtype))
    t.set_data_from_numpy(input)
    return t


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--output-len',
                        type=int,
                        default=45,
                        required=False,
                        help='the number of tokens we expect the model to generate')
    FLAGS = parser.parse_args()

    url = "localhost:6801"
    model_name = "ensemble"
    with grpcclient.InferenceServerClient(url) as client:
        input0 = []
        stop_words_list = []
        bad_words_list = []
        # Batch size 128: the same prompt is repeated 128 times per request.
        for i in range(128):
            input0.append(['A triumphal arch is a free-standing monumental structure in the shape of an archway with one or more arched passageways, often designed to span a road. In its simplest form a triumphal arch consists of two massive piers connected by an arch, crowned with a flat entablature or attic on which a statue might be mounted or which bears commemorative inscriptions. The main structure is often decorated with carvings, sculpted reliefs, and dedications. More elaborate triumphal arches may have multiple archways.'])
            bad_words_list.append([""])
            stop_words_list.append([""])

        input0_data = np.array(input0).astype(object)
        output0_len = np.ones_like(input0).astype(np.uint32) * FLAGS.output_len
        is_return_log_probs = True * np.ones([input0_data.shape[0], 1]).astype(bool)
        inputs = [
            prepare_tensor("INPUT_0", input0_data),
            prepare_tensor("INPUT_1", output0_len),
            prepare_tensor("is_return_log_probs", is_return_log_probs),
            prepare_tensor("INPUT_2", np.array(bad_words_list).astype(object)),
            prepare_tensor("INPUT_3", np.array(stop_words_list).astype(object)),
        ]
        # Send the same batched request five times.
        for i in range(5):
            result = client.infer(model_name, inputs)
            output0 = result.as_numpy("OUTPUT_0")
            # print(output0)
            print(np.size(output0))
Batching is done with for i in range(128). After inference, GPU memory used 5054MiB:
|===============================+======================+======================|
|   0  NVIDIA A30          Off  | 00000000:35:00.0 Off |                    0 |
| N/A   38C    P0    33W / 165W |   6110MiB / 24576MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
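To make the growth visible per request, the memory counter can be read after every call. A sketch, assuming the pynvml package is installed in the client container (it reports the same counter as nvidia-smi); the prompt string is abbreviated here and stands for the same text used in test.py above:

import numpy as np
import pynvml
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype


def prepare_tensor(name, input):
    t = grpcclient.InferInput(name, input.shape, np_to_triton_dtype(input.dtype))
    t.set_data_from_numpy(input)
    return t


pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0, as in CUDA_VISIBLE_DEVICES=0

batch = 128
prompt = 'A triumphal arch is a free-standing monumental structure ...'  # same prompt as in test.py
input0_data = np.array([[prompt]] * batch).astype(object)
output0_len = np.full([batch, 1], 45, dtype=np.uint32)
inputs = [
    prepare_tensor("INPUT_0", input0_data),
    prepare_tensor("INPUT_1", output0_len),
    prepare_tensor("is_return_log_probs", np.ones([batch, 1], dtype=bool)),
    prepare_tensor("INPUT_2", np.array([[""]] * batch).astype(object)),
    prepare_tensor("INPUT_3", np.array([[""]] * batch).astype(object)),
]

with grpcclient.InferenceServerClient("localhost:6801") as client:
    for i in range(5):
        client.infer("ensemble", inputs)
        used_mib = pynvml.nvmlDeviceGetMemoryInfo(handle).used // (1024 * 1024)
        # If the backend freed its buffers after each forward pass, this value
        # would fall back toward the post-load baseline (~1854MiB) instead of
        # staying at the high-water mark.
        print(f"after request {i}: {used_mib} MiB used")

pynvml.nvmlShutdown()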
Then I change the batch size to 512 (for i in range(512)); GPU memory used 14686MiB:
|===============================+======================+======================|
|   0  NVIDIA A30          Off  | 00000000:35:00.0 Off |                    0 |
| N/A   39C    P0    34W / 165W |  18846MiB / 24576MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
Then I change the query input to a longer prompt (batch still 512); GPU memory used 24062MiB. The new input is:
input0.append(['A triumphal arch is a free-standing monumental structure in the shape of an archway with one or more arched passageways, often designed to span a road. In its simplest form a triumphal arch consists of two massive piers connected by an arch, crowned with a flat entablature or attic on which a statue might be mounted or which bears commemorative inscriptions. The main structure is often decorated with carvings, sculpted reliefs, and dedications. More elaborate triumphal arches may have multiple archways.Triumphal arches are one of the most influential and distinctive types of architecture associated with ancient Rome. Thought to have been invented by the Romans, the Roman triumphal arch was used to commemorate victorious generals or significant public events such as the founding of new colonies, the construction of a road or bridge, the death of a member of the imperial family or the ascension of a new emperor.The survival of great Roman triumphal arches such as the Arch of Titus or the Arch of Constantine has inspired many post-Roman states and rulers, up to the present day, to erect their own triumphal arches in emulation of the Romans. Triumphal arches in the Roman style have been built in many cities around the world, most notably the Arc de Triomphe in Paris, the Narva Triumphal Arch in Saint Petersburg, or the Wellington Arch in London.'])
|===============================+======================+======================|
|   0  NVIDIA A30          Off  | 00000000:35:00.0 Off |                    0 |
| N/A   39C    P0    34W / 165W |  24062MiB / 24576MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
The client got an error:
tritonclient.utils.InferenceServerException: [StatusCode.UNIMPLEMENTED] in ensemble 'ensemble', [request id: <id_unknown>] failed to split the output tensor 'error_message' in responses: expected batch size of atleast 512 in model output, got 1
The server logged a warning:
W0508 09:47:32.308338 3166 libfastertransformer.cc:1199] response is nullptr
I think GPU memory is not released after each inference. If each query request is larger than the previous one, GPU memory utilization keeps increasing. In concurrent scenarios the usage grows even faster until OOM, at which point the service becomes unavailable.
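One way to watch this from the outside while traffic is running is to poll Triton's Prometheus metrics endpoint (exposed on --metrics-port=6800 in the command above). A minimal sketch, assuming the standard nv_gpu_memory_used_bytes gauge is enabled:

import re
import time
import urllib.request

# --metrics-port=6800 from the tritonserver command above.
METRICS_URL = "http://localhost:6800/metrics"


def gpu_memory_used_mib():
    """Scrape nv_gpu_memory_used_bytes from the Prometheus text output."""
    text = urllib.request.urlopen(METRICS_URL).read().decode()
    m = re.search(r'^nv_gpu_memory_used_bytes\{[^}]*\}\s+([0-9.eE+]+)', text, re.M)
    return float(m.group(1)) / (1024 * 1024) if m else None


# Poll once a second while clients send requests; a value that only climbs and
# never falls back toward the post-load baseline points at buffers being
# retained between forward passes.
while True:
    used = gpu_memory_used_mib()
    print(time.strftime('%H:%M:%S'),
          f"{used:.0f} MiB used" if used is not None else "metric not found")
    time.sleep(1)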
So I modified fastertransformer/triton_backend/multi_gpu_gpt/ParallelGptTritonModel.cc to set is_free_buffer_after_forward=true and re-ran the steps above, but GPU memory is still not released.
ISSUE
How can GPU memory be freed after inference?
The core of the problem is that GPU memory is not released after each inference, which can be reproduced with the steps above: GPU memory usage keeps accumulating. Thank you very much.
I am seeing a similar behavior.
same problem
same problem
+1