[tensorrt-llm backend] A question about launch_triton_server.py
Question
The code in launch_triton_server.py:
def get_cmd(world_size, tritonserver, grpc_port, http_port, metrics_port,
            model_repo, log, log_file, tensorrt_llm_model_name):
    cmd = ['mpirun', '--allow-run-as-root']
    for i in range(world_size):
        cmd += ['-n', '1', tritonserver, f'--model-repository={model_repo}']
        if log and (i == 0):
            cmd += ['--log-verbose=3', f'--log-file={log_file}']
        # If rank is not 0, skip loading of models other than `tensorrt_llm_model_name`
        if (i != 0):
            cmd += ['--model-control-mode=explicit']
            model_names = tensorrt_llm_model_name.split(',')
            for name in model_names:
                cmd += [f'--load-model={name}']
        cmd += [
            f'--grpc-port={grpc_port}', f'--http-port={http_port}',
            f'--metrics-port={metrics_port}', '--disable-auto-complete-config',
            f'--backend-config=python,shm-region-prefix-name=prefix{i}_', ':'
        ]
    return cmd
When world_size = 2, for example, two Triton servers are launched with the same gRPC port (e.g., 8001). How can this work? When I tried to do something similar, I got the following error while launching the second server:
I0513 03:43:28.353306 21205 grpc_server.cc:2466] Started GRPCInferenceService at 0.0.0.0:8001
I0513 03:43:28.353458 21205 http_server.cc:4636] Started HTTPService at 0.0.0.0:8000
E0513 03:43:28.353559006 21206 chttp2_server.cc:1080] UNKNOWN:No address added out of total 1 resolved for '0.0.0.0:8001' {created_time:"2024-05-13T03:43:28.353510541+00:00", children:[UNKNOWN:Failed to add any wildcard listeners {created_time:"2024-05-13T03:43:28.353503146+00:00", children:[UNKNOWN:Address family not supported by protocol {target_address:"[::]:8001", syscall:"socket", os_error:"Address family not supported by protocol", errno:97, created_time:"2024-05-13T03:43:28.353465612+00:00"}, UNKNOWN:Unable to configure socket {fd:6, created_time:"2024-05-13T03:43:28.353493367+00:00", children:[UNKNOWN:Address already in use {syscall:"bind", os_error:"Address already in use", errno:98, created_time:"2024-05-13T03:43:28.353488259+00:00"}]}]}]}
E0513 03:43:28.353650 21206 main.cc:245] failed to start GRPC service: Unavailable - Socket '0.0.0.0:8001' already in use
Background
I've been developing my own Triton backend, drawing on https://github.com/triton-inference-server/tensorrtllm_backend.
I have already built two tensor-parallel engines (tp_size = 2) for the llama2-7b model.
Running something like mpirun -np 2 python3.8 run.py works: it loads the two engines, runs tensor-parallel inference, and produces correct results.
My goal now is to serve the same two engines with the Triton server.
I have already implemented the run.py logic in my Python backend's model.py (in the initialize() and execute() functions).
Following launch_triton_server.py, I tried the following command line:
mpirun --allow-run-as-root -n 1 /opt/tritonserver/bin/tritonserver --model-repository=./model_repository --grpc-port=8001 --http-port=8000 --metrics-port=8002 --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_ : -n 1 /opt/tritonserver/bin/tritonserver --model-repository=./model_repository --model-control-mode=explicit --load-model=llama2_7b --grpc-port=8001 --http-port=8000 --metrics-port=8002 --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix1_ :
Then I got the error as above.
Could you please tell me what I did wrong and how I can fix the error? Thanks a lot!
In tensorrt_llm_backend, when we launch several servers via MPI with world_size > 1, only rank 0 (the main process) will receive/return requests. The other ranks skip this step and therefore never hit the same-port issue. So you need to do something similar if you want to use a self-defined backend.
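A minimal sketch of that pattern for a Python backend is below. It assumes mpi4py is usable inside the backend's Python environment, that the MPI world actually spans the stub processes of all ranks (which depends on your setup), and that run_tp_forward() is a hypothetical wrapper around the tensor-parallel forward from run.py; none of these names come from this repo, and this is not the tensorrt_llm backend's actual code.

# model.py (Python backend) -- illustrative sketch only.
# Assumes mpi4py works inside the backend stub and that run_tp_forward()
# is a hypothetical helper wrapping the TP forward from run.py.
import numpy as np
from mpi4py import MPI
import triton_python_backend_utils as pb_utils

COMM = MPI.COMM_WORLD
STOP = "__stop__"


class TritonPythonModel:

    def initialize(self, args):
        self.rank = COMM.Get_rank()
        # ... load this rank's engine shard here ...
        if self.rank != 0:
            # Non-zero ranks never serve Triton requests. They block here in a
            # worker loop, so their tritonserver process never gets as far as
            # binding the duplicate HTTP/gRPC/metrics ports.
            while True:
                work = COMM.bcast(None, root=0)
                if isinstance(work, str) and work == STOP:
                    break
                run_tp_forward(work)  # hypothetical TP forward on this rank

    def execute(self, requests):
        responses = []
        for request in requests:
            input_ids = pb_utils.get_input_tensor_by_name(
                request, "input_ids").as_numpy()
            COMM.bcast(input_ids, root=0)            # fan the work out to other ranks
            output_ids = run_tp_forward(input_ids)   # rank 0 joins the same forward
            out = pb_utils.Tensor("output_ids", np.asarray(output_ids))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

    def finalize(self):
        if COMM.Get_rank() == 0:
            COMM.bcast(STOP, root=0)

Because initialize() never returns on ranks other than 0, those tritonserver processes stay in model loading and never start their endpoints, which is what avoids the "Socket already in use" error; whether that behavior is acceptable for your deployment is something to verify.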
Any clue how to resolve this issue? Please let me know.
I hit the same error. Any solutions?
I used world size 4 and it worked.
I used world size 4 but it did not work; world size 2 worked.
Okay
Any examples? We have the same problem: we need to run TRT-LLM in the Python backend with tp_size > 1 for a VLM model.
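Not an official example, but here is an untested sketch of one way to sidestep the port collision at launch time: have only rank 0 open any endpoints at all. tritonserver has --allow-http, --allow-grpc and --allow-metrics switches; how they are combined below is my own assumption, so please check tritonserver --help before relying on it.

# Sketch of a get_cmd variant (hypothetical, not from launch_triton_server.py):
# only rank 0 exposes HTTP/gRPC/metrics, so the other ranks cannot collide on
# the same ports even if their models finish loading.
def get_cmd_tp(world_size, tritonserver, grpc_port, http_port, metrics_port,
               model_repo, model_name):
    cmd = ['mpirun', '--allow-run-as-root']
    for i in range(world_size):
        cmd += ['-n', '1', tritonserver, f'--model-repository={model_repo}']
        if i == 0:
            cmd += [
                f'--grpc-port={grpc_port}', f'--http-port={http_port}',
                f'--metrics-port={metrics_port}',
            ]
        else:
            # Assumed flags (verify with `tritonserver --help`): disable every
            # endpoint on non-leader ranks and only load the TP model.
            cmd += [
                '--allow-http=false', '--allow-grpc=false',
                '--allow-metrics=false', '--model-control-mode=explicit',
                f'--load-model={model_name}',
            ]
        cmd += [
            '--disable-auto-complete-config',
            f'--backend-config=python,shm-region-prefix-name=prefix{i}_', ':'
        ]
    return cmd

The ranks still have to coordinate the actual tensor-parallel forward among themselves (for example with an MPI worker loop like the one sketched above); disabling the extra endpoints only removes the duplicate bind on 0.0.0.0:8001.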