Unexpected error: perf_analyzer did not produce any output. It was likely terminated with a SIGABRT
When I use Model Analyzer to profile multiple models (resnext models from torchvision, served with the PyTorch backend), I always get the following output and nothing appears in the profile_results directory:
[Model Analyzer] perf_analyzer took very long to exit, killing perf_analyzer
[Model Analyzer] perf_analyzer did not produce any output.
[Model Analyzer] No changes made to analyzer data, no checkpoint saved.
Traceback (most recent call last):
File "/usr/local/bin/model-analyzer", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/model_analyzer/entrypoint.py", line 267, in main
analyzer.profile(client=client,
File "/usr/local/lib/python3.10/dist-packages/model_analyzer/analyzer.py", line 116, in profile
self._profile_models()
File "/usr/local/lib/python3.10/dist-packages/model_analyzer/analyzer.py", line 220, in _profile_models
self._model_manager.run_models(models=models)
File "/usr/local/lib/python3.10/dist-packages/model_analyzer/model_manager.py", line 130, in run_models
self._stop_ma_if_no_valid_measurement_threshold_reached()
File "/usr/local/lib/python3.10/dist-packages/model_analyzer/model_manager.py", line 218, in _stop_ma_if_no_valid_measurement_threshold_reached
raise TritonModelAnalyzerException(
model_analyzer.model_analyzer_exceptions.TritonModelAnalyzerException: The first 2 attempts to acquire measurements have failed. Please examine the Tritonserver/PA error logs to determine what has gone wrong.
When I check the PA logs generated by MA, I find the following:
Command:
mpiexec --allow-run-as-root --tag-output -n 1 perf_analyzer --enable-mpi -m resnext_torch_config_default -b 1 -u localhost:8001 -i grpc -f resnext_torch_config_default-results.csv --verbose-csv --concurrency-range 1 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000.0 : -n 1 perf_analyzer --enable-mpi -m resnext_torch2_config_default -b 1 -u localhost:8001 -i grpc -f resnext_torch2_config_default-results.csv --verbose-csv --concurrency-range 1 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000.0
Error: perf_analyzer did not produce any output. It was likely terminated with a SIGABRT
I do not know how to solve this.
On further debugging, when I execute the command from the PA logs above, I get the following error:
[1,1]<stdout>:[1692767719.500504] [xiangwang-System-Product-Name:1324 :0] ucp_context.c:1774 UCX WARN UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)
[1,0]<stdout>:[1692767719.533941] [xiangwang-System-Product-Name:1323 :0] ucp_context.c:1774 UCX WARN UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)
[1,0]<stdout>:[1692767719.540637] [xiangwang-System-Product-Name:1323 :0] ucp_context.c:1774 UCX WARN UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)
[1,1]<stdout>:[1692767719.540801] [xiangwang-System-Product-Name:1324 :0] ucp_context.c:1774 UCX WARN UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)
[1,0]<stderr>:error: failed to get model metadata: HTTP client failed: Couldn't connect to server
[1,0]<stderr>:
[1,1]<stderr>:error: failed to get model metadata: HTTP client failed: Couldn't connect to server
[1,1]<stderr>:
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[53772,1],0]
Exit code: 99
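(Note: when this mpiexec command is replayed by hand, the Triton server that Model Analyzer launches in docker is likely no longer running, so the "Couldn't connect to server" message may simply mean nothing is listening on those ports. A quick check, assuming the default Triton HTTP port, is:
curl -v localhost:8000/v2/health/ready    # returns HTTP 200 when the server is up and ready)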
Can you post the full command you are using to run MA? Can you post the tritonserver output log? Have you confirmed that you can load and successfully run the model on a tritonserver?
My command for running MA is as follows:
model-analyzer profile -f config2.yml --triton-launch-mode=docker --triton-docker-shm-size=4g --output-model-repository-path /home/llc/triton-multi-model-profile/results --export-path profile_results
The output log from the Triton server:
[Model Analyzer] Initializing GPUDevice handles
[Model Analyzer] Using GPU 0 NVIDIA GeForce RTX 4090 with UUID GPU-c81ce6eb-cb1a-d91e-4d7f-0aa4bcdd0ef7
[Model Analyzer] WARNING: Overriding the output model repo path "/home/llc/triton-multi-model-profile/results"
[Model Analyzer] Starting a Triton Server using docker
[Model Analyzer] No checkpoint file found, starting a fresh run.
[Model Analyzer] Profiling server only metrics...
[Model Analyzer]
[Model Analyzer] Starting quick mode search to find optimal configs
[Model Analyzer]
[Model Analyzer]
[Model Analyzer] Creating model config: resnext_torch_config_default
[Model Analyzer]
[Model Analyzer]
[Model Analyzer] Creating model config: resnext_torch2_config_default
[Model Analyzer]
[Model Analyzer] Profiling resnext_torch_config_default: client batch size=1, concurrency=1
[Model Analyzer] Profiling resnext_torch2_config_default: client batch size=1, concurrency=1
[Model Analyzer]
It then stayed stuck for a long time and finally produced the error above...
Can you post the contents of the config2.yml file?
Can you also post/generate the triton server output log (you can do this by using the --triton-output-path option)?
Have you confirmed that you can load and successfully run the model on a tritonserver?
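For the server log, for example (a sketch based on your earlier command; the log filename is just a placeholder):
model-analyzer profile -f config2.yml --triton-launch-mode=docker --triton-docker-shm-size=4g --output-model-repository-path /home/llc/triton-multi-model-profile/results --export-path profile_results --triton-output-path triton_server.log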
I have confirmed that I can load and successfully run the model on a tritonserver, and my config file is as follows:
model_repository: "/home/llc/multi-model-triton/examples/quick-start"
run_config_profile_models_concurrently_enable: true
override_output_model_repository: true
client_protocol: "http"
run_config_search_mode: quick
run_config_search_max_instance_count: 1
run_config_search_max_concurrency: 1
run_config_search_min_model_batch_size: 5
run_config_search_max_model_batch_size: 5
num_configs_per_model: 4
profile_models:
  resnext_torch:
    constraints:
      perf_latency_p99:
        max: 20
    model_config_parameters:
      instance_group:
        - kind: KIND_GPU
          count: 1
  resnext_torch2:
    constraints:
      perf_latency_p99:
        max: 20
    model_config_parameters:
      instance_group:
        - kind: KIND_GPU
          count: 1
The triton-server generated log is in the attachment.
Thank you. @matthewkotila this looks like an issue in Perf Analyzer. I don't see anything wrong with MA or the tritonserver.
@wxthu @nv-braf can either of y'all do me a favor and give me minimally reproducible commands that only run PA? This will help my debugging a bunch.
Is this what you're looking for?
Command:
mpiexec --allow-run-as-root --tag-output -n 1 perf_analyzer --enable-mpi -m resnext_torch_config_default -b 1 -u localhost:8001 -i grpc -f resnext_torch_config_default-results.csv --verbose-csv --concurrency-range 1 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000.0 : -n 1 perf_analyzer --enable-mpi -m resnext_torch2_config_default -b 1 -u localhost:8001 -i grpc -f resnext_torch2_config_default-results.csv --verbose-csv --concurrency-range 1 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000.0
Error: perf_analyzer did not produce any output. It was likely terminated with a SIGABRT
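For reference, a stripped-down single-process variant (same flags, but without mpiexec/MPI and without metrics collection) might help isolate whether the SIGABRT comes from the concurrent multi-model run (run_config_profile_models_concurrently_enable: true in the config) or from perf_analyzer itself:
perf_analyzer -m resnext_torch_config_default -b 1 -u localhost:8001 -i grpc --concurrency-range 1 --measurement-mode count_windows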
Ooh gotcha, missed that--thanks!
We're tracking this but we do not currently have an estimate of when we will complete debugging the issue.
Facing the same issue: Error: perf_analyzer did not produce any output. It was likely terminated with a SIGABRT
Any update on this?
[Model Analyzer] perf_analyzer took very long to exit, killing perf_analyzer
[Model Analyzer] perf_analyzer did not produce any output.
[Model Analyzer] Initializing GPUDevice handles
[Model Analyzer] Using GPU 0 NVIDIA A100-SXM4-40GB with UUID GPU-d9a0447f-f8fa-9d2f-79fc-ecf2567dacc2
[Model Analyzer] WARNING: Overriding the output model repo path "./rerenker_output1"
[Model Analyzer] Starting a local Triton Server
[Model Analyzer] Loaded checkpoint from file /model_repositories/checkpoints/0.ckpt
[Model Analyzer] GPU devices match checkpoint - skipping server metric acquisition
[Model Analyzer]
[Model Analyzer] Starting quick mode search to find optimal configs
[Model Analyzer]
[Model Analyzer] Creating model config: reranker_config_default
[Model Analyzer]
[Model Analyzer] Creating model config: bge_reranker_v2_onnx_config_default
[Model Analyzer]
[Model Analyzer] Profiling reranker_config_default: client batch size=1, concurrency=24
[Model Analyzer] Profiling bge_reranker_v2_onnx_config_default: client batch size=1, concurrency=8
[Model Analyzer]
[Model Analyzer] perf_analyzer took very long to exit, killing perf_analyzer
[Model Analyzer] perf_analyzer did not produce any output.
[Model Analyzer] Saved checkpoint to model_repositories/checkpoints/1.ckpt
[Model Analyzer] Creating model config: reranker_config_0
[Model Analyzer] Setting instance_group to [{'count': 1, 'kind': 'KIND_GPU'}]
[Model Analyzer] Setting max_batch_size to 1
[Model Analyzer] Enabling dynamic_batching
[Model Analyzer]
[Model Analyzer] Creating model config: bge_reranker_v2_onnx_config_0
[Model Analyzer] Setting instance_group to [{'count': 1, 'kind': 'KIND_GPU'}]
[Model Analyzer] Setting max_batch_size to 1
[Model Analyzer] Enabling dynamic_batching
[Model Analyzer]
[Model Analyzer] Profiling reranker_config_0: client batch size=1, concurrency=2
[Model Analyzer] Profiling bge_reranker_v2_onnx_config_0: client batch size=1, concurrency=2
[Model Analyzer]
[Model Analyzer] perf_analyzer took very long to exit, killing perf_analyzer
[Model Analyzer] perf_analyzer did not produce any output.
[Model Analyzer] No changes made to analyzer data, no checkpoint saved.
Traceback (most recent call last):
File "/opt/app_venv/bin/model-analyzer", line 8, in
perf_analyzer error log:
Command: mpiexec --allow-run-as-root --tag-output -n 1 perf_analyzer --enable-mpi -m reranker -b 1 -u localhost:8001 -i grpc -f results.csv --verbose-csv --concurrency-range 24 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000 : -n 1 perf_analyzer --enable-mpi -m onnx -b 1 -u localhost:8001 -i grpc -f results.csv --verbose-csv --concurrency-range 8 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000
Error: perf_analyzer did not produce any output. It was likely terminated with a SIGABRT.
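If the SIGABRT is reproducible, capturing a core dump or backtrace from perf_analyzer might narrow it down further (a sketch; it assumes the container allows core dumps, has gdb installed, and that the kernel core_pattern writes a local "core" file):
ulimit -c unlimited                     # allow core dumps in this shell
perf_analyzer -m reranker -b 1 -u localhost:8001 -i grpc --concurrency-range 24 --measurement-mode count_windows
gdb -ex bt $(which perf_analyzer) ./core   # print the backtrace from the core file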