model_analyzer

Unexpected error: perf_analyzer did not produce any output. It was likely terminated with a SIGABRT

Open wxthu opened this issue 2 years ago • 12 comments

When I use Model Analyzer to profile multiple models composed of ResNeXt models from torchvision using the PyTorch backend, I always get the following output and nothing is written to the profile_results directory:

[Model Analyzer] perf_analyzer took very long to exit, killing perf_analyzer
[Model Analyzer] perf_analyzer did not produce any output.
[Model Analyzer] No changes made to analyzer data, no checkpoint saved.
Traceback (most recent call last):
  File "/usr/local/bin/model-analyzer", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/entrypoint.py", line 267, in main
    analyzer.profile(client=client,
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/analyzer.py", line 116, in profile
    self._profile_models()
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/analyzer.py", line 220, in _profile_models
    self._model_manager.run_models(models=models)
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/model_manager.py", line 130, in run_models
    self._stop_ma_if_no_valid_measurement_threshold_reached()
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/model_manager.py", line 218, in _stop_ma_if_no_valid_measurement_threshold_reached
    raise TritonModelAnalyzerException(
model_analyzer.model_analyzer_exceptions.TritonModelAnalyzerException: The first 2 attempts to acquire measurements have failed. Please examine the Tritonserver/PA error logs to determine what has gone wrong.

When I checked the PA logs generated by MA, I found the following:

Command:
mpiexec --allow-run-as-root --tag-output -n 1 perf_analyzer --enable-mpi -m resnext_torch_config_default -b 1 -u localhost:8001 -i grpc -f resnext_torch_config_default-results.csv --verbose-csv --concurrency-range 1 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000.0 : -n 1 perf_analyzer --enable-mpi -m resnext_torch2_config_default -b 1 -u localhost:8001 -i grpc -f resnext_torch2_config_default-results.csv --verbose-csv --concurrency-range 1 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000.0

Error: perf_analyzer did not produce any output. It was likely terminated with a SIGABRT

I do not know how to solve this.

Digging further, when I execute the command from the PA logs above, I get the following error:

[1,1]<stdout>:[1692767719.500504] [xiangwang-System-Product-Name:1324 :0]     ucp_context.c:1774 UCX  WARN  UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)
[1,0]<stdout>:[1692767719.533941] [xiangwang-System-Product-Name:1323 :0]     ucp_context.c:1774 UCX  WARN  UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)
[1,0]<stdout>:[1692767719.540637] [xiangwang-System-Product-Name:1323 :0]     ucp_context.c:1774 UCX  WARN  UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)
[1,1]<stdout>:[1692767719.540801] [xiangwang-System-Product-Name:1324 :0]     ucp_context.c:1774 UCX  WARN  UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)
[1,0]<stderr>:error: failed to get model metadata: HTTP client failed: Couldn't connect to server
[1,0]<stderr>:
[1,1]<stderr>:error: failed to get model metadata: HTTP client failed: Couldn't connect to server
[1,1]<stderr>:
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[53772,1],0]
  Exit code:    99

wxthu avatar Aug 23 '23 04:08 wxthu

Can you post the full command you are using to run MA? Can you post the tritonserver output log? Have you confirmed that you can load and successfully run the model on a tritonserver?
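For that last check, a rough sketch would be to launch the server directly and poll the readiness endpoints (the model repository path here is a placeholder):

tritonserver --model-repository=<path_to_your_model_repository>
curl -v localhost:8000/v2/health/ready
curl -v localhost:8000/v2/models/resnext_torch/ready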

nv-braf avatar Aug 23 '23 15:08 nv-braf

The command I use to run MA is as follows: model-analyzer profile -f config2.yml --triton-launch-mode=docker --triton-docker-shm-size=4g --output-model-repository-path /home/llc/triton-multi-model-profile/results --export-path profile_results

The output log from the Triton server:

[Model Analyzer] Initializing GPUDevice handles
[Model Analyzer] Using GPU 0 NVIDIA GeForce RTX 4090 with UUID GPU-c81ce6eb-cb1a-d91e-4d7f-0aa4bcdd0ef7
[Model Analyzer] WARNING: Overriding the output model repo path "/home/llc/triton-multi-model-profile/results"
[Model Analyzer] Starting a Triton Server using docker
[Model Analyzer] No checkpoint file found, starting a fresh run.
[Model Analyzer] Profiling server only metrics...
[Model Analyzer] 
[Model Analyzer] Starting quick mode search to find optimal configs
[Model Analyzer] 
[Model Analyzer] 
[Model Analyzer] Creating model config: resnext_torch_config_default
[Model Analyzer] 
[Model Analyzer] 
[Model Analyzer] Creating model config: resnext_torch2_config_default
[Model Analyzer] 
[Model Analyzer] Profiling resnext_torch_config_default: client batch size=1, concurrency=1
[Model Analyzer] Profiling resnext_torch2_config_default: client batch size=1, concurrency=1
[Model Analyzer] 

It then hangs for a very long time and finally produces the error above...

wxthu avatar Aug 24 '23 01:08 wxthu

Can you post the contents of the config2.yml file? Can you also post/generate the triton server output log (you can do this by using the --triton-output-path option)? Have you confirmed that you can load and successfully run the model on a tritonserver?
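For example, appending that flag to the profile command you posted should write the server log to a file (the log file name here is just an example):

model-analyzer profile -f config2.yml --triton-launch-mode=docker --triton-docker-shm-size=4g --output-model-repository-path /home/llc/triton-multi-model-profile/results --export-path profile_results --triton-output-path ./triton_server.log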

nv-braf avatar Aug 24 '23 14:08 nv-braf

I have confirmed that I can load and successfully run the model on a tritonserver, and my config file is as follows:

model_repository: "/home/llc/multi-model-triton/examples/quick-start"
run_config_profile_models_concurrently_enable: true
override_output_model_repository: true
client_protocol: "http"
run_config_search_mode: quick
run_config_search_max_instance_count: 1
run_config_search_max_concurrency: 1
run_config_search_min_model_batch_size: 5
run_config_search_max_model_batch_size: 5
num_configs_per_model: 4

profile_models:
  resnext_torch:
    constraints:
      perf_latency_p99:
        max: 20
    model_config_parameters:
      instance_group:
        - kind: KIND_GPU
          count: 1
  resnext_torch2:
    constraints:
      perf_latency_p99:
        max: 20
    model_config_parameters:
      instance_group:
        - kind: KIND_GPU
          count: 1

The log generated by the Triton server is in the attachment.

wxthu avatar Aug 24 '23 15:08 wxthu

Thank you. @matthewkotila this looks like an issue in Perf Analyzer. I don't see anything wrong with MA or the tritonserver.

nv-braf avatar Aug 24 '23 15:08 nv-braf

@wxthu @nv-braf can either of y'all do me a favor and give me minimally reproducible commands that only run PA? This will help my debugging a bunch.

matthewkotila avatar Aug 24 '23 15:08 matthewkotila

Is this what you're looking for?

Command:
mpiexec --allow-run-as-root --tag-output -n 1 perf_analyzer --enable-mpi -m resnext_torch_config_default -b 1 -u localhost:8001 -i grpc -f resnext_torch_config_default-results.csv --verbose-csv --concurrency-range 1 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000.0 : -n 1 perf_analyzer --enable-mpi -m resnext_torch2_config_default -b 1 -u localhost:8001 -i grpc -f resnext_torch2_config_default-results.csv --verbose-csv --concurrency-range 1 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000.0

Error: perf_analyzer did not produce any output. It was likely terminated with a SIGABRT
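Note that to run this outside of MA, a Triton server serving resnext_torch_config_default and resnext_torch2_config_default must already be listening on localhost:8001 (gRPC) and localhost:8002 (metrics); MA normally launches that server itself, which is likely why running the command by hand produced the "Couldn't connect to server" errors shown earlier in the thread. A rough sketch of starting one first, with the generated output model repository path as a placeholder:

tritonserver --model-repository=<output_model_repository_path>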

nv-braf avatar Aug 24 '23 16:08 nv-braf

Ooh gotcha, missed that--thanks!

We're tracking this but we do not currently have an estimate of when we will complete debugging the issue.

matthewkotila avatar Aug 24 '23 22:08 matthewkotila

Facing the same issue: Error: perf_analyzer did not produce any output. It was likely terminated with a SIGABRT

anshudaur avatar Sep 13 '23 08:09 anshudaur

Any update on this?

[Model Analyzer] perf_analyzer took very long to exit, killing perf_analyzer
[Model Analyzer] perf_analyzer did not produce any output.

riyajatar37003 avatar May 15 '24 12:05 riyajatar37003

[Model Analyzer] Initializing GPUDevice handles
[Model Analyzer] Using GPU 0 NVIDIA A100-SXM4-40GB with UUID GPU-d9a0447f-f8fa-9d2f-79fc-ecf2567dacc2
[Model Analyzer] WARNING: Overriding the output model repo path "./rerenker_output1"
[Model Analyzer] Starting a local Triton Server
[Model Analyzer] Loaded checkpoint from file /model_repositories/checkpoints/0.ckpt
[Model Analyzer] GPU devices match checkpoint - skipping server metric acquisition
[Model Analyzer] 
[Model Analyzer] Starting quick mode search to find optimal configs
[Model Analyzer] 
[Model Analyzer] Creating model config: reranker_config_default
[Model Analyzer] 
[Model Analyzer] Creating model config: bge_reranker_v2_onnx_config_default
[Model Analyzer] 
[Model Analyzer] Profiling reranker_config_default: client batch size=1, concurrency=24
[Model Analyzer] Profiling bge_reranker_v2_onnx_config_default: client batch size=1, concurrency=8
[Model Analyzer] 
[Model Analyzer] perf_analyzer took very long to exit, killing perf_analyzer
[Model Analyzer] perf_analyzer did not produce any output.
[Model Analyzer] Saved checkpoint to model_repositories/checkpoints/1.ckpt
[Model Analyzer] Creating model config: reranker_config_0
[Model Analyzer] Setting instance_group to [{'count': 1, 'kind': 'KIND_GPU'}]
[Model Analyzer] Setting max_batch_size to 1
[Model Analyzer] Enabling dynamic_batching
[Model Analyzer] 
[Model Analyzer] Creating model config: bge_reranker_v2_onnx_config_0
[Model Analyzer] Setting instance_group to [{'count': 1, 'kind': 'KIND_GPU'}]
[Model Analyzer] Setting max_batch_size to 1
[Model Analyzer] Enabling dynamic_batching
[Model Analyzer] 
[Model Analyzer] Profiling reranker_config_0: client batch size=1, concurrency=2
[Model Analyzer] Profiling bge_reranker_v2_onnx_config_0: client batch size=1, concurrency=2
[Model Analyzer] 
[Model Analyzer] perf_analyzer took very long to exit, killing perf_analyzer
[Model Analyzer] perf_analyzer did not produce any output.
[Model Analyzer] No changes made to analyzer data, no checkpoint saved.
Traceback (most recent call last):
  File "/opt/app_venv/bin/model-analyzer", line 8, in <module>
    sys.exit(main())
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/entrypoint.py", line 278, in main
    analyzer.profile(
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/analyzer.py", line 124, in profile
    self._profile_models()
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/analyzer.py", line 233, in _profile_models
    self._model_manager.run_models(models=models)
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/model_manager.py", line 145, in run_models
    self._stop_ma_if_no_valid_measurement_threshold_reached()
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/model_manager.py", line 239, in _stop_ma_if_no_valid_measurement_threshold_reached
    raise TritonModelAnalyzerException(
model_analyzer.model_analyzer_exceptions.TritonModelAnalyzerException: The first 2 attempts to acquire measurements have failed. Please examine the Tritonserver/PA error logs to determine what has gone wrong.

riyajatar37003 avatar May 15 '24 12:05 riyajatar37003

The perf_analyzer error log:

Command: mpiexec --allow-run-as-root --tag-output -n 1 perf_analyzer --enable-mpi -m reranker -b 1 -u localhost:8001 -i grpc -f results.csv --verbose-csv --concurrency-range 24 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000 : -n 1 perf_analyzer --enable-mpi -m onnx -b 1 -u localhost:8001 -i grpc -f results.csv --verbose-csv --concurrency-range 8 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000

Error: perf_analyzer did not produce any output. It was likely terminated with a SIGABRT.

riyajatar37003 avatar May 15 '24 14:05 riyajatar37003