
Model load failed: [StatusCode.RESOURCE_EXHAUSTED] to-be-sent trailing metadata size exceeds peer limit

Open · helenHlz opened this issue 2 years ago • 3 comments

I followed examples/quick-start and tried to profile a resnet libtorch model. I then got the error 'Model load failed: [StatusCode.RESOURCE_EXHAUSTED] to-be-sent trailing metadata size exceeds peer limit'.

I suspect something is wrong with the grpc tritonclient in https://github.com/triton-inference-server/client/blob/main/src/python/library/tritonclient/grpc/__init__.py.

But I'm not sure how to fix this. Please let me know if you need more information.
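If what is being exceeded is the client-side gRPC metadata cap (on the order of 8 KB by default), raising it can at least surface the full error text. A minimal sketch, assuming the localhost:8001 endpoint from the log below and a server started with `--model-control-mode=explicit`; `RepositoryModelLoad` comes from Triton's gRPC service proto:

```python
import grpc
from tritonclient.grpc import service_pb2, service_pb2_grpc

# Raw channel with a larger metadata cap, so a long error string in the
# trailing metadata is not rejected with RESOURCE_EXHAUSTED client-side.
channel = grpc.insecure_channel(
    "localhost:8001",  # triton_grpc_endpoint from the log below
    options=[("grpc.max_metadata_size", 16 * 1024 * 1024)],
)
stub = service_pb2_grpc.GRPCInferenceServiceStub(channel)

# Re-issue the failing load to see the full error text
# (model name taken from the log; requires explicit model control).
stub.RepositoryModelLoad(
    service_pb2.RepositoryModelLoadRequest(model_name="resnet50_pytorch")
)
```

The trailing metadata in this error is typically the status detail the server attaches to a failed load, so a long backend error message can exceed the peer's advertised limit.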

Here is the full log:

```
2022-07-27T14:34:16.352899264Z 14:34:16 [Model Analyzer] DEBUG: {'config_file': None, 'checkpoint_directory': '/bert-base-squad-tensorrt-1.0/checkpoints', 'monitoring_interval': 1.0, 'duration_seconds': 3, 'use_local_gpu_monitor': False, 'collect_cpu_metrics': False, 'gpus': ['all'], 'model_repository': '/fds/model/quick/', 'output_model_repository_path': '/fds/model/quick/profile', 'override_output_model_repository': True, 'client_max_retries': 50, 'client_protocol': 'grpc', 'perf_analyzer_flags': {}, 'triton_server_flags': {}, 'triton_server_environment': {}, 'objectives': {'perf_throughput': 10}, 'constraints': {}, 'profile_models': [{'model_name': 'resnet50_pytorch', 'cpu_only': False, 'objectives': {'perf_throughput': 10}, 'parameters': {'batch_sizes': [1], 'concurrency': []}}], 'batch_sizes': [1], 'concurrency': [], 'reload_model_disable': False, 'perf_analyzer_timeout': 600, 'perf_analyzer_cpu_util': 5760.0, 'perf_analyzer_path': 'perf_analyzer', 'perf_output': False, 'perf_output_path': None, 'perf_analyzer_max_auto_adjusts': 10, 'triton_launch_mode': 'local', 'triton_docker_image': 'nvcr.io/nvidia/tritonserver:21.10-py3', 'triton_http_endpoint': 'localhost:8000', 'triton_grpc_endpoint': 'localhost:8001', 'triton_metrics_url': 'http://localhost:8002/metrics', 'triton_server_path': 'tritonserver', 'triton_output_path': None, 'triton_docker_mounts': [], 'triton_docker_labels': {}, 'triton_docker_shm_size': None, 'triton_install_path': '/opt/tritonserver', 'early_exit_enable': False, 'run_config_search_max_concurrency': 1024, 'run_config_search_min_concurrency': 1, 'run_config_search_max_instance_count': 5, 'run_config_search_min_instance_count': 1, 'run_config_search_max_model_batch_size': 128, 'run_config_search_min_model_batch_size': 1, 'run_config_search_disable': False, 'run_config_profile_models_concurrently_enable': False}
2022-07-27T14:34:16.352929153Z 14:34:16 [Model Analyzer] Initializing GPUDevice handles
2022-07-27T14:34:17.435454191Z 14:34:17 [Model Analyzer] Using GPU 0 Tesla V100-PCIE-32GB with UUID GPU-c620fbbc-3252-1d4b-667d-3f21646f6b5c
2022-07-27T14:34:20.081366761Z 14:34:20 [Model Analyzer] WARNING: Overriding the output model repo path "/fds/model/quick/profile"
2022-07-27T14:34:20.524587827Z 14:34:20 [Model Analyzer] Starting a local Triton Server
2022-07-27T14:34:20.525187729Z 14:34:20 [Model Analyzer] No checkpoint file found, starting a fresh run.
2022-07-27T14:34:20.680349959Z 14:34:20 [Model Analyzer] Profiling server only metrics...
2022-07-27T14:34:20.702567928Z 14:34:20 [Model Analyzer] DEBUG: Triton Server started.
2022-07-27T14:34:28.885585400Z 14:34:28 [Model Analyzer] DEBUG: Stopped Triton Server.
2022-07-27T14:34:29.406083067Z 14:34:29 [Model Analyzer]
2022-07-27T14:34:29.406109004Z 14:34:29 [Model Analyzer] Creating model config: resnet50_pytorch_config_default
2022-07-27T14:34:29.406112445Z 14:34:29 [Model Analyzer]
2022-07-27T14:34:38.766553175Z 14:34:38 [Model Analyzer] DEBUG: Triton Server started.
2022-07-27T14:34:43.575598315Z 14:34:43 [Model Analyzer] Model resnet50_pytorch_config_default load failed: [StatusCode.RESOURCE_EXHAUSTED] to-be-sent trailing metadata size exceeds peer limit
2022-07-27T14:34:44.991707136Z 14:34:44 [Model Analyzer] DEBUG: Stopped Triton Server.
2022-07-27T14:34:45.241710681Z 14:34:45 [Model Analyzer]
2022-07-27T14:34:45.241730406Z 14:34:45 [Model Analyzer] Creating model config: resnet50_pytorch_config_0
2022-07-27T14:34:45.241741249Z 14:34:45 [Model Analyzer] Enabling dynamic_batching
2022-07-27T14:34:45.241787618Z 14:34:45 [Model Analyzer] Setting instance_group to [{'count': 1, 'kind': 'KIND_GPU'}]
2022-07-27T14:34:45.241801026Z 14:34:45 [Model Analyzer] Setting max_batch_size to 1
2022-07-27T14:34:45.241827149Z 14:34:45 [Model Analyzer]
2022-07-27T14:34:45.688863643Z 14:34:45 [Model Analyzer] DEBUG: Triton Server started.
2022-07-27T14:34:49.821318577Z 14:34:49 [Model Analyzer] Model resnet50_pytorch_config_0 load failed: [StatusCode.RESOURCE_EXHAUSTED] to-be-sent trailing metadata size exceeds peer limit
...
```

helenHlz · Jul 27 '22

Hi Helen, I'm sorry you are running into problems. I have never observed that error before. Is this happening on a regular resnet model without any unusual settings?

The error is coming from grpc: https://grpc.github.io/grpc/core/md_doc_statuscodes.html

> Some resource has been exhausted, perhaps a per-user quota, or perhaps the entire file system is out of space.

Is your hard drive full? Have you tried a different machine to see if the issue is reproducible?
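If you want a quick way to check, here is a standard-library sketch that prints free disk space:

```python
import shutil

# RESOURCE_EXHAUSTED can also mean a full filesystem; quick sanity check
total, used, free = shutil.disk_usage("/")
print(f"{free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")
```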

tgerdesnv · Jul 29 '22

> Is this happening on a regular resnet model without any unusual settings? [...] Have you tried a different machine to see if the issue is reproducible?

Yes, it's a regular resnet model. It's available at https://drive.google.com/drive/folders/1Umq7Z7l3JDBx2iJ4BRguM6NFD35jZqXF?usp=sharing and was exported with the following code:

```python
import os
import argparse

import torch
import torchvision.models as models

parser = argparse.ArgumentParser()
parser.add_argument('--output_model_repository_path', type=str)
args = parser.parse_args()

# Make sure the parent directory of the output path exists
output_engine_dirname = args.output_model_repository_path[: -len(args.output_model_repository_path.split("/")[-1])]
if not os.path.exists(output_engine_dirname):
    os.makedirs(output_engine_dirname)

# Trace a pretrained ResNet-50 and save it as TorchScript
resnet50 = models.resnet50(pretrained=True)
resnet50.eval()
image = torch.randn(1, 3, 244, 244)
resnet50_traced = torch.jit.trace(resnet50, image)
resnet50(image)
resnet50_traced.save(args.output_model_repository_path)
```
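For completeness, here is a sketch of the model repository layout Triton's libtorch backend expects around that traced file. The tensor names (`input__0`/`output__0`) follow the PyTorch backend's `<name>__<index>` convention and, like the repository path, are assumptions here rather than the exact files used:

```python
import os

repo = "/fds/model/quick"  # model_repository from the log (assumed)
model_dir = os.path.join(repo, "resnet50_pytorch")
os.makedirs(os.path.join(model_dir, "1"), exist_ok=True)

# Minimal config for a libtorch model; dims exclude the batch dimension
# because max_batch_size is set.
config_pbtxt = """\
name: "resnet50_pytorch"
platform: "pytorch_libtorch"
max_batch_size: 1
input [
  { name: "input__0", data_type: TYPE_FP32, dims: [ 3, 244, 244 ] }
]
output [
  { name: "output__0", data_type: TYPE_FP32, dims: [ 1000 ] }
]
"""
with open(os.path.join(model_dir, "config.pbtxt"), "w") as f:
    f.write(config_pbtxt)
# The traced file from the script above goes to <model_dir>/1/model.pt
```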

I allocated 16 GB of memory and one V100-32GB GPU for this task. From the Model Analyzer log, can you tell whether the error occurred on the Triton server or in the Triton client?
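One way to narrow that down is to issue the load directly with the gRPC client, outside Model Analyzer. A sketch, assuming tritonserver was started with `--model-control-mode=explicit`:

```python
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
# Raises InferenceServerException on failure; the server-side log will
# show a backend error if the load itself (not the transport) is failing.
client.load_model("resnet50_pytorch")
print(client.is_model_ready("resnet50_pytorch"))
```

If this direct load hits the same RESOURCE_EXHAUSTED, the client-server transport is implicated; a clean load would point back at Model Analyzer's setup.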

helenHlz · Aug 01 '22

I ran the model locally and the profile was able to complete.
I am inclined to agree with Tim at this point: I think the local machine this was run on is nearly full. Can you try to run this on another machine to see if it is repeatable?

debermudez · Aug 02 '22

Closing due to inactivity. Please re-open if this is still happening and you have more information.

debermudez · Aug 19 '22