
Model load failed: [StatusCode.RESOURCE_EXHAUSTED] to-be-sent trailing metadata size exceeds peer limit

Open · helenHlz opened this issue 2 years ago • 3 comments

I followed examples/quick-start and tried to profile a resnet libtorch model. I then got the error 'Model load failed: [StatusCode.RESOURCE_EXHAUSTED] to-be-sent trailing metadata size exceeds peer limit'.

I suspect something is wrong with the grpc tritonclient in https://github.com/triton-inference-server/client/blob/main/src/python/library/tritonclient/grpc/__init__.py.

But I'm not sure how to fix this. Please let me know if you need more information.
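If what is being exceeded is the client-side gRPC metadata cap (on the order of 8 KB by default), raising it can at least surface the full error text. A minimal sketch, assuming the localhost:8001 endpoint from the log below and a server started with `--model-control-mode=explicit`; `RepositoryModelLoad` comes from Triton's gRPC service proto:

```python
import grpc
from tritonclient.grpc import service_pb2, service_pb2_grpc

# Raw channel with a larger metadata cap, so a long error string in the
# trailing metadata is not rejected with RESOURCE_EXHAUSTED client-side.
channel = grpc.insecure_channel(
    "localhost:8001",  # triton_grpc_endpoint from the log below
    options=[("grpc.max_metadata_size", 16 * 1024 * 1024)],
)
stub = service_pb2_grpc.GRPCInferenceServiceStub(channel)

# Re-issue the failing load to see the full error text
# (model name taken from the log; requires explicit model control).
stub.RepositoryModelLoad(
    service_pb2.RepositoryModelLoadRequest(model_name="resnet50_pytorch")
)
```

The trailing metadata in this error is typically the status detail the server attaches to a failed load, so a long backend error message can exceed the peer's advertised limit.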

Here is the full log:

```
2022-07-27T14:34:16.352899264Z 14:34:16 [Model Analyzer] DEBUG: {'config_file': None, 'checkpoint_directory': '/bert-base-squad-tensorrt-1.0/checkpoints', 'monitoring_interval': 1.0, 'duration_seconds': 3, 'use_local_gpu_monitor': False, 'collect_cpu_metrics': False, 'gpus': ['all'], 'model_repository': '/fds/model/quick/', 'output_model_repository_path': '/fds/model/quick/profile', 'override_output_model_repository': True, 'client_max_retries': 50, 'client_protocol': 'grpc', 'perf_analyzer_flags': {}, 'triton_server_flags': {}, 'triton_server_environment': {}, 'objectives': {'perf_throughput': 10}, 'constraints': {}, 'profile_models': [{'model_name': 'resnet50_pytorch', 'cpu_only': False, 'objectives': {'perf_throughput': 10}, 'parameters': {'batch_sizes': [1], 'concurrency': []}}], 'batch_sizes': [1], 'concurrency': [], 'reload_model_disable': False, 'perf_analyzer_timeout': 600, 'perf_analyzer_cpu_util': 5760.0, 'perf_analyzer_path': 'perf_analyzer', 'perf_output': False, 'perf_output_path': None, 'perf_analyzer_max_auto_adjusts': 10, 'triton_launch_mode': 'local', 'triton_docker_image': 'nvcr.io/nvidia/tritonserver:21.10-py3', 'triton_http_endpoint': 'localhost:8000', 'triton_grpc_endpoint': 'localhost:8001', 'triton_metrics_url': 'http://localhost:8002/metrics', 'triton_server_path': 'tritonserver', 'triton_output_path': None, 'triton_docker_mounts': [], 'triton_docker_labels': {}, 'triton_docker_shm_size': None, 'triton_install_path': '/opt/tritonserver', 'early_exit_enable': False, 'run_config_search_max_concurrency': 1024, 'run_config_search_min_concurrency': 1, 'run_config_search_max_instance_count': 5, 'run_config_search_min_instance_count': 1, 'run_config_search_max_model_batch_size': 128, 'run_config_search_min_model_batch_size': 1, 'run_config_search_disable': False, 'run_config_profile_models_concurrently_enable': False}
2022-07-27T14:34:16.352929153Z 14:34:16 [Model Analyzer] Initializing GPUDevice handles
2022-07-27T14:34:17.435454191Z 14:34:17 [Model Analyzer] Using GPU 0 Tesla V100-PCIE-32GB with UUID GPU-c620fbbc-3252-1d4b-667d-3f21646f6b5c
2022-07-27T14:34:20.081366761Z 14:34:20 [Model Analyzer] WARNING: Overriding the output model repo path "/fds/model/quick/profile"
2022-07-27T14:34:20.524587827Z 14:34:20 [Model Analyzer] Starting a local Triton Server
2022-07-27T14:34:20.525187729Z 14:34:20 [Model Analyzer] No checkpoint file found, starting a fresh run.
2022-07-27T14:34:20.680349959Z 14:34:20 [Model Analyzer] Profiling server only metrics...
2022-07-27T14:34:20.702567928Z 14:34:20 [Model Analyzer] DEBUG: Triton Server started.
2022-07-27T14:34:28.885585400Z 14:34:28 [Model Analyzer] DEBUG: Stopped Triton Server.
2022-07-27T14:34:29.406083067Z 14:34:29 [Model Analyzer]
2022-07-27T14:34:29.406109004Z 14:34:29 [Model Analyzer] Creating model config: resnet50_pytorch_config_default
2022-07-27T14:34:29.406112445Z 14:34:29 [Model Analyzer]
2022-07-27T14:34:38.766553175Z 14:34:38 [Model Analyzer] DEBUG: Triton Server started.
2022-07-27T14:34:43.575598315Z 14:34:43 [Model Analyzer] Model resnet50_pytorch_config_default load failed: [StatusCode.RESOURCE_EXHAUSTED] to-be-sent trailing metadata size exceeds peer limit
2022-07-27T14:34:44.991707136Z 14:34:44 [Model Analyzer] DEBUG: Stopped Triton Server.
2022-07-27T14:34:45.241710681Z 14:34:45 [Model Analyzer]
2022-07-27T14:34:45.241730406Z 14:34:45 [Model Analyzer] Creating model config: resnet50_pytorch_config_0
2022-07-27T14:34:45.241741249Z 14:34:45 [Model Analyzer] Enabling dynamic_batching
2022-07-27T14:34:45.241787618Z 14:34:45 [Model Analyzer] Setting instance_group to [{'count': 1, 'kind': 'KIND_GPU'}]
2022-07-27T14:34:45.241801026Z 14:34:45 [Model Analyzer] Setting max_batch_size to 1
2022-07-27T14:34:45.241827149Z 14:34:45 [Model Analyzer]
2022-07-27T14:34:45.688863643Z 14:34:45 [Model Analyzer] DEBUG: Triton Server started.
2022-07-27T14:34:49.821318577Z 14:34:49 [Model Analyzer] Model resnet50_pytorch_config_0 load failed: [StatusCode.RESOURCE_EXHAUSTED] to-be-sent trailing metadata size exceeds peer limit
...
```

helenHlz · Jul 27 '22

Hi Helen, I'm sorry you are running into problems. I have never observed that error before. Is this happening on a regular resnet model without any unusual settings?

The error is coming from grpc: https://grpc.github.io/grpc/core/md_doc_statuscodes.html

> Some resource has been exhausted, perhaps a per-user quota, or perhaps the entire file system is out of space.

Is your hard drive full? Have you tried a different machine to see if the issue is reproducible?
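If you want a quick way to check, here is a standard-library sketch that prints free disk space:

```python
import shutil

# RESOURCE_EXHAUSTED can also mean a full filesystem; quick sanity check
total, used, free = shutil.disk_usage("/")
print(f"{free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")
```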

tgerdesnv · Jul 29 '22

> Is this happening on a regular resnet model without any unusual settings? [...] Have you tried a different machine to see if the issue is reproducible?

Yes, it's a regular resnet model. It's available at https://drive.google.com/drive/folders/1Umq7Z7l3JDBx2iJ4BRguM6NFD35jZqXF?usp=sharing and was exported with the following code:

```python
import os
import argparse

import torch
import torchvision.models as models

parser = argparse.ArgumentParser()
parser.add_argument('--output_model_repository_path', type=str)
args = parser.parse_args()

# Make sure the parent directory of the output path exists
output_engine_dirname = args.output_model_repository_path[: -len(args.output_model_repository_path.split("/")[-1])]
if not os.path.exists(output_engine_dirname):
    os.makedirs(output_engine_dirname)

# Trace a pretrained ResNet-50 and save it as TorchScript
resnet50 = models.resnet50(pretrained=True)
resnet50.eval()
image = torch.randn(1, 3, 244, 244)
resnet50_traced = torch.jit.trace(resnet50, image)
resnet50(image)
resnet50_traced.save(args.output_model_repository_path)
```
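For completeness, here is a sketch of the model repository layout Triton's libtorch backend expects around that traced file. The tensor names (`input__0`/`output__0`) follow the PyTorch backend's `<name>__<index>` convention and, like the repository path, are assumptions here rather than the exact files used:

```python
import os

repo = "/fds/model/quick"  # model_repository from the log (assumed)
model_dir = os.path.join(repo, "resnet50_pytorch")
os.makedirs(os.path.join(model_dir, "1"), exist_ok=True)

# Minimal config for a libtorch model; dims exclude the batch dimension
# because max_batch_size is set.
config_pbtxt = """\
name: "resnet50_pytorch"
platform: "pytorch_libtorch"
max_batch_size: 1
input [
  { name: "input__0", data_type: TYPE_FP32, dims: [ 3, 244, 244 ] }
]
output [
  { name: "output__0", data_type: TYPE_FP32, dims: [ 1000 ] }
]
"""
with open(os.path.join(model_dir, "config.pbtxt"), "w") as f:
    f.write(config_pbtxt)
# The traced file from the script above goes to <model_dir>/1/model.pt
```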

I allocated 16 GB of memory and one V100-32GB GPU for this task. From the Model Analyzer log, can you tell whether the error occurred on the Triton server or in the Triton client?
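One way to narrow that down is to issue the load directly with the gRPC client, outside Model Analyzer. A sketch, assuming tritonserver was started with `--model-control-mode=explicit`:

```python
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
# Raises InferenceServerException on failure; the server-side log will
# show a backend error if the load itself (not the transport) is failing.
client.load_model("resnet50_pytorch")
print(client.is_model_ready("resnet50_pytorch"))
```

If this direct load hits the same RESOURCE_EXHAUSTED, the client-server transport is implicated; a clean load would point back at Model Analyzer's setup.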

helenHlz · Aug 01 '22

I ran the model locally and the profile was able to complete.
I am inclined to agree with Tim at this point: I think the local machine this was run on is nearly full. Can you try to run this on another machine to see if it is repeatable?

debermudez · Aug 02 '22

Closing due to inactivity. Please re-open if this is still happening and you have more information.

debermudez · Aug 19 '22