clearml-serving
serving stuck because of deleted model
My ClearML Serving deployment is stuck. No models are registered:
clearml-serving --id 7303713271b941f7a0b45760d45208dd model list
clearml-serving - CLI for launching ClearML serving engine
List model serving and endpoints, control task id=7303713271b941f7a0b45760d45208dd
Info: syncing model endpoint configuration, state hash=d3290336c62c7fb0bc8eb4046b60bc7f
Endpoints:
{}
Model Monitoring:
{}
Canary:
{}
However, old models are somehow still listed on the serving task:
serving-task:
There is a leftover model that I am unable to remove:
Triton-Task:
2023-11-20 16:18:40
ClearML Task: created new task id=9b3460b62f9d4015890c7dd2c0064bcf
2023-11-20 15:18:40,452 - clearml.Task - INFO - No repository found, storing script code instead
ClearML results page: http://clearml-webserver:8080/projects/9b4bbac7f1c248e894793f5771005826/experiments/9b3460b62f9d4015890c7dd2c0064bcf/output/log
2023-11-20 16:18:40
configuration args: Namespace(inference_task_id=None, metric_frequency=1.0, name='triton engine', project=None, serving_id='7303713271b941f7a0b45760d45208dd', t_allow_grpc=None, t_buffer_manager_thread_count=None, t_cuda_memory_pool_byte_size=None, t_grpc_infer_allocation_pool_size=None, t_grpc_port=None, t_http_port=None, t_http_thread_count=None, t_log_verbose=None, t_min_supported_compute_capability=None, t_pinned_memory_pool_byte_size=None, update_frequency=1.0)
String Triton Helper service
{'serving_id': '7303713271b941f7a0b45760d45208dd', 'project': None, 'name': 'triton engine', 'update_frequency': 1.0, 'metric_frequency': 1.0, 'inference_task_id': None, 't_http_port': None, 't_http_thread_count': None, 't_allow_grpc': None, 't_grpc_port': None, 't_grpc_infer_allocation_pool_size': None, 't_pinned_memory_pool_byte_size': None, 't_cuda_memory_pool_byte_size': None, 't_min_supported_compute_capability': None, 't_buffer_manager_thread_count': None, 't_log_verbose': None}
Updating local model folder: /models
2023-11-20 15:18:41,106 - clearml.Model - ERROR - Action failed <400/201: models.get_by_id/v1.0 (Invalid model id (no such public or company model): id=0bbba86c98c54610a14350ba69e2e330, company=d1bd92a3b039400cbafc60a7a5b1e52b)> (model=0bbba86c98c54610a14350ba69e2e330)
2023-11-20 15:18:41,107 - clearml.Model - ERROR - Failed reloading task 0bbba86c98c54610a14350ba69e2e330
2023-11-20 15:18:41,115 - clearml.Model - ERROR - Action failed <400/201: models.get_by_id/v1.0 (Invalid model id (no such public or company model): id=0bbba86c98c54610a14350ba69e2e330, company=d1bd92a3b039400cbafc60a7a5b1e52b)> (model=0bbba86c98c54610a14350ba69e2e330)
2023-11-20 15:18:41,115 - clearml.Model - ERROR - Failed reloading task 0bbba86c98c54610a14350ba69e2e330
2023-11-20 16:18:41
Traceback (most recent call last):
File "clearml_serving/engines/triton/triton_helper.py", line 540, in <module>
main()
File "clearml_serving/engines/triton/triton_helper.py", line 532, in main
helper.maintenance_daemon(
File "clearml_serving/engines/triton/triton_helper.py", line 237, in maintenance_daemon
self.model_service_update_step(model_repository_folder=local_model_repo, verbose=True)
File "clearml_serving/engines/triton/triton_helper.py", line 146, in model_service_update_step
print("Error retrieving model ID {} []".format(model_id, model.url if model else ''))
File "/usr/local/lib/python3.8/dist-packages/clearml/model.py", line 341, in url
return self._get_base_model().uri
File "/usr/local/lib/python3.8/dist-packages/clearml/backend_interface/model.py", line 496, in uri
return self.data.uri
AttributeError: 'NoneType' object has no attribute 'uri'
How can this broken serving task be fixed without deploying a new serving instance?
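One possible direction (a sketch, not a confirmed fix): the Triton helper reads the model list from the serving task's stored configuration, so the stale reference to the deleted model id `0bbba86c98c54610a14350ba69e2e330` presumably lives there. The clearml SDK exposes `Task.get_task`, `task.get_configuration_object(name)`, and `task.set_configuration_object(name=..., config_text=...)`, which could be used to pull the state, strip the dead entry, and write it back. The helper below is hypothetical and assumes the stale id appears as plain text in the configuration; the configuration object name `"endpoints"` in the commented SDK section is also an assumption, so check what your serving task actually stores under CONFIGURATION in the web UI before writing anything back.

```python
STALE_MODEL_ID = "0bbba86c98c54610a14350ba69e2e330"  # deleted model id from the Triton log

def strip_stale_model(config_text: str, model_id: str) -> str:
    """Drop any configuration line that references the deleted model id.

    Hypothetical helper -- adapt the matching logic to the actual structure
    of your serving task's configuration object before using it.
    """
    kept = [line for line in config_text.splitlines() if model_id not in line]
    return "\n".join(kept)

# With the clearml SDK (network access and credentials required),
# the edit could then look like this -- the "endpoints" object name
# is an assumption, not a documented key:
#
#   from clearml import Task
#   task = Task.get_task(task_id="7303713271b941f7a0b45760d45208dd")
#   cfg = task.get_configuration_object("endpoints")
#   if cfg and STALE_MODEL_ID in cfg:
#       task.set_configuration_object(
#           name="endpoints",
#           config_text=strip_stale_model(cfg, STALE_MODEL_ID),
#       )
```

After patching the configuration, restarting the serving and Triton containers should make the helper re-sync from the cleaned state instead of failing on `models.get_by_id`.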