ml-commons
[BUG] Intermittent Model Deployment Issues in AWS OpenSearch: "Model not ready yet" Error
Describe the bug
Issue Description:
After successfully deploying a machine learning model in AWS OpenSearch via `POST {{opensearch_host}}/_plugins/_ml/models/WgtRZY0BjrbwPvlSkd2A/_deploy`, the model initially shows as deployed when checked with `GET {{opensearch_host}}/_plugins/_ml/profile/models`:
```json
{
  "nodes": {
    "eU2R8fKxSMazHB6HlFbpNg": {
      "models": {
        "WgtRZY0BjrbwPvlSkd2A": {
          "model_state": "DEPLOYED",
          "predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@405c0c42",
          "target_worker_nodes": [
            "GlBdA3joT4ebA2Wj5yd_8Q",
            "eU2R8fKxSMazHB6HlFbpNg"
          ],
          "worker_nodes": [
            "GlBdA3joT4ebA2Wj5yd_8Q",
            "eU2R8fKxSMazHB6HlFbpNg"
          ]
        }
      }
    },
    "GlBdA3joT4ebA2Wj5yd_8Q": {
      "models": {
        "WgtRZY0BjrbwPvlSkd2A": {
          "model_state": "DEPLOYED",
          "predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@85b1387",
          "target_worker_nodes": [
            "GlBdA3joT4ebA2Wj5yd_8Q",
            "eU2R8fKxSMazHB6HlFbpNg"
          ],
          "worker_nodes": [
            "GlBdA3joT4ebA2Wj5yd_8Q",
            "eU2R8fKxSMazHB6HlFbpNg"
          ]
        }
      }
    }
  }
}
```
However, after a certain period, possibly following a cluster restart, the model becomes unavailable and requests fail with `RequestError(400, 'illegal_argument_exception', 'Model not ready yet. Please run this first: POST /_plugins/_ml/models/WgtRZY0BjrbwPvlSkd2A/_deploy')`. The model profile then returns empty results, indicating the model is no longer deployed.
Related component
Plugins
To Reproduce
1. Deploy the model: `POST {{opensearch_host}}/_plugins/_ml/models/WgtRZY0BjrbwPvlSkd2A/_deploy`.
2. Verify successful deployment with `GET {{opensearch_host}}/_plugins/_ml/profile/models`, which shows the model state as `"DEPLOYED"`.
3. Wait for some time, or until after a cluster restart.
4. Attempt to use the model and observe the error `RequestError(400, 'illegal_argument_exception', 'Model not ready yet...')`.
5. Check the model profile again with `GET {{opensearch_host}}/_plugins/_ml/profile/models`, which now returns empty results for the model.
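The verification step can be scripted. A minimal sketch (Python stdlib only; the host is a placeholder for an unauthenticated local cluster, and `model_is_deployed` simply parses the profile response shown above):

```python
import json
import urllib.request

OPENSEARCH_HOST = "http://localhost:9200"  # assumption: local dev cluster, no auth
MODEL_ID = "WgtRZY0BjrbwPvlSkd2A"

def model_is_deployed(profile: dict, model_id: str) -> bool:
    """True only if at least one node reports the model and all of them say DEPLOYED."""
    states = [
        node["models"][model_id].get("model_state")
        for node in profile.get("nodes", {}).values()
        if model_id in node.get("models", {})
    ]
    return bool(states) and all(s == "DEPLOYED" for s in states)

def fetch_profile(host: str = OPENSEARCH_HOST) -> dict:
    """GET /_plugins/_ml/profile/models; add auth headers for a managed cluster."""
    with urllib.request.urlopen(f"{host}/_plugins/_ml/profile/models") as resp:
        return json.loads(resp.read())
```

Note that the empty profile seen after the restart makes `model_is_deployed` return `False`, which matches the failure mode described in step 5.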
Expected behavior
Once a model is deployed, it should remain available for use without needing re-deployment, especially after cluster restarts. The model's status should consistently reflect its actual state, and any errors related to deployment should be clearly explained or resolved automatically by the system.
Additional Details
Plugins
"plugins": {
"ml_commons": {
"only_run_on_ml_node": "false",
"trusted_connector_endpoints_regex": [
"^https://.*\\.openai\\.azure\\.com/.*$"
],
"model_access_control_enabled": "true",
"native_memory_threshold": "99"
},
"index_state_management": {
"template_migration": {
"control": "-1"
}
},
"query": {
"executionengine": {}
}
},
Host/Environment (please complete the following information):
- OS: Ubuntu Linux
- Version: latest
Additional context This issue seems to occur intermittently and might be linked to cluster restarts or other changes in the AWS OpenSearch environment. It impacts the reliability of machine learning model deployments in production settings.
I'll move this to the ml plugin. Did you get a chance to open a ticket with AWS yet?
@dblock, thanks. Regarding your question about opening a ticket with AWS, I haven't done so yet. Before I proceed, I wanted to mention that this issue is not exclusive to AWS OpenSearch: I have encountered the same problem on a local OpenSearch instance running in Docker, especially after restarting the container. This suggests the issue is inherent to the OpenSearch ML plugin rather than specific to the AWS environment.
@ulan-yisaev , Thanks for reporting this issue.
> However, after a certain period, possibly following cluster restarts, the model becomes unavailable
Have you enabled the auto redeploy setting? https://opensearch.org/docs/latest/ml-commons-plugin/cluster-settings/#enable-auto-redeploy
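For reference, that setting is applied via a cluster settings update. A hedged sketch (Python stdlib only; the setting names are taken from the linked docs page, and the host is a placeholder):

```python
import json
import urllib.request

def auto_redeploy_body(retries: int = 3) -> dict:
    """Persistent cluster settings enabling ML model auto redeploy (per the linked docs)."""
    return {
        "persistent": {
            "plugins.ml_commons.model_auto_redeploy.enable": True,
            "plugins.ml_commons.model_auto_redeploy.lifetime_retry_times": retries,
        }
    }

def enable_auto_redeploy(host: str = "http://localhost:9200") -> None:
    """PUT /_cluster/settings; add auth headers for a managed cluster."""
    req = urllib.request.Request(
        f"{host}/_cluster/settings",
        data=json.dumps(auto_redeploy_body()).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req)
```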
There is one known issue where model auto redeploy doesn't work after a cluster restart; it was fixed in https://github.com/opensearch-project/ml-commons/pull/1627, and the fix will ship in 2.12.
Hello @ylwu-amzn!
Thank you for your response and the information provided.
I have indeed enabled the auto redeploy setting. However, I can confirm that after some time it does not resolve the issue: the models disappeared from the deployment again, and I had to redeploy them manually. It appears the auto redeploy feature is not functioning as expected, at least in my setup.
I appreciate the reference to PR #1627 and look forward to the fix in version 2.12. In the meantime, if there are any other suggestions or workarounds to mitigate this issue, I would be grateful to hear them.
One workaround is to use a Lambda function that regularly checks the model status and redeploys the model if it is undeployed. The deploy API is idempotent, so you can also simply run it on a schedule.
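Because the deploy API is idempotent, the scheduled workaround reduces to calling `_deploy` on a timer. A sketch of the core of such a Lambda/cron job (Python stdlib only; host, model ID, and interval are assumptions, not values from this issue's environment):

```python
import json
import time
import urllib.request

def deploy_url(host: str, model_id: str) -> str:
    """URL of the idempotent deploy endpoint for one model."""
    return f"{host}/_plugins/_ml/models/{model_id}/_deploy"

def redeploy(host: str, model_id: str) -> dict:
    """POST the deploy request; safe to repeat because the API is idempotent."""
    req = urllib.request.Request(deploy_url(host, model_id), method="POST")
    with urllib.request.urlopen(req) as resp:  # add auth for a managed cluster
        return json.loads(resp.read())

def run_forever(host: str, model_id: str, interval_s: int = 300) -> None:
    """Cron-style loop; in a Lambda you would instead call redeploy() once per invocation."""
    while True:
        redeploy(host, model_id)
        time.sleep(interval_s)
```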
We built a sample lambda function. You can try this. Follow the readme in the zip. auto-redeploy-model-lambda-function.zip
Thank you for the workaround suggestion and for providing the sample Lambda function. I appreciate your support and will try implementing this logic in my environment. Additionally, I'm considering adding exception handling to automatically redeploy the model when the above exception occurs.
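That exception-handling idea could look like a retry wrapper: detect the "Model not ready yet" message, fire the idempotent `_deploy` call, and retry the prediction. A sketch under stated assumptions (the `predict`/`deploy` callables stand in for client calls and are illustrative, not a specific ml-commons API):

```python
import time

def is_model_not_ready(error_message: str) -> bool:
    """Match the exact error text reported in this issue."""
    return "Model not ready yet" in error_message

def predict_with_redeploy(predict, deploy, retries: int = 1, wait_s: float = 2.0):
    """Call predict(); on a 'Model not ready yet' failure, deploy() and retry."""
    for attempt in range(retries + 1):
        try:
            return predict()
        except Exception as exc:  # opensearch-py raises RequestError here
            if attempt == retries or not is_model_not_ready(str(exc)):
                raise
            deploy()            # idempotent _deploy call
            time.sleep(wait_s)  # give the cluster a moment to load the model
```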
@Zhangxunmt has a proposal, https://github.com/opensearch-project/ml-commons/issues/1148, to automatically deploy a model when a predict request arrives and the model is not deployed; he has a PoC ready.
I think that approach could solve your problem, but feel free to come up with different options and discuss.
@ulan-yisaev I see you used a remote model in this case, so I think the auto-deploy-with-TTL strategy will work for you. Auto-deploy of local/custom models is not in scope yet, since that would block the predict call and add more confusion; in short, manual deployment of local/custom models is still required, but remote models will be covered by the auto-deploy feature with TTL.
@ulan-yisaev Remote model auto-deploy was released in 2.13, and auto model undeploy with TTL was released in 2.14, so this issue should no longer occur. Closing now; feel free to reopen.