
[BUG] Intermittent Model Deployment Issues in AWS OpenSearch: "Model not ready yet" Error

Open ulan-yisaev opened this issue 1 year ago • 8 comments

Describe the bug

Issue Description: After successfully deploying a machine learning model in AWS OpenSearch using the POST request to /_plugins/_ml/models/WgtRZY0BjrbwPvlSkd2A/_deploy, the model initially shows as deployed when checked with GET {{opensearch_host}}/_plugins/_ml/profile/models:

{
    "nodes": {
        "eU2R8fKxSMazHB6HlFbpNg": {
            "models": {
                "WgtRZY0BjrbwPvlSkd2A": {
                    "model_state": "DEPLOYED",
                    "predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@405c0c42",
                    "target_worker_nodes": [
                        "GlBdA3joT4ebA2Wj5yd_8Q",
                        "eU2R8fKxSMazHB6HlFbpNg"
                    ],
                    "worker_nodes": [
                        "GlBdA3joT4ebA2Wj5yd_8Q",
                        "eU2R8fKxSMazHB6HlFbpNg"
                    ]
                }
            }
        },
        "GlBdA3joT4ebA2Wj5yd_8Q": {
            "models": {
                "WgtRZY0BjrbwPvlSkd2A": {
                    "model_state": "DEPLOYED",
                    "predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@85b1387",
                    "target_worker_nodes": [
                        "GlBdA3joT4ebA2Wj5yd_8Q",
                        "eU2R8fKxSMazHB6HlFbpNg"
                    ],
                    "worker_nodes": [
                        "GlBdA3joT4ebA2Wj5yd_8Q",
                        "eU2R8fKxSMazHB6HlFbpNg"
                    ]
                }
            }
        }
    }
}

However, after a certain period, possibly following cluster restarts, the model becomes unavailable, leading to an error: RequestError(400, 'illegal_argument_exception', 'Model not ready yet. Please run this first: POST /_plugins/_ml/models/WgtRZY0BjrbwPvlSkd2A/_deploy'). Additionally, the model profile then shows empty results, indicating the model is no longer deployed.

Related component

Plugins

To Reproduce

  1. Deploy the model using the POST request: POST {{opensearch_host}}/_plugins/_ml/models/WgtRZY0BjrbwPvlSkd2A/_deploy.
  2. Verify successful deployment with GET {{opensearch_host}}/_plugins/_ml/profile/models, which shows the model state as "DEPLOYED".
  3. Wait for some time or until after a potential cluster restart.
  4. Observe the error RequestError(400, 'illegal_argument_exception', 'Model not ready yet...') when attempting to use the model.
  5. Check the model profile again with GET {{opensearch_host}}/_plugins/_ml/profile/models, which now returns empty results for the model.

Expected behavior

Once a model is deployed, it should remain available for use without needing re-deployment, especially after cluster restarts. The model's status should consistently reflect its actual state, and any errors related to deployment should be clearly explained or resolved automatically by the system.

Additional Details

Plugins

"plugins": {
            "ml_commons": {
                "only_run_on_ml_node": "false",
                "trusted_connector_endpoints_regex": [
                    "^https://.*\\.openai\\.azure\\.com/.*$"
                ],
                "model_access_control_enabled": "true",
                "native_memory_threshold": "99"
            },
            "index_state_management": {
                "template_migration": {
                    "control": "-1"
                }
            },
            "query": {
                "executionengine": {}
            }
        },


Host/Environment (please complete the following information):

  • OS: Ubuntu Linux
  • Version: latest

Additional context This issue seems to occur intermittently and might be linked to cluster restarts or other changes in the AWS OpenSearch environment. It impacts the reliability of machine learning model deployments in production settings.

ulan-yisaev avatar Feb 07 '24 20:02 ulan-yisaev

I'll move this to the ml plugin. Did you get a chance to open a ticket with AWS yet?

dblock avatar Feb 07 '24 21:02 dblock

@dblock, thanks. Regarding your question about opening a ticket with AWS, I haven't done so yet. Before I proceed with that, I wanted to mention that this issue is not exclusive to AWS OpenSearch. I have encountered the same problem on my local OpenSearch instance running in Docker, especially after restarting the container. This similarity suggests that the issue might be inherent to the OpenSearch ML plugin rather than specific to the AWS environment.

ulan-yisaev avatar Feb 08 '24 09:02 ulan-yisaev

@ulan-yisaev, thanks for reporting this issue.

However, after a certain period, possibly following cluster restarts, the model becomes unavailable

Have you enabled the auto-redeploy setting? https://opensearch.org/docs/latest/ml-commons-plugin/cluster-settings/#enable-auto-redeploy
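For reference, auto-redeploy is controlled by a cluster setting. A minimal sketch of enabling it (please verify the exact setting names against the linked cluster-settings page):

```json
PUT _cluster/settings
{
    "persistent": {
        "plugins.ml_commons.model_auto_redeploy.enable": true,
        "plugins.ml_commons.model_auto_redeploy.lifetime_retry_times": 3
    }
}
```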

There is a known issue where model auto-redeploy doesn't work after a cluster restart. It was fixed in this PR: https://github.com/opensearch-project/ml-commons/pull/1627. The fix will be in 2.12.

ylwu-amzn avatar Feb 08 '24 18:02 ylwu-amzn

Hello @ylwu-amzn!

Thank you for your response and the information provided.

I have indeed enabled the auto redeploy setting. However, after some time, I can confirm that this setting does not seem to resolve the issue. The models again disappeared from the deployment, and I had to redeploy them manually. It appears that the auto redeploy feature is not functioning as expected, at least in my setup.

I appreciate the reference to PR #1627 and look forward to the fix in version 2.12. In the meantime, if there are any other suggestions or workarounds to mitigate this issue, I would be grateful to hear them.

ulan-yisaev avatar Feb 09 '24 20:02 ulan-yisaev

One workaround is to use a Lambda function that regularly checks the model status and redeploys the model if it has been undeployed. The deploy API is idempotent, so you can also simply run it on a schedule.
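The check-and-redeploy logic can be sketched as below. This is an illustrative standalone script, not the sample Lambda function from the zip; the host and model ID are placeholders, and authentication is omitted.

```python
# Sketch: poll the ML profile API and trigger a redeploy if the model
# has dropped out of the profile. Host/model ID are assumptions.
import json
import urllib.request

OPENSEARCH_HOST = "http://localhost:9200"   # placeholder host
MODEL_ID = "WgtRZY0BjrbwPvlSkd2A"           # placeholder model ID

def is_deployed(profile: dict, model_id: str) -> bool:
    """Return True if any node in the profile reports the model as DEPLOYED."""
    for node in profile.get("nodes", {}).values():
        state = node.get("models", {}).get(model_id, {}).get("model_state")
        if state == "DEPLOYED":
            return True
    return False

def ensure_deployed() -> None:
    """Fetch the model profile and redeploy if the model is missing/undeployed."""
    with urllib.request.urlopen(f"{OPENSEARCH_HOST}/_plugins/_ml/profile/models") as resp:
        profile = json.load(resp)
    if not is_deployed(profile, MODEL_ID):
        # Deploy is idempotent per the comment above, so re-issuing is safe.
        req = urllib.request.Request(
            f"{OPENSEARCH_HOST}/_plugins/_ml/models/{MODEL_ID}/_deploy",
            method="POST",
        )
        urllib.request.urlopen(req)
```

Run `ensure_deployed()` from cron, a scheduler, or a Lambda timer.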

We built a sample Lambda function you can try; follow the README in the zip: auto-redeploy-model-lambda-function.zip

ylwu-amzn avatar Feb 09 '24 20:02 ylwu-amzn

Thank you for the workaround suggestion and for providing the sample Lambda function. I appreciate your support. I will try implementing this logic in my environment. Additionally, I'm considering adding exception handling to automatically redeploy the model when the above exception occurs.
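That exception-handling idea might look like the sketch below: catch the "Model not ready yet" error on predict, issue a redeploy, and retry once. The function name and wait time are hypothetical; `client` is assumed to be an opensearch-py client whose low-level `transport.perform_request` is used for the ML plugin endpoints.

```python
# Sketch: retry a predict call once after redeploying, when the
# "Model not ready yet" error is raised. Names are illustrative.
import time

def predict_with_redeploy(client, model_id, body, wait_seconds=5):
    """Call the ML predict endpoint; on 'Model not ready yet', redeploy and retry."""
    try:
        return client.transport.perform_request(
            "POST", f"/_plugins/_ml/models/{model_id}/_predict", body=body
        )
    except Exception as exc:
        if "Model not ready yet" not in str(exc):
            raise  # unrelated failure: propagate unchanged
        # Deploy is idempotent, so re-issuing it here is safe.
        client.transport.perform_request(
            "POST", f"/_plugins/_ml/models/{model_id}/_deploy"
        )
        time.sleep(wait_seconds)  # crude wait for the deploy task to settle
        return client.transport.perform_request(
            "POST", f"/_plugins/_ml/models/{model_id}/_predict", body=body
        )
```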

ulan-yisaev avatar Feb 12 '24 09:02 ulan-yisaev

@Zhangxunmt has a proposal (https://github.com/opensearch-project/ml-commons/issues/1148) to automatically deploy a model when a predict request arrives and the model is not deployed. He has a PoC ready.

I think that approach could solve your problem, but feel free to come up with different options and discuss.

ylwu-amzn avatar Feb 13 '24 07:02 ylwu-amzn

@ulan-yisaev I see you used a remote model in this case, so I think the auto-deploy with TTL strategy will work for you. Auto-deploy of local/custom models is not in scope yet, since it would block the "Predict" process and add more confusion. In short, manual deployment of local/custom models is still required, but remote models will be covered by the auto-deploy feature with TTL.

Zhangxunmt avatar Feb 15 '24 18:02 Zhangxunmt

@ulan-yisaev Remote model auto-deploy was released in 2.13, and auto model undeploy with TTL was released in 2.14. This issue should no longer occur. Closing now; feel free to reopen.

Zhangxunmt avatar May 07 '24 18:05 Zhangxunmt