ml-commons
ml-commons copied to clipboard
[BUG] ML Model stuck in deploying status
What is the bug? It's reported from several cases that a ML model stuck in the "DEPLOYING" status and never refresh/update the status, when the cluster is under 1)heavy load or 2) scale up and downs. We need to revisit this function for edge cases that may cause the system into this condition. https://github.com/opensearch-project/ml-commons/blob/main/plugin/src/main/java/org/opensearch/ml/cluster/MLSyncUpCron.java#L363.
Currently this issue hasn't been reported by any open source customers so it's not duplicated in any open source environment. Developers will need to duplicate this error and propose a fix.
How can one reproduce the bug? Steps to reproduce the behavior:
- Go to '...'
- Click on '....'
- Scroll down to '....'
- See error
What is the expected behavior? A clear and concise description of what you expected to happen.
What is your host/environment?
- OS: [e.g. iOS]
- Version [e.g. 22]
- Plugins
Do you have any screenshots? If applicable, add screenshots to help explain your problem.
Do you have any additional context? Add any other context about the problem.