ml-commons
[FEATURE] Add a retry mechanism so the predict API can succeed when a node crashes
Is your feature request related to a problem? Currently the ml-commons predict API has no retry mechanism, so when a node crashes the user gets a failure result. For example: a cluster has two nodes, A and B, and each node keeps a model routing table internally. When A receives a request, it either dispatches the request to B or processes it itself, based on round robin. If A dispatches the request to B and B crashes, the user gets an error response.
What solution would you like? We should add a retryableListener to handle this case: when the failure is a retryable exception, we can retry by dispatching the request to the next node.
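A minimal sketch of how such a retrying listener could work, assuming a simple callback interface and a round-robin list of node IDs. All names here (`RetryingDispatcher`, `RetryableNodeException`, `Listener`, `Transport`) are hypothetical and are not the ml-commons or OpenSearch API; which exceptions actually count as retryable is also an assumption.

```java
import java.util.List;

/** Hypothetical sketch only: not the ml-commons or OpenSearch API. */
final class RetryingDispatcher {

    /** Hypothetical marker for failures that are safe to retry on another node. */
    static class RetryableNodeException extends RuntimeException {
        RetryableNodeException(String message) {
            super(message);
        }
    }

    /** Minimal async callback, assumed for illustration. */
    interface Listener<T> {
        void onResponse(T response);
        void onFailure(Exception e);
    }

    /** Hypothetical transport that sends a request to one node. */
    interface Transport<T> {
        void send(String nodeId, String request, Listener<T> listener);
    }

    /** Dispatches to nodes.get(start) first, then moves on to the next nodes on retryable failures. */
    static <T> void dispatch(List<String> nodes, int start, String request,
                             Transport<T> transport, Listener<T> finalListener) {
        if (nodes.isEmpty()) {
            finalListener.onFailure(new IllegalStateException("no eligible nodes"));
            return;
        }
        attempt(nodes, start, 0, request, transport, finalListener);
    }

    private static <T> void attempt(List<String> nodes, int start, int tried, String request,
                                    Transport<T> transport, Listener<T> finalListener) {
        // Round robin: each retry moves to the next node in the routing list.
        String node = nodes.get((start + tried) % nodes.size());
        transport.send(node, request, new Listener<T>() {
            @Override
            public void onResponse(T response) {
                finalListener.onResponse(response);
            }

            @Override
            public void onFailure(Exception e) {
                boolean retryable = e instanceof RetryableNodeException;
                boolean nodesLeft = tried + 1 < nodes.size();
                if (retryable && nodesLeft) {
                    attempt(nodes, start, tried + 1, request, transport, finalListener);
                } else {
                    // Non-retryable error, or every node has been tried once.
                    finalListener.onFailure(e);
                }
            }
        });
    }
}
```

In this sketch each node is tried at most once per request, so a cluster where every node is failing still returns an error to the caller instead of retrying forever.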
What alternatives have you considered?
Do you have any additional context? This is more of a general issue and could be addressed in OpenSearch core, so that other plugins with a similar issue can rely on it as well. I created an issue in OpenSearch core for this: https://github.com/opensearch-project/OpenSearch/issues/13157