modelmesh icon indicating copy to clipboard operation
modelmesh copied to clipboard

feat: Add kv-store connection check to readiness probe

Open Legion2 opened this issue 11 months ago • 4 comments

Motivation

When the modelmesh is not able to connect to the kv store to update its instance recording or sync with the other instances it can not reliably serve inference requests. For a short time a disconnect can be tolerated and the cached values can be used to serve requests. However after some time the data may be stale and the routing of requests may result in errors. For example with instance A and B, if A has connection issues while B leaves the mesh, A still have the outdated instance record and will still route inference requests to B, which fail. To prevent this, A should be marked unready if it can not connect to the kv store, to inform upstream proxies to not route traffic to A.

Modifications

Add existing verifyKvStoreConnection check to isReady check.

Result

isReady will return false when the model mesh instance lost connection to the kv store. Allowing systems such as kubernetes to react to this condition.

Legion2 avatar Mar 19 '24 13:03 Legion2

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Legion2 Once this PR has been reviewed and has the lgtm label, please assign rafvasq for approval by writing /assign @rafvasq in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

oss-prow-bot[bot] avatar Mar 19 '24 13:03 oss-prow-bot[bot]

Hmm, I'm not sure about this behavior being the default.

As stated, ModelMesh is intended to handle downtime in etcd by continuing to serve requests for existing models (adding models and scaling would fail). If all pods reported as Unready when etcd goes down, then no requests would succeed. Even if the downtime extends over a longer period, I think it would be better to serve the requests it can until the system recovers.

Where this change would help is in the situation described in the issue where one or two of the pods are disconnected from the KV store, but this situation is not easily distinguishable from a complete etcd downtime at the pod level. Merging this PR would prevent partial service in the case that etcd does go down. 🤔

That said, depending on the failure mode, perhaps reporting as NOT READY would be useful in some circumstances. If you do want this behavior for your use-case, I think we can add an experimental feature-flag via environment variable to control this behavior (off by default).

tjohnson31415 avatar Apr 09 '24 19:04 tjohnson31415

Thanks for the feedback, I will add a feature flag for it.

Legion2 avatar Apr 13 '24 09:04 Legion2

How should I implement the feature flag, as command line argument or as environment variable? I had a quick look at the code and I don't understand how to add a new command line argument or if there is any util for using features flags from the environment variables.

Legion2 avatar Jul 07 '24 12:07 Legion2