ml-commons
[BUG] model deployment DOES NOT fail when there is exception "Failed to retrieve model" due to TransportService "discovery node must not be null"
What is the bug? When using the OpenSearch operator, the following bug appears, possibly because the cluster is not yet fully initialized: the cluster reports "green" status while the model is being deployed, yet deployment fails. There is no recovery from this error; the model task remains in RUNNING state indefinitely after the exception. The bug must be in how the transport service is set up, but also in MLModelManager not moving the model task to FAILED after this exception.
[2025-02-25T01:14:55,229][ERROR][o.o.m.m.MLModelManager ] [opensearch-cluster-nodes-0] Failed to retrieve model EMCqOpUB3vMjZnAnVMU_
java.lang.NullPointerException: discovery node must not be null
at java.base/java.util.Objects.requireNonNull(Objects.java:259) ~[?:?]
at org.opensearch.transport.TransportService.isLocalNode(TransportService.java:1669) ~[opensearch-2.19.0.jar:2.19.0]
at org.opensearch.transport.TransportService.getConnection(TransportService.java:906) ~[opensearch-2.19.0.jar:2.19.0]
at org.opensearch.transport.TransportService.sendRequest(TransportService.java:850) ~[opensearch-2.19.0.jar:2.19.0]
at org.opensearch.ml.action.deploy.TransportDeployModelOnNodeAction.lambda$createDeployModelNodeResponse$3(TransportDeployModelOnNodeAction.java:191) ~[?:?]
at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-2.19.0.jar:2.19.0]
at org.opensearch.core.action.ActionListener$6.onFailure(ActionListener.java:311) ~[opensearch-core-2.19.0.jar:2.19.0]
at org.opensearch.core.action.ActionListener$6.onFailure(ActionListener.java:311) ~[opensearch-core-2.19.0.jar:2.19.0]
at org.opensearch.ml.model.MLModelManager.lambda$deployModel$49(MLModelManager.java:1289) ~[?:?]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.19.0.jar:2.19.0]
at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$77(MLModelManager.java:2150) [opensearch-ml-2.19.0.0.jar:2.19.0.0]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.19.0.jar:2.19.0]
at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.19.0.jar:2.19.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1014) [opensearch-2.19.0.jar:2.19.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.19.0.jar:2.19.0]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
[2025-02-25T01:14:55,280][ERROR][o.o.m.m.MLModelManager ] [opensearch-cluster-nodes-0] Failed to retrieve model chunk EMCqOpUB3vMjZnAnVMU__9
java.lang.NullPointerException: discovery node must not be null
at java.base/java.util.Objects.requireNonNull(Objects.java:259) ~[?:?]
at org.opensearch.transport.TransportService.isLocalNode(TransportService.java:1669) ~[opensearch-2.19.0.jar:2.19.0]
at org.opensearch.transport.TransportService.getConnection(TransportService.java:906) ~[opensearch-2.19.0.jar:2.19.0]
at org.opensearch.transport.TransportService.sendRequest(TransportService.java:850) ~[opensearch-2.19.0.jar:2.19.0]
at org.opensearch.ml.action.deploy.TransportDeployModelOnNodeAction.lambda$createDeployModelNodeResponse$3(TransportDeployModelOnNodeAction.java:191) ~[?:?]
at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) ~[opensearch-core-2.19.0.jar:2.19.0]
at org.opensearch.core.action.ActionListener$6.onFailure(ActionListener.java:311) ~[opensearch-core-2.19.0.jar:2.19.0]
at org.opensearch.core.action.ActionListener$6.onFailure(ActionListener.java:311) ~[opensearch-core-2.19.0.jar:2.19.0]
at org.opensearch.ml.model.MLModelManager.handleDeployModelException(MLModelManager.java:1532) ~[?:?]
at org.opensearch.ml.model.MLModelManager.lambda$deployModel$50(MLModelManager.java:1294) [opensearch-ml-2.19.0.0.jar:2.19.0.0]
at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) ~[opensearch-core-2.19.0.jar:2.19.0]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:84) [opensearch-core-2.19.0.jar:2.19.0]
at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$77(MLModelManager.java:2150) ~[?:?]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.19.0.jar:2.19.0]
at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.19.0.jar:2.19.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1014) [opensearch-2.19.0.jar:2.19.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.19.0.jar:2.19.0]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
How can one reproduce the bug?
- Install the OpenSearch operator. See the example nodePools section of the Helm chart below:
nodePools:
  - component: nodes
    replicas: 3
    diskSize: 8Gi
    resources:
      requests:
        memory: "4Gi"
        cpu: "500m"
      limits:
        memory: "4Gi"
        cpu: "500m"
    roles: [ cluster_manager, data, ml, ingest ]
    env:
      # Disable demo security config.
      - name: DISABLE_INSTALL_DEMO_CONFIG
        value: "true"
    probes:
      startup:
        initialDelaySeconds: 30
        periodSeconds: 20
        timeoutSeconds: 5
        successThreshold: 1
        failureThreshold: 30
- Deploy a custom model via API.
- Observe the error in ~50% of the cases.
What is the expected behavior? At a minimum, the model task must fail on such exceptions so the client can retry. Ideally, the transport service is initialized correctly so the deployment does not fail at all.
What is your host/environment? OS: Ubuntu 24.04 Version: 2.19.0
Do you have any screenshots? N/A
Do you have any additional context? This is critical for us. The workaround is poor: after observing the ML task in RUNNING state for a certain time, we abandon it and retry.
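The client-side workaround (abandon a task stuck in RUNNING and retry) can be sketched roughly as below. This is a minimal illustration, not the reporter's actual code: `base_url`, `task_id`, the timeout value, and the helper names are assumptions; the polled endpoint is the standard ML Commons task API (`GET /_plugins/_ml/tasks/<task_id>`).

```python
import json
import time
import urllib.request


def is_stuck(state: str, elapsed_s: float, timeout_s: float = 120.0) -> bool:
    """Treat a task still in RUNNING after the timeout as stuck (never moved to FAILED)."""
    return state == "RUNNING" and elapsed_s > timeout_s


def wait_for_deploy(base_url: str, task_id: str, timeout_s: float = 120.0) -> str:
    """Poll the ML Commons task API until the deploy task completes, fails, or looks stuck."""
    start = time.monotonic()
    while True:
        with urllib.request.urlopen(f"{base_url}/_plugins/_ml/tasks/{task_id}") as resp:
            state = json.load(resp).get("state", "")
        if state in ("COMPLETED", "FAILED"):
            return state
        if is_stuck(state, time.monotonic() - start, timeout_s):
            return "STUCK"  # abandon this deployment and retry from the client side
        time.sleep(5)
```

The point of the issue is that this polling should not be necessary: the task itself should transition to FAILED when the exception occurs.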
@nathaliellenaa could you please try to reproduce the issue in your end?
@maxlepikhin if you can give more step by step process for @nathaliellenaa to reproduce the issue, that'll be helpful.
A few questions:
- I assume this is a local model. Is that correct?
- Have you tried any pre-trained models? Does the issue persist with all types of local models?
- How did you set up the OpenSearch operator?
- Did the issue only resurface in the 2.19 release?
- I assume this is a local model. Is that correct?
The model is local in the sense that it is mounted into the OpenSearch container under /usr/shared/opensearch/custom_models, and the deployModel REST call points to that path. In a single-replica deployment without the operator it always worked without a problem; somehow multiple replicas cause this issue. Again, at a minimum the deployment task must fail so that it can be retried from the client side.
- Have you tried any pre-trained models? Does the issue persist with all types of local models?
This is one of the pre-trained models. Our requirement is that we must deploy from our own repository without allowing access to external websites, so the model.zip and config.json are actually baked into the image.
- How did you set up the OpenSearch operator?
Per operator guide:
helm repo add opensearch-operator https://opensearch-project.github.io/opensearch-k8s-operator/
helm install opensearch-operator opensearch-operator/opensearch-operator
- Did the issue only resurface in the 2.19 release?
I did not test an earlier version with the operator. We are eagerly awaiting 2.19.1, as it adds hot reload of TLS certificates.
More detailed steps to reproduce, for example in minikube (setting up minikube locally is out of scope for these steps):
- Install the operator:
helm repo add opensearch-operator https://opensearch-project.github.io/opensearch-k8s-operator/
helm install opensearch-operator opensearch-operator/opensearch-operator
- Create a basic OpenSearchCluster using the cluster.yaml below.
- I omitted the security config; otherwise, this is the one we used.
- Add custom_models under /usr/shared/opensearch using "OpenSearchCluster.spec.general.additionalVolumes". We use "huggingface/sentence-transformers/multi-qa-MiniLM-L6-cos-v1".
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: opensearch-cluster
spec:
  security:
    # TODO: may have to set up some security params here.
  general:
    image: opensearchproject/opensearch
    version: 2.19.0
    serviceName: opensearch-cluster
    additionalConfig:
      # Disable running models only on dedicated 'ml' nodes.
      plugins.ml_commons.only_run_on_ml_node: "false"
      # Allow registering custom models.
      plugins.ml_commons.allow_registering_model_via_url: "true"
      # These thresholds trigger this message "Memory Circuit Breaker is open, please check your resources"
      # It's unclear how the memory is calculated, setting these to 100 to disable.
      # https://opensearch.org/docs/latest/ml-commons-plugin/cluster-settings/#set-native-memory-threshold
      plugins.ml_commons.native_memory_threshold: "100"
      plugins.ml_commons.jvm_heap_memory_threshold: "100"
      # Search thread pool parameters.
      thread_pool.search.queue_size: "1000"
      thread_pool.search.size: "30"
  nodePools:
    - component: nodes
      replicas: 3
      diskSize: 8Gi
      resources:
        requests:
          memory: "4Gi"
          cpu: "500m"
        limits:
          memory: "4Gi"
          cpu: "500m"
      roles: [ cluster_manager, data, ml, ingest ]
      env:
        # Disable demo security config.
        - name: DISABLE_INSTALL_DEMO_CONFIG
          value: "true"
- Wait for the cluster to reach green status. Watch the status using:
kubectl describe OpenSearchCluster
- Call register model using "_plugins/_ml/models/_register"; follow the custom-model guide in the docs.
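For illustration, a registration request body for a local TorchScript sentence-transformers model might look like the sketch below. The endpoint (`POST /_plugins/_ml/models/_register`) is real, but the specific field values here (model name, `file://` URL, embedding dimension) are assumptions for this scenario; consult the custom-model docs for the full set of required fields (e.g. a model content hash).

```python
import json


def register_body(name: str, url: str, embedding_dimension: int = 384) -> dict:
    """Illustrative request body for POST /_plugins/_ml/models/_register
    when registering a custom model from a URL/path."""
    return {
        "name": name,
        "version": "1.0.0",
        "model_format": "TORCH_SCRIPT",
        "model_config": {
            "model_type": "bert",
            "embedding_dimension": embedding_dimension,
            "framework_type": "sentence_transformers",
        },
        "url": url,
    }


# Hypothetical path matching the custom_models mount described above.
body = register_body(
    "sentence-transformers/multi-qa-MiniLM-L6-cos-v1",
    "file:///usr/shared/opensearch/custom_models/model.zip",
)
print(json.dumps(body, indent=2))
```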
Inspecting the code to understand why the model task did not fail when that exception happened would be the first step toward fixing the issue. It must fail.
Thank you for providing the detailed steps @maxlepikhin. I was able to set up minikube locally, and I will follow the steps and see if I can reproduce the issue on my end.
@nathaliellenaa note that this happens when the cluster is in "green" status but shortly after it becomes "green". The workaround on our side was to give the cluster more time (1-2 minutes) after it first becomes "green" before deploying the model.
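That timing workaround (wait for green, then pause another 1-2 minutes before deploying) can be sketched as below. This is a minimal sketch, not the reporter's tooling: `base_url` and the grace period are assumptions; the health endpoint (`GET /_cluster/health?wait_for_status=green`) is the standard cluster health API.

```python
import json
import time
import urllib.request


def is_green(health: dict) -> bool:
    """True once the cluster health response reports green status."""
    return health.get("status") == "green"


def wait_for_green_plus_grace(base_url: str, grace_s: float = 120.0) -> None:
    """Block until cluster health is green, then wait an extra grace period
    before deploying the model, mirroring the workaround described above."""
    url = f"{base_url}/_cluster/health?wait_for_status=green&timeout=60s"
    while True:
        with urllib.request.urlopen(url) as resp:
            if is_green(json.load(resp)):
                break
        time.sleep(5)
    time.sleep(grace_s)  # cluster just turned green; give nodes time to settle
```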
Hi @maxlepikhin. I'm trying to set up the cluster, but I couldn't see the health status. Here are the steps that I followed:
- Install the operator
helm repo add opensearch-operator https://opensearch-project.github.io/opensearch-k8s-operator/
helm install opensearch-operator opensearch-operator/opensearch-operator
- Set up my cluster.yaml file
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: opensearch-cluster
spec:
  security:
    # TODO: may have to set up some security params here.
    config:
      adminCredentialsSecret:
        name: admin-credentials-secret
      securityConfigSecret:
        name: securityconfig-secret
    tls:
      transport:
        generate: true
      http:
        generate: true
  general:
    image: opensearchproject/opensearch
    version: 2.19.0
    serviceName: opensearch-cluster
    additionalConfig:
      # Disable running models only on dedicated 'ml' nodes.
      plugins.ml_commons.only_run_on_ml_node: "false"
      # Allow registering custom models.
      plugins.ml_commons.allow_registering_model_via_url: "true"
      # These thresholds trigger this message "Memory Circuit Breaker is open, please check your resources"
      # It's unclear how the memory is calculated, setting these to 100 to disable.
      # https://opensearch.org/docs/latest/ml-commons-plugin/cluster-settings/#set-native-memory-threshold
      plugins.ml_commons.native_memory_threshold: "100"
      plugins.ml_commons.jvm_heap_memory_threshold: "100"
      # Search thread pool parameters.
      thread_pool.search.queue_size: "1000"
      thread_pool.search.size: "30"
  nodePools:
    - component: nodes
      replicas: 3
      diskSize: 8Gi
      resources:
        requests:
          memory: "4Gi"
          cpu: "500m"
        limits:
          memory: "4Gi"
          cpu: "500m"
      roles: [ cluster_manager, data, ml, ingest ]
      env:
        # Disable demo security config.
        - name: DISABLE_INSTALL_DEMO_CONFIG
          value: "true"
- Apply the configuration
kubectl apply -f cluster.yaml
- Watch the cluster status
% kubectl describe opensearchcluster opensearch-cluster
Name:         opensearch-cluster
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  opensearch.opster.io/v1
Kind:         OpenSearchCluster
Metadata:
  Creation Timestamp:             2025-03-06T23:57:46Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2025-03-07T00:22:51Z
  Finalizers:
    Opster
  Generation:        4
  Resource Version:  101328
  UID:               4c10194d-d9b9-46ce-8411-c55b9126f5e3
Spec:
  Conf Mgmt:
  Dashboards:
    Opensearch Credentials Secret:
    Replicas:  0
    Resources:
    Version:
  General:
    Additional Config:
      plugins.ml_commons.allow_registering_model_via_url:  true
      plugins.ml_commons.jvm_heap_memory_threshold:        100
      plugins.ml_commons.native_memory_threshold:          100
      plugins.ml_commons.only_run_on_ml_node:              false
      thread_pool.search.queue_size:                       1000
      thread_pool.search.size:                             30
    Http Port:     9200
    Image:         opensearchproject/opensearch
    Service Name:  opensearch-cluster
    Version:       2.19.0
  Node Pools:
    Component:  nodes
    Disk Size:  8Gi
    Env:
      Name:   DISABLE_INSTALL_DEMO_CONFIG
      Value:  true
    Replicas:  3
    Resources:
      Limits:
        Cpu:     500m
        Memory:  4Gi
      Requests:
        Cpu:     500m
        Memory:  4Gi
    Roles:
      cluster_manager
      data
      ml
      ingest
  Security:
    Config:
      Admin Credentials Secret:
        Name:  admin-credentials-secret
      Admin Secret:
      Security Config Secret:
        Name:  securityconfig-secret
    Tls:
      Http:
        Ca Secret:
        Generate:  true
        Secret:
      Transport:
        Ca Secret:
        Generate:  true
        Secret:
Status:
  Components Status:
  Phase:    RUNNING
  Version:  2.19.0
Events:
  Type    Reason    Age   From                     Message
  ----    ------    ----  ----                     -------
  Normal  Security  39m   containerset-controller  Starting securityconfig update job
The health status is not showing up in the OpenSearchCluster resource status. I've completed the initial setup and the cluster appears to be running, but I can't see the health information. Did I miss something during the setup?
@nathaliellenaa you can try "kubectl get opensearchcluster opensearch-cluster -o yaml" or use curl to probe the health endpoint; possibly it's a difference between OSes.
@maxlepikhin I was able to create the OpenSearch cluster through the OpenSearch operator and run the register and deploy APIs, but I couldn't reproduce the error you encountered:
[2025-02-25T01:14:55,229][ERROR][o.o.m.m.MLModelManager ] [opensearch-cluster-nodes-0] Failed to retrieve model EMCqOpUB3vMjZnAnVMU_
This is the log from my cluster. We can see that the cluster became GREEN at 22:36:51 and the model was successfully deployed at 22:36:52, shortly after it became GREEN, as you mentioned. I also ran this process several times and still couldn't reproduce the error.
[2025-03-12T22:36:51,994][INFO ][o.o.c.r.a.AllocationService] [opensearch-cluster-nodes-0] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.plugins-ml-model][0]]]).
[2025-03-12T22:36:52,168][WARN ][o.o.c.r.a.AllocationService] [opensearch-cluster-nodes-0] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
[2025-03-12T22:36:52,287][INFO ][o.o.m.e.a.DLModel ] [opensearch-cluster-nodes-0] Model gcR4jJUB5Ou3cRm72isW is successfully deployed on 1 devices
[2025-03-12T22:36:52,293][INFO ][o.o.m.a.MLModelAutoReDeployer] [opensearch-cluster-nodes-0] No models needs to be auto redeployed!
[2025-03-12T22:36:52,293][INFO ][o.o.m.c.MLCommonsClusterManagerEventListener] [opensearch-cluster-nodes-0] Starting ML sync up job...
[2025-03-12T22:36:52,295][INFO ][o.o.m.a.f.TransportForwardAction] [opensearch-cluster-nodes-0] deploy model done with state: DEPLOYED, model id: gcR4jJUB5Ou3cRm72isW