ml-commons
[BUG] model deployment DOES NOT fail when there is exception "Failed to retrieve model" due to TransportService "discovery node must not be null"
What is the bug? When using the OpenSearch operator, the following bug appears, possibly because the cluster is not yet fully initialized: the cluster reports "green" status while the model is being deployed, yet deployment fails. There is no recovery from this error; the model task remains in RUNNING state indefinitely after the exception. The bug must be in how the transport service is set up, but also in MLModelManager not moving the model task to FAILED after this exception.
[2025-02-25T01:14:55,229][ERROR][o.o.m.m.MLModelManager ] [opensearch-cluster-nodes-0] Failed to retrieve model EMCqOpUB3vMjZnAnVMU_
java.lang.NullPointerException: discovery node must not be null
at java.base/java.util.Objects.requireNonNull(Objects.java:259) ~[?:?]
at org.opensearch.transport.TransportService.isLocalNode(TransportService.java:1669) ~[opensearch-2.19.0.jar:2.19.0]
at org.opensearch.transport.TransportService.getConnection(TransportService.java:906) ~[opensearch-2.19.0.jar:2.19.0]
at org.opensearch.transport.TransportService.sendRequest(TransportService.java:850) ~[opensearch-2.19.0.jar:2.19.0]
at org.opensearch.ml.action.deploy.TransportDeployModelOnNodeAction.lambda$createDeployModelNodeResponse$3(TransportDeployModelOnNodeAction.java:191) ~[?:?]
at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-2.19.0.jar:2.19.0]
at org.opensearch.core.action.ActionListener$6.onFailure(ActionListener.java:311) ~[opensearch-core-2.19.0.jar:2.19.0]
at org.opensearch.core.action.ActionListener$6.onFailure(ActionListener.java:311) ~[opensearch-core-2.19.0.jar:2.19.0]
at org.opensearch.ml.model.MLModelManager.lambda$deployModel$49(MLModelManager.java:1289) ~[?:?]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.19.0.jar:2.19.0]
at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$77(MLModelManager.java:2150) [opensearch-ml-2.19.0.0.jar:2.19.0.0]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.19.0.jar:2.19.0]
at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.19.0.jar:2.19.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1014) [opensearch-2.19.0.jar:2.19.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.19.0.jar:2.19.0]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
[2025-02-25T01:14:55,280][ERROR][o.o.m.m.MLModelManager ] [opensearch-cluster-nodes-0] Failed to retrieve model chunk EMCqOpUB3vMjZnAnVMU__9
java.lang.NullPointerException: discovery node must not be null
at java.base/java.util.Objects.requireNonNull(Objects.java:259) ~[?:?]
at org.opensearch.transport.TransportService.isLocalNode(TransportService.java:1669) ~[opensearch-2.19.0.jar:2.19.0]
at org.opensearch.transport.TransportService.getConnection(TransportService.java:906) ~[opensearch-2.19.0.jar:2.19.0]
at org.opensearch.transport.TransportService.sendRequest(TransportService.java:850) ~[opensearch-2.19.0.jar:2.19.0]
at org.opensearch.ml.action.deploy.TransportDeployModelOnNodeAction.lambda$createDeployModelNodeResponse$3(TransportDeployModelOnNodeAction.java:191) ~[?:?]
at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) ~[opensearch-core-2.19.0.jar:2.19.0]
at org.opensearch.core.action.ActionListener$6.onFailure(ActionListener.java:311) ~[opensearch-core-2.19.0.jar:2.19.0]
at org.opensearch.core.action.ActionListener$6.onFailure(ActionListener.java:311) ~[opensearch-core-2.19.0.jar:2.19.0]
at org.opensearch.ml.model.MLModelManager.handleDeployModelException(MLModelManager.java:1532) ~[?:?]
at org.opensearch.ml.model.MLModelManager.lambda$deployModel$50(MLModelManager.java:1294) [opensearch-ml-2.19.0.0.jar:2.19.0.0]
at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) ~[opensearch-core-2.19.0.jar:2.19.0]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:84) [opensearch-core-2.19.0.jar:2.19.0]
at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$77(MLModelManager.java:2150) ~[?:?]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.19.0.jar:2.19.0]
at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.19.0.jar:2.19.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1014) [opensearch-2.19.0.jar:2.19.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.19.0.jar:2.19.0]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
How can one reproduce the bug?
- Install the OpenSearch operator. See the example nodePools section of the Helm chart below:
nodePools:
  - component: nodes
    replicas: 3
    diskSize: 8Gi
    resources:
      requests:
        memory: "4Gi"
        cpu: "500m"
      limits:
        memory: "4Gi"
        cpu: "500m"
    roles: [ cluster_manager, data, ml, ingest ]
    env:
      # Disable demo security config.
      - name: DISABLE_INSTALL_DEMO_CONFIG
        value: "true"
    probes:
      startup:
        initialDelaySeconds: 30
        periodSeconds: 20
        timeoutSeconds: 5
        successThreshold: 1
        failureThreshold: 30
- Deploy a custom model via API.
- Observe the error in ~50% of the cases.
What is the expected behavior? At a minimum, the model task must fail on such exceptions so the client can retry. Ideally, the transport service is initialized correctly so the deployment does not fail at all.
What is your host/environment? OS: Ubuntu 24.04 Version: 2.19.0
Do you have any screenshots? N/A
Do you have any additional context? This is critical for us. The workaround is poor: after observing the ML task in RUNNING state for a certain time, we abandon it and retry.
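The client-side workaround (abandon a task stuck in RUNNING and retry) can be sketched roughly as below. This is a minimal illustration, not the reporter's actual code: `base_url`, `task_id`, the timeout value, and the helper names are assumptions; the polled endpoint is the standard ML Commons task API (`GET /_plugins/_ml/tasks/<task_id>`).

```python
import json
import time
import urllib.request


def is_stuck(state: str, elapsed_s: float, timeout_s: float = 120.0) -> bool:
    """Treat a task still in RUNNING after the timeout as stuck (never moved to FAILED)."""
    return state == "RUNNING" and elapsed_s > timeout_s


def wait_for_deploy(base_url: str, task_id: str, timeout_s: float = 120.0) -> str:
    """Poll the ML Commons task API until the deploy task completes, fails, or looks stuck."""
    start = time.monotonic()
    while True:
        with urllib.request.urlopen(f"{base_url}/_plugins/_ml/tasks/{task_id}") as resp:
            state = json.load(resp).get("state", "")
        if state in ("COMPLETED", "FAILED"):
            return state
        if is_stuck(state, time.monotonic() - start, timeout_s):
            return "STUCK"  # abandon this deployment and retry from the client side
        time.sleep(5)
```

The point of the issue is that this polling should not be necessary: the task itself should transition to FAILED when the exception occurs.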
@nathaliellenaa could you please try to reproduce the issue in your end?
@maxlepikhin if you can give more step by step process for @nathaliellenaa to reproduce the issue, that'll be helpful.
A few questions:
- I assume this is a local model. Is that correct?
- Have you tried any pre-trained models? Does the issue persist with all types of local models?
- How did you set up the OpenSearch operator?
- Did the issue only resurface in the 2.19 release?
- I assume this is a local model. Is that correct?
The model is local in the sense that it is mounted into the OpenSearch container under /usr/shared/opensearch/custom_models, and the deployModel REST call points to that path. In a single-replica deployment without the operator it always worked without a problem; somehow multiple replicas cause this issue. Again, at a minimum the deployment task must fail so that it can be retried from the client side.
- Have you tried any pre-trained models? Does the issue persist with all types of local models?
This is one of the pre-trained models. Our requirement is that we must deploy from our own repository without allowing access to external websites, so the model.zip and config.json are actually baked into the image.
- How did you set up the OpenSearch operator?
Per operator guide:
helm repo add opensearch-operator https://opensearch-project.github.io/opensearch-k8s-operator/
helm install opensearch-operator opensearch-operator/opensearch-operator
- Did the issue only resurface in the 2.19 release?
I did not test an earlier version with the operator. We are eagerly awaiting 2.19.1, as it adds hot reload of TLS certificates.
More detailed steps to reproduce, for example in minikube (setting up minikube locally is out of scope for these steps):
- Install the operator:
helm repo add opensearch-operator https://opensearch-project.github.io/opensearch-k8s-operator/
helm install opensearch-operator opensearch-operator/opensearch-operator
- Create a basic OpenSearchCluster using the cluster.yaml below.
- I omitted the security config; otherwise, this is the one we used.
- Add custom_models under /usr/shared/opensearch using "OpenSearchCluster.spec.general.additionalVolumes". We use "huggingface/sentence-transformers/multi-qa-MiniLM-L6-cos-v1".
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: opensearch-cluster
spec:
  security:
    # TODO: may have to set up some security params here.
  general:
    image: opensearchproject/opensearch
    version: 2.19.0
    serviceName: opensearch-cluster
    additionalConfig:
      # Disable running models only on dedicated 'ml' nodes.
      plugins.ml_commons.only_run_on_ml_node: "false"
      # Allow registering custom models.
      plugins.ml_commons.allow_registering_model_via_url: "true"
      # These thresholds trigger this message "Memory Circuit Breaker is open, please check your resources"
      # It's unclear how the memory is calculated, setting these to 100 to disable.
      # https://opensearch.org/docs/latest/ml-commons-plugin/cluster-settings/#set-native-memory-threshold
      plugins.ml_commons.native_memory_threshold: "100"
      plugins.ml_commons.jvm_heap_memory_threshold: "100"
      # Search thread pool parameters.
      thread_pool.search.queue_size: "1000"
      thread_pool.search.size: "30"
  nodePools:
    - component: nodes
      replicas: 3
      diskSize: 8Gi
      resources:
        requests:
          memory: "4Gi"
          cpu: "500m"
        limits:
          memory: "4Gi"
          cpu: "500m"
      roles: [ cluster_manager, data, ml, ingest ]
      env:
        # Disable demo security config.
        - name: DISABLE_INSTALL_DEMO_CONFIG
          value: "true"
- Wait for the cluster to reach green status. Watch the status using:
kubectl describe OpenSearchCluster
- Call register model using "_plugins/_ml/models/_register"; follow the custom-model guide in the docs.
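For illustration, a registration request body for a local TorchScript sentence-transformers model might look like the sketch below. The endpoint (`POST /_plugins/_ml/models/_register`) is real, but the specific field values here (model name, `file://` URL, embedding dimension) are assumptions for this scenario; consult the custom-model docs for the full set of required fields (e.g. a model content hash).

```python
import json


def register_body(name: str, url: str, embedding_dimension: int = 384) -> dict:
    """Illustrative request body for POST /_plugins/_ml/models/_register
    when registering a custom model from a URL/path."""
    return {
        "name": name,
        "version": "1.0.0",
        "model_format": "TORCH_SCRIPT",
        "model_config": {
            "model_type": "bert",
            "embedding_dimension": embedding_dimension,
            "framework_type": "sentence_transformers",
        },
        "url": url,
    }


# Hypothetical path matching the custom_models mount described above.
body = register_body(
    "sentence-transformers/multi-qa-MiniLM-L6-cos-v1",
    "file:///usr/shared/opensearch/custom_models/model.zip",
)
print(json.dumps(body, indent=2))
```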
Inspecting the code to understand why the model task did not fail when that exception happened would be the first step toward fixing the issue. It must fail.
Thank you for providing the detailed steps @maxlepikhin. I was able to set up minikube locally, and I will follow the steps and see if I can reproduce the issue on my end.
@nathaliellenaa note that this happens when the cluster is in "green" status but shortly after it becomes "green". The workaround on our side was to give the cluster more time (1-2 minutes) after it first becomes "green" before deploying the model.
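That timing workaround (wait for green, then pause another 1-2 minutes before deploying) can be sketched as below. This is a minimal sketch, not the reporter's tooling: `base_url` and the grace period are assumptions; the health endpoint (`GET /_cluster/health?wait_for_status=green`) is the standard cluster health API.

```python
import json
import time
import urllib.request


def is_green(health: dict) -> bool:
    """True once the cluster health response reports green status."""
    return health.get("status") == "green"


def wait_for_green_plus_grace(base_url: str, grace_s: float = 120.0) -> None:
    """Block until cluster health is green, then wait an extra grace period
    before deploying the model, mirroring the workaround described above."""
    url = f"{base_url}/_cluster/health?wait_for_status=green&timeout=60s"
    while True:
        with urllib.request.urlopen(url) as resp:
            if is_green(json.load(resp)):
                break
        time.sleep(5)
    time.sleep(grace_s)  # cluster just turned green; give nodes time to settle
```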
Hi @maxlepikhin. I'm trying to set up the cluster, but I couldn't see the health status. Here are the steps that I followed:
- Install the operator
helm repo add opensearch-operator https://opensearch-project.github.io/opensearch-k8s-operator/
helm install opensearch-operator opensearch-operator/opensearch-operator
- Set up my cluster.yaml file
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: opensearch-cluster
spec:
  security:
    # TODO: may have to set up some security params here.
    config:
      adminCredentialsSecret:
        name: admin-credentials-secret
      securityConfigSecret:
        name: securityconfig-secret
    tls:
      transport:
        generate: true
      http:
        generate: true
  general:
    image: opensearchproject/opensearch
    version: 2.19.0
    serviceName: opensearch-cluster
    additionalConfig:
      # Disable running models only on dedicated 'ml' nodes.
      plugins.ml_commons.only_run_on_ml_node: "false"
      # Allow registering custom models.
      plugins.ml_commons.allow_registering_model_via_url: "true"
      # These thresholds trigger this message "Memory Circuit Breaker is open, please check your resources"
      # It's unclear how the memory is calculated, setting these to 100 to disable.
      # https://opensearch.org/docs/latest/ml-commons-plugin/cluster-settings/#set-native-memory-threshold
      plugins.ml_commons.native_memory_threshold: "100"
      plugins.ml_commons.jvm_heap_memory_threshold: "100"
      # Search thread pool parameters.
      thread_pool.search.queue_size: "1000"
      thread_pool.search.size: "30"
  nodePools:
    - component: nodes
      replicas: 3
      diskSize: 8Gi
      resources:
        requests:
          memory: "4Gi"
          cpu: "500m"
        limits:
          memory: "4Gi"
          cpu: "500m"
      roles: [ cluster_manager, data, ml, ingest ]
      env:
        # Disable demo security config.
        - name: DISABLE_INSTALL_DEMO_CONFIG
          value: "true"
- Apply the configuration
kubectl apply -f cluster.yaml
- Watch the cluster status
% kubectl describe opensearchcluster opensearch-cluster
Name:         opensearch-cluster
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  opensearch.opster.io/v1
Kind:         OpenSearchCluster
Metadata:
  Creation Timestamp:             2025-03-06T23:57:46Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2025-03-07T00:22:51Z
  Finalizers:
    Opster
  Generation:        4
  Resource Version:  101328
  UID:               4c10194d-d9b9-46ce-8411-c55b9126f5e3
Spec:
  Conf Mgmt:
  Dashboards:
    Opensearch Credentials Secret:
    Replicas:  0
    Resources:
    Version:
  General:
    Additional Config:
      plugins.ml_commons.allow_registering_model_via_url:  true
      plugins.ml_commons.jvm_heap_memory_threshold:        100
      plugins.ml_commons.native_memory_threshold:          100
      plugins.ml_commons.only_run_on_ml_node:              false
      thread_pool.search.queue_size:                       1000
      thread_pool.search.size:                             30
    Http Port:     9200
    Image:         opensearchproject/opensearch
    Service Name:  opensearch-cluster
    Version:       2.19.0
  Node Pools:
    Component:  nodes
    Disk Size:  8Gi
    Env:
      Name:   DISABLE_INSTALL_DEMO_CONFIG
      Value:  true
    Replicas:  3
    Resources:
      Limits:
        Cpu:     500m
        Memory:  4Gi
      Requests:
        Cpu:     500m
        Memory:  4Gi
    Roles:
      cluster_manager
      data
      ml
      ingest
  Security:
    Config:
      Admin Credentials Secret:
        Name:  admin-credentials-secret
      Admin Secret:
      Security Config Secret:
        Name:  securityconfig-secret
    Tls:
      Http:
        Ca Secret:
        Generate:  true
        Secret:
      Transport:
        Ca Secret:
        Generate:  true
        Secret:
Status:
  Components Status:
  Phase:    RUNNING
  Version:  2.19.0
Events:
  Type    Reason    Age   From                     Message
  ----    ------    ----  ----                     -------
  Normal  Security  39m   containerset-controller  Starting securityconfig update job
The health status is not showing up in the OpenSearchCluster resource status. I've completed the initial setup and the cluster appears to be running, but I can't see the health information. Did I miss something during the setup?
@nathaliellenaa you can try "kubectl get opensearchcluster opensearch-cluster -o yaml" or use curl to probe the health endpoint; possibly it's a difference between OSes.
@maxlepikhin I was able to create the OpenSearch cluster through the OpenSearch operator and run the register and deploy APIs, but I couldn't reproduce the error you encountered:
[2025-02-25T01:14:55,229][ERROR][o.o.m.m.MLModelManager ] [opensearch-cluster-nodes-0] Failed to retrieve model EMCqOpUB3vMjZnAnVMU_
This is the log from my cluster. We can see that the cluster became GREEN at 22:36:51 and the model was successfully deployed at 22:36:52, shortly after it became GREEN, as you mentioned. I also ran this process several times and still couldn't reproduce the error.
[2025-03-12T22:36:51,994][INFO ][o.o.c.r.a.AllocationService] [opensearch-cluster-nodes-0] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.plugins-ml-model][0]]]).
[2025-03-12T22:36:52,168][WARN ][o.o.c.r.a.AllocationService] [opensearch-cluster-nodes-0] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
[2025-03-12T22:36:52,287][INFO ][o.o.m.e.a.DLModel ] [opensearch-cluster-nodes-0] Model gcR4jJUB5Ou3cRm72isW is successfully deployed on 1 devices
[2025-03-12T22:36:52,293][INFO ][o.o.m.a.MLModelAutoReDeployer] [opensearch-cluster-nodes-0] No models needs to be auto redeployed!
[2025-03-12T22:36:52,293][INFO ][o.o.m.c.MLCommonsClusterManagerEventListener] [opensearch-cluster-nodes-0] Starting ML sync up job...
[2025-03-12T22:36:52,295][INFO ][o.o.m.a.f.TransportForwardAction] [opensearch-cluster-nodes-0] deploy model done with state: DEPLOYED, model id: gcR4jJUB5Ou3cRm72isW