
serving API returns "serviceId is not bind model" after pod restart

wood-j opened this issue 2 years ago • 12 comments

**What deployment mode are you using?** 2. Kubernetes.

**What KubeFATE and FATE version are you using?** FATE: 1.8.0, KubeFATE: 1.8.0, FATE-Serving: 2.1.5

**What OS are you using for docker-compose or Kubernetes? Please also specify the OS version.**

  • OS: Ubuntu
  • Version: 20.04

To Reproduce

Hi mates, I have deployed a FATE cluster with the following components across 3 of my virtual machines, using k3s, KubeFATE, and persistent volumes (a quick pod check is sketched right after the list):

  • fate-9999
  • fate-10000
  • fate-exchange
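
For reference, a quick way to confirm that all pods are up before running the tests; the namespaces below are an assumption based on the cluster names above, so adjust them to your cluster.yaml:

# a minimal sketch, assuming one namespace per cluster listed above
for ns in fate-9999 fate-10000 fate-exchange; do
  echo "== ${ns} =="
  kubectl get pods -n "${ns}" -o wide
done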

After all pods were up, I ran the toy example with the following steps:

  • exec in the client pod of fate-9999: flow test toy -gid 9999 -hid 10000
  • exec in the client pod of fate-10000: flow test toy -gid 10000 -hid 9999

It's working fine.

Then I tried federated training and serving with the following steps:

  • upload data from the client pod of fate-10000: flow data upload -c fateflow/examples/upload/upload_host.json
  • upload data from the client pod of fate-9999: flow data upload -c fateflow/examples/upload/upload_guest.json
  • start training from the client pod of fate-9999: flow job submit -d fateflow/examples/lr/test_hetero_lr_job_dsl.json -c fateflow/examples/lr/test_hetero_lr_job_conf.json

Training succeeded.

  • deploy model from client pod of fate-9999: flow model deploy --model-id arbiter-10000#guest-9999#host-10000#model --model-version 202205180944078730290
  • load model from client pod of fate-9999: flow model load -c fateflow/examples/model/publish_load_model.json
  • bind model from client pod of fate-9999: flow model bind -c fateflow/examples/model/bind_model_service.json

Model loading and binding succeeded.

Then I tested the deployed service ID with the following command:

curl -X POST -H 'Content-Type: application/json' -i 'http://9999.serving-proxy.cluster01.com/federation/v1/inference' --data '{
    "head": {
        "serviceId": "202208020103166804320"
    },
    "body": {
        "featureData": {
            "x0": 1.88669,
            "x1": -1.359293,
            "x2": 2.303601,
            "x3": 2.00137,
            "x4": 1.307686
        },
        "sendToRemoteFeatureData": {
            "phone_num": "122222222"
        }
    }
}'

It worked fine, returning response code 0.

However, after I restarted the virtual machines, I tested the deployed service ID with the same command.

What happened?

The API responds with code 104 and the message "serviceId is not bind model":

{"retcode":104,"retmsg":"serviceId is not bind model","data":{},"flag":0}

Additional context

I have checked the persistent volume path; the model files exist in the file system:

>> pwd
nfs/9999/kubefate/python/model-local-cache/guest#9999#arbiter-10000#guest-9999#host-10000#model
>> ls
202208011320263190340  202208020103166804320

I checked the log of the serving pod serving-proxy-58474f6bd4-6tn8d while executing the command to test the deployed service ID. The following lines appear in the log:

2022-08-03 01:09:58,232 [INFO ] c.w.a.f.s.p.r.r.ZkServingRouter(ZkServingRouter.java:64) - try to find zk ,serving:202208020103166804320:inference, result null
2022-08-03 01:09:58,232 [INFO ] c.w.a.f.s.p.r.r.BaseServingRouter(BaseServingRouter.java:69) - caseid 1d6925c46a1946a68caae44d38bb1891 get route info serving-server:8000

After I re-ran the load model and bind model steps, the test command succeeded. The logs in pod serving-proxy-58474f6bd4-6tn8d show:

2022-08-03 01:18:43,986 [INFO ] c.w.a.f.s.p.r.r.ZkServingRouter(ZkServingRouter.java:64) - try to find zk ,serving:202208020103166804320:inference, result [grpc://10.42.0.193:8000/serving/202208020103166804320/inference?router_mode=ALL_ALLOWED&timestamp=1659489177401&version=215]
2022-08-03 01:18:43,986 [INFO ] c.w.a.f.s.p.r.r.BaseServingRouter(BaseServingRouter.java:69) - caseid d953dc35994e4209a96046df7ddbeefa get route info 10.42.0.193:8000
2022-08-03 01:18:43,991 [INFO ] c.w.a.f.s.p.r.r.BaseServingRouter(BaseServingRouter.java:69) - caseid 1659489523990 get route info 10.1.1.11:30006

I suspect it could be a ZooKeeper persistence issue and I will dig into it soon. Do you guys have any idea on this? Thanks a lot.
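
For anyone who wants to look at the registry directly, here is a rough sketch of how to browse the serving registrations in ZooKeeper. The zkCli path assumes the bitnami image, the namespace is an assumption from my deployment, and the znode layout is only a guess from the proxy logs, so browse from the root if it differs:

# open a zk shell inside the zookeeper pod (bitnami image path assumed)
kubectl exec -it serving-zookeeper-0 -n fate-serving-9999 -- \
  /opt/bitnami/zookeeper/bin/zkCli.sh -server 127.0.0.1:2181

# inside the zkCli shell, list znodes from the root and drill down until you find
# the registration for your service id, e.g. a node containing
# "serving/202208020103166804320/inference":
#   ls /
#   ls /FATE-SERVICES   # this path is a guess; browse whatever exists at the root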

wood-j avatar Aug 03 '22 01:08 wood-j

Screenshots of ZooKeeper below; port 32001 is the NodePort service I created to expose port 2181 of pod serving-zookeeper-0.

Before restart:

After restart:

wood-j avatar Aug 03 '22 02:08 wood-j

Hi @wood-j, did you deploy fate-serving with persistence enabled (persistence: true)?

owlet42 avatar Aug 03 '22 02:08 owlet42

@owlet42 Hi owlet, we have set persistence: true in the cluster.yml and cluster_serving.yml of fate-9999 and fate-10000, but false for the exchange; is that the issue? After I set persistence: true, updated the exchange and exchange-serving, and repeated the restart and test steps above, I still get the same 104.

wood-j avatar Aug 03 '22 02:08 wood-j

@wood-j Could you please show the cluster.yaml file of your exchange so we can check further?

JingChen23 avatar Aug 04 '22 01:08 JingChen23

Hi @JingChen23, here is the cluster.yaml content of exchange cluster.

name: fate-exchange
namespace: fate-exchange
chartName: fate-exchange
chartVersion: v1.8.0
partyId: 1 #<<
registry: "10.1.1.1:4999/federatedai" #<<
imageTag: 1.8.0-release
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: true
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
modules:
  - rollsite

rollsite:
  type: NodePort
  nodePort: 30000 #<<
  partyList:
  - partyId: 9999 #<<
    partyIp: 10.1.1.11 #<<
    partyPort: 30091 #<<
  - partyId: 10000 #<<
    partyIp: 10.1.1.12 #<<
    partyPort: 30091 #<<

wood-j avatar Aug 04 '22 02:08 wood-j

By the way, I tried to deploy a similar cluster with docker-compose following this doc, and added a persistent volume to ZooKeeper for serving:

  serving-zookeeper:
    image: "bitnami/zookeeper:3.7.0"
    ports:
      - "2181:2181"
      - "2888"
      - "3888"
    # +++
    volumes:
      - ./volume/zookeeper:/bitnami/zookeeper

I reproduced this issue by recreating the serving-server container, executing the following commands in /data/projects/fate/serving-9999:

CNTR=serving-9999_serving-server_1 && docker stop $CNTR && docker rm $CNTR
docker-compose up -d

wood-j avatar Aug 04 '22 02:08 wood-j

Oops, I meant to ask for cluster_serving.yml but somehow asked you to show the exchange cluster.yaml.

Could you please also show cluster_serving.yml, so that we can try to reproduce?

JingChen23 avatar Aug 04 '22 02:08 JingChen23

@JingChen23

> Oops, I meant to ask for cluster_serving.yml but somehow asked you to show the exchange cluster.yaml.
>
> Could you please also show cluster_serving.yml, so that we can try to reproduce?

name: fate-exchange-serving
namespace: fate-exchange-serving
chartName: fate-serving
chartVersion: v2.1.5
partyId: 2
registry: "10.1.1.1:4999/federatedai" #<<
imageTag: 2.1.5-release
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: true
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
modules:
  - servingProxy

servingProxy:
  nodePort: 30006 #<<
  type: NodePort
  partyList:
  - partyId: 9999 #<<
    partyIp: 10.1.1.11 #<<
    partyPort: 30096 #<<
  - partyId: 10000 #<<
    partyIp: 10.1.1.12 #<<
    partyPort: 30096 #<<

wood-j avatar Aug 04 '22 02:08 wood-j

@wood-j Our human resources are limited; we can start reproducing this next week, please understand.

JingChen23 avatar Aug 04 '22 02:08 JingChen23

Hi @JingChen23 @owlet42, glad to tell you that this has been solved. The direct cause is a persistence issue in the serving-server container (pod): the following path is not included in the volumes:

  • relative path: ./.fate/, whose absolute path is /data/projects/fate-serving/serving-server/.fate

Backup

Optional:

  • back up your cache from the container path /data/projects/fate-serving/serving-server/.fate
  • copy the backup data to the persistent volume of the serving server (see the sketch after this list)
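
A minimal sketch of that backup and restore; the docker container name comes from earlier in this thread, while the Kubernetes namespace and pod name are placeholders you need to replace:

# docker-compose deployment: copy the cache out of the running serving-server container
docker cp serving-9999_serving-server_1:/data/projects/fate-serving/serving-server/.fate ./fate-cache-backup

# Kubernetes deployment: copy the cache out of the serving-server pod
# (namespace/pod are placeholders; look yours up with `kubectl get pods -A`)
kubectl cp fate-serving-9999/serving-server-xxxxxxxxxx-xxxxx:/data/projects/fate-serving/serving-server/.fate ./fate-cache-backup

# once the volume mounts below are in place, copy the backup into the new
# persistent location before starting serving-server again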

For docker-compose deployment:

Update docker-deploy/serving_template/docker-compose-serving.yml:

services:
  serving-server:
    # ....
    volumes:
      # ++++
      - ./confs/serving-server/model_cache_path:/data/projects/fate-serving/serving-server/.fate
# ...

  serving-zookeeper:
    # +++
    volumes:
      - ./confs/serving-server/zookeeper:/bitnami/zookeeper
# ...

Then redeploying will be fine.
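
To double-check that the new mounts took effect after redeploying, something like this should work (container name as used earlier in this thread, host path from the compose snippet above):

# show how the container's paths are mounted; .fate should map to a host directory
docker inspect serving-9999_serving-server_1 \
  --format '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{"\n"}}{{end}}'

# the bound model cache now survives container recreation on the host side
ls ./confs/serving-server/model_cache_path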

For k8s deployment:

Update helm-charts/FATE-Serving/templates/serving-server-module.yaml:

# ...
    spec:
      containers:
        - image: {{ .Values.image.registry }}/serving-server:{{ .Values.image.tag }}
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          name: serving-server
          ports:
            - containerPort: 9394
          volumeMounts:
            - mountPath: /data/projects/fate-serving/serving-server/conf/serving-server.properties
              name: serving-server-confs
              subPath: serving-server.properties
# +-+-+-
            - name: data
              mountPath: /root/.fate
              subPath: cache # {{ .Values.servingServer.subPath }}
            - name: data
              mountPath: /data/projects/fate-serving/serving-server/.fate
              subPath: model_cache
# ...
  1. Rebuild:
  • rebuild the charts: cd helm-charts && make release
  • upload the charts to your cluster: kubefate charts upload -f xxxx.tar.gz
  2. Recreate the cluster:
  • delete the cluster: kubefate cluster delete -f xxxx.yaml
  • create the cluster: kubefate cluster create -f xxxx.yaml

Then it should be fine.
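
And a quick way to verify on Kubernetes that the cache really lands on the persistent volume; the namespace and pod name are placeholders:

# the .fate directory should now live on the mounted volume and survive pod restarts
kubectl -n fate-serving-9999 exec -it serving-server-xxxxxxxxxx-xxxxx -- \
  ls -la /data/projects/fate-serving/serving-server/.fate

# confirm the declared volume mounts on the pod
kubectl -n fate-serving-9999 describe pod serving-server-xxxxxxxxxx-xxxxx | grep -A 8 'Mounts:'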

wood-j avatar Aug 04 '22 10:08 wood-j

@wood-j Thanks for the finding! We can fix this in the KubeFATE v1.9.0 release.

JingChen23 avatar Aug 09 '22 06:08 JingChen23

@wood-j Do you mind creating a pull request against branch develop-1.9.0? We are eager for contributions from the community; this will help the repo thrive.

JingChen23 avatar Aug 09 '22 06:08 JingChen23

This issue has been fixed, so it will be closed.

owlet42 avatar Aug 18 '22 05:08 owlet42