KubeFATE
Serving API returns "serviceId is not bind model" after pod restart
**What deployment mode are you using?** Kubernetes.
**What KubeFATE and FATE version are you using?** FATE: 1.8.0, KubeFATE: 1.8.0, FATE-Serving: 2.1.5
**What OS are you using for docker-compose or Kubernetes? Please also specify the OS version.**
- OS: Ubuntu
- Version: 20.04
**To Reproduce**
Hi mates, I have deployed a FATE cluster with the following components across 3 of my virtual machines, using k3s, KubeFATE, and persistent volumes (see the sketch after the list):
- fate-9999
- fate-10000
- fate-exchange
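For reference, a minimal sketch of how such a setup is installed with the KubeFATE CLI (the YAML file names below are placeholders, not the exact files I used):

```bash
# Install one FATE cluster per party plus the exchange; file names are placeholders.
kubefate cluster install -f fate-9999.yaml
kubefate cluster install -f fate-10000.yaml
kubefate cluster install -f fate-exchange.yaml
# The serving clusters are installed the same way from their cluster_serving.yml files.
```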
After all pods were up, I ran the toy test with the following steps:
- exec in client pod of fate-9999:
flow test toy -gid 9999 -hid 10000
- exec in client pod of fate-10000:
flow test toy -gid 10000 -hid 9999
It works fine.
Then I tried federated training and serving with the following steps:
- upload data from client pod of fate-10000:
flow data upload -c fateflow/examples/upload/upload_host.json
- upload data from client pod of fate-9999:
flow data upload -c fateflow/examples/upload/upload_guest.json
- start training from client pod of fate-9999:
flow job submit -d fateflow/examples/lr/test_hetero_lr_job_dsl.json -c fateflow/examples/lr/test_hetero_lr_job_conf.json
Training succeeded.
- deploy model from client pod of fate-9999:
flow model deploy --model-id arbiter-10000#guest-9999#host-10000#model --model-version 202205180944078730290
- load model from client pod of fate-9999:
flow model load -c fateflow/examples/model/publish_load_model.json
- bind model from client pod of fate-9999:
flow model bind -c fateflow/examples/model/bind_model_service.json
Model serving succeeded.
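For context on where the `serviceId` used below comes from: it is set in the bind config. A rough sketch of what bind_model_service.json looks like in this setup (field names follow the FATE 1.8 example file; the concrete values are adapted from this thread and are illustrative only, and the exact schema may differ by version):

```json
{
  "service_id": "202208020103166804320",
  "initiator": { "party_id": "9999", "role": "guest" },
  "role": {
    "guest": ["9999"],
    "host": ["10000"],
    "arbiter": ["10000"]
  },
  "job_parameters": {
    "model_id": "arbiter-10000#guest-9999#host-10000#model",
    "model_version": "202208020103166804320"
  },
  "servings": []
}
```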
Then I tested the deployed service ID with this command:
```bash
curl -X POST -H 'Content-Type: application/json' -i 'http://9999.serving-proxy.cluster01.com/federation/v1/inference' --data '{
  "head": {
    "serviceId": "202208020103166804320"
  },
  "body": {
    "featureData": {
      "x0": 1.88669,
      "x1": -1.359293,
      "x2": 2.303601,
      "x3": 2.00137,
      "x4": 1.307686
    },
    "sendToRemoteFeatureData": {
      "phone_num": "122222222"
    }
  }
}'
```
It worked fine, with response code 0.
However, after I restarted the virtual machines, I tested the deployed service ID with the same command.
**What happened?**
The API responded with code 104 and the message "serviceId is not bind model":
{"retcode":104,"retmsg":"serviceId is not bind model","data":{},"flag":0}
**Additional context**
I have checked the persistent volume path; the model files exist in the file system:
>> pwd
nfs/9999/kubefate/python/model-local-cache/guest#9999#arbiter-10000#guest-9999#host-10000#model
>> ls
202208011320263190340 202208020103166804320
I checked the log of the serving pod serving-proxy-58474f6bd4-6tn8d while executing the command to test the deployed service ID. The following lines appear in the log:
2022-08-03 01:09:58,232 [INFO ] c.w.a.f.s.p.r.r.ZkServingRouter(ZkServingRouter.java:64) - try to find zk ,serving:202208020103166804320:inference, result null
2022-08-03 01:09:58,232 [INFO ] c.w.a.f.s.p.r.r.BaseServingRouter(BaseServingRouter.java:69) - caseid 1d6925c46a1946a68caae44d38bb1891 get route info serving-server:8000
After I re-ran the load model and bind model steps, the test command succeeded. The logs in pod serving-proxy-58474f6bd4-6tn8d then show:
2022-08-03 01:18:43,986 [INFO ] c.w.a.f.s.p.r.r.ZkServingRouter(ZkServingRouter.java:64) - try to find zk ,serving:202208020103166804320:inference, result [grpc://10.42.0.193:8000/serving/202208020103166804320/inference?router_mode=ALL_ALLOWED&timestamp=1659489177401&version=215]
2022-08-03 01:18:43,986 [INFO ] c.w.a.f.s.p.r.r.BaseServingRouter(BaseServingRouter.java:69) - caseid d953dc35994e4209a96046df7ddbeefa get route info 10.42.0.193:8000
2022-08-03 01:18:43,991 [INFO ] c.w.a.f.s.p.r.r.BaseServingRouter(BaseServingRouter.java:69) - caseid 1659489523990 get route info 10.1.1.11:30006
I suspect it could be a ZooKeeper persistence issue and I will dig into it soon. Do you guys have any idea on this? Thanks a lot.
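If it helps, here is how the registration can be inspected directly in ZooKeeper (a sketch; the namespace, pod name, and the /FATE-SERVICES root path are assumptions from my setup and may differ):

```bash
# Assumed namespace/pod; the znode layout under /FATE-SERVICES may differ by fate-serving version.
kubectl exec -n fate-serving-9999 serving-zookeeper-0 -- \
  zkCli.sh -server localhost:2181 ls -R /FATE-SERVICES
```

Before the restart the bound service id shows up under the serving path; after the restart it is gone, which is consistent with the "result null" line above and the screenshots below.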
Screenshots of ZooKeeper (port 32001 is the NodePort service I created to expose port 2181 of pod serving-zookeeper-0):
Before restart: (screenshot)
After restart: (screenshot)
Hi @wood-j, did you deploy fate-serving with persistence enabled (`persistence: true`)?
@owlet42 Hi owlet, we have set `persistence: true` in our cluster.yml and cluster_serving.yml of fate-9999 and fate-10000, but it is false for the exchange. Is that the issue?
After I set `persistence: true` to update the exchange and exchange-serving and repeated the restart and test steps, I still get the same 104.
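For reference, the persistence change was applied by editing the cluster YAMLs and re-applying them (a sketch; the file names are placeholders, and `kubefate cluster update` is what I assume to be the standard way to push a changed YAML):

```bash
# Apply the edited definitions for the exchange and exchange-serving clusters (placeholder file names).
kubefate cluster update -f fate-exchange.yaml
kubefate cluster update -f fate-exchange-serving.yaml
```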
@wood-j Could you please show the cluster.yaml file of your exchange for a further check?
Hi @JingChen23, here is the cluster.yaml content of the exchange cluster:
```yaml
name: fate-exchange
namespace: fate-exchange
chartName: fate-exchange
chartVersion: v1.8.0
partyId: 1 #<<
registry: "10.1.1.1:4999/federatedai" #<<
imageTag: 1.8.0-release
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: true
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
modules:
  - rollsite
rollsite:
  type: NodePort
  nodePort: 30000 #<<
  partyList:
  - partyId: 9999 #<<
    partyIp: 10.1.1.11 #<<
    partyPort: 30091 #<<
  - partyId: 10000 #<<
    partyIp: 10.1.1.12 #<<
    partyPort: 30091 #<<
```
By the way, I tried to deploy a similar cluster with docker-compose from this doc, and added a persistent volume to ZooKeeper for serving:
```yaml
serving-zookeeper:
  image: "bitnami/zookeeper:3.7.0"
  ports:
    - "2181:2181"
    - "2888"
    - "3888"
  # +++
  volumes:
    - ./volume/zookeeper:/bitnami/zookeeper
```
I reproduced this issue by recreating the serving-server container with the following commands in /data/projects/fate/serving-9999:
```bash
CNTR=serving-9999_serving-server_1 && docker stop $CNTR && docker rm $CNTR
docker-compose up -d
```
Oops, I meant to say cluster_serving.yml but somehow I asked you to show the exchange cluster.yaml.
Could you please also show cluster_serving.yml? So that we can try to reproduce.
@JingChen23
> Oops, I meant to say cluster_serving.yml but somehow I asked you to show the exchange cluster.yaml. Could you please also show cluster_serving.yml? So that we can try to reproduce.

Here it is:
```yaml
name: fate-exchange-serving
namespace: fate-exchange-serving
chartName: fate-serving
chartVersion: v2.1.5
partyId: 2
registry: "10.1.1.1:4999/federatedai" #<<
imageTag: 2.1.5-release
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: true
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
modules:
  - servingProxy
servingProxy:
  nodePort: 30006 #<<
  type: NodePort
  partyList:
  - partyId: 9999 #<<
    partyIp: 10.1.1.11 #<<
    partyPort: 30096 #<<
  - partyId: 10000 #<<
    partyIp: 10.1.1.12 #<<
    partyPort: 30096 #<<
```
@wood-j Our human resources are limited; we can start reproducing this next week, please understand.
Hi @JingChen23 @owlet42, glad to tell you that this has been solved. The direct cause is a persistence issue of the serving-server container (pod): the following path is not included in the volumes:
- relative path: ./.fate/
- absolute path: /data/projects/fate-serving/serving-server/.fate
Backup (optional; see the sketch after this list):
- back up your cache from the container path /data/projects/fate-serving/serving-server/.fate
- copy your backup data into the persistent volume of the serving-server
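For the docker-compose case, a sketch of that backup/restore (the container name matches this deployment; the host backup path and the confs/serving-server/model_cache_path target are assumptions tied to the volume mount added below):

```bash
# Copy the serving-server cache out of the running container before recreating it.
docker cp serving-9999_serving-server_1:/data/projects/fate-serving/serving-server/.fate ./fate-cache-backup
# After adding the volume mount below, restore the backup into the new host path (assumed path).
mkdir -p ./confs/serving-server/model_cache_path
cp -a ./fate-cache-backup/. ./confs/serving-server/model_cache_path/
```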
For the docker-compose deployment:
Update docker-deploy/serving_template/docker-compose-serving.yml:
```yaml
services:
  serving-server:
    # ....
    volumes:
      # ++++
      - ./confs/serving-server/model_cache_path:/data/projects/fate-serving/serving-server/.fate
      # ...
  serving-zookeeper:
    # +++
    volumes:
      - ./confs/serving-server/zookeeper:/bitnami/zookeeper
    # ...
```
Then redeploying will be fine.
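A sketch of the redeploy for an already-generated compose project (assuming the generated docker-compose.yml under /data/projects/fate/serving-9999 has been regenerated or edited with the same volume entries):

```bash
cd /data/projects/fate/serving-9999
# Recreate the serving containers so they pick up the new volume mounts.
docker-compose down
docker-compose up -d
```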
For the K8s deployment, you need to update helm-charts/FATE-Serving/templates/serving-server-module.yaml:
```yaml
# ...
spec:
  containers:
    - image: {{ .Values.image.registry }}/serving-server:{{ .Values.image.tag }}
      imagePullPolicy: {{ .Values.image.pullPolicy }}
      name: serving-server
      ports:
        - containerPort: 9394
      volumeMounts:
        - mountPath: /data/projects/fate-serving/serving-server/conf/serving-server.properties
          name: serving-server-confs
          subPath: serving-server.properties
        # +-+-+-
        - name: data
          mountPath: /root/.fate
          subPath: cache # {{ .Values.servingServer.subPath }}
        - name: data
          mountPath: /data/projects/fate-serving/serving-server/.fate
          subPath: model_cache
# ...
```
- Rebuild:
  - rebuild charts: `cd helm-charts && make release`
  - upload charts to your cluster: `kubefate charts upload -f xxxx.tar.gz`
- Recreate cluster:
  - delete cluster: `kubefate cluster delete -f xxxx.yaml`
  - create cluster: `kubefate cluster create -f xxxx.yaml`
Then it should be fine.
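To double-check after recreating the cluster, I just confirmed the pods came back and re-ran the inference request from above (the namespace name here is an assumption from my setup):

```bash
# Verify the serving pods are running again (assumed namespace).
kubectl get pods -n fate-serving-9999
# Then re-run the same curl inference request; retcode 0 means the binding survived the restart.
```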
@wood-j Thanks for the finding! We can fix this in KubeFATE release v1.9.0.
@wood-j Do you mind creating a pull request to the develop-1.9.0 branch? We are eager for contributions from the community; this will keep the repo thriving.
This issue has been fixed and will be closed.