KubeFATE
v1.9.0 flow model load error
**What deployment mode are you using?** Kubernetes.
**What KubeFATE and FATE version are you using?** v1.9.0
**What OS are you using for docker-compose or Kubernetes? Please also give the OS version.**
To Reproduce
flow model load -c fateflow/examples/model/publish_load_model.json
What happened?
The command returns the following unexpected response:
{ "data": { "detail": { "guest": { "9999": { "retcode": 100, "retmsg": "<_InactiveRpcError of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = "failed to connect to all addresses"\n\tdebug_error_string = "{"created":"@1662380004.985108799","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3217,"referenced_errors":[{"created":"@1662380004.985108157","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":165,"grpc_status":14}]}"\n>" } }, "host": { "10000": { "retcode": 0, "retmsg": "success" } } }, "guest": { "9999": 100 }, "host": { "10000": 0 } }, "jobId": "202209051213249727500", "retcode": 101, "retmsg": "failed" }
flow model load -c fateflow/examples/model/publish_load_model.json
Could you list all the steps you have been through before this step?
party-10000
kubectl exec -it client-0 -n fate-10000 bash
flow data upload -c fateflow/examples/upload/upload_host.json
response
{
"data": {
"board_url": "http://fateboard:8080/index.html#/dashboard?job_id=202209051315096029090&role=local&party_id=0",
"code": 0,
"dsl_path": "/data/projects/fate/fateflow/jobs/202209051315096029090/job_dsl.json",
"job_id": "202209051315096029090",
"logs_directory": "/data/projects/fate/fateflow/logs/202209051315096029090",
"message": "success",
"model_info": {
"model_id": "local-0#model",
"model_version": "202209051315096029090"
},
"namespace": "experiment",
"pipeline_dsl_path": "/data/projects/fate/fateflow/jobs/202209051315096029090/pipeline_dsl.json",
"runtime_conf_on_party_path": "/data/projects/fate/fateflow/jobs/202209051315096029090/local/0/job_runtime_on_party_conf.json",
"runtime_conf_path": "/data/projects/fate/fateflow/jobs/202209051315096029090/job_runtime_conf.json",
"table_name": "breast_hetero_host",
"train_runtime_conf_path": "/data/projects/fate/fateflow/jobs/202209051315096029090/train_runtime_conf.json"
},
"jobId": "202209051315096029090",
"retcode": 0,
"retmsg": "success"
}
party-9999
kubectl exec -it client-0 -n fate-9999 bash
flow data upload -c fateflow/examples/upload/upload_guest.json
response
{
"data": {
"board_url": "http://fateboard:8080/index.html#/dashboard?job_id=202209051318204094070&role=local&party_id=0",
"code": 0,
"dsl_path": "/data/projects/fate/fateflow/jobs/202209051318204094070/job_dsl.json",
"job_id": "202209051318204094070",
"logs_directory": "/data/projects/fate/fateflow/logs/202209051318204094070",
"message": "success",
"model_info": {
"model_id": "local-0#model",
"model_version": "202209051318204094070"
},
"namespace": "experiment",
"pipeline_dsl_path": "/data/projects/fate/fateflow/jobs/202209051318204094070/pipeline_dsl.json",
"runtime_conf_on_party_path": "/data/projects/fate/fateflow/jobs/202209051318204094070/local/0/job_runtime_on_party_conf.json",
"runtime_conf_path": "/data/projects/fate/fateflow/jobs/202209051318204094070/job_runtime_conf.json",
"table_name": "breast_hetero_guest",
"train_runtime_conf_path": "/data/projects/fate/fateflow/jobs/202209051318204094070/train_runtime_conf.json"
},
"jobId": "202209051318204094070",
"retcode": 0,
"retmsg": "success"
}
flow job submit -d fateflow/examples/lr/test_hetero_lr_job_dsl.json -c fateflow/examples/lr/test_hetero_lr_job_conf.json
response
{
"data": {
"board_url": "http://fateboard:8080/index.html#/dashboard?job_id=202209051319590011940&role=guest&party_id=9999",
"code": 0,
"dsl_path": "/data/projects/fate/fateflow/jobs/202209051319590011940/job_dsl.json",
"job_id": "202209051319590011940",
"logs_directory": "/data/projects/fate/fateflow/logs/202209051319590011940",
"message": "success",
"model_info": {
"model_id": "arbiter-10000#guest-9999#host-10000#model",
"model_version": "202209051319590011940"
},
"pipeline_dsl_path": "/data/projects/fate/fateflow/jobs/202209051319590011940/pipeline_dsl.json",
"runtime_conf_on_party_path": "/data/projects/fate/fateflow/jobs/202209051319590011940/guest/9999/job_runtime_on_party_conf.json",
"runtime_conf_path": "/data/projects/fate/fateflow/jobs/202209051319590011940/job_runtime_conf.json",
"train_runtime_conf_path": "/data/projects/fate/fateflow/jobs/202209051319590011940/train_runtime_conf.json"
},
"jobId": "202209051319590011940",
"retcode": 0,
"retmsg": "success"
}
flow task query -r guest -j 202209051319590011940 | grep -w f_status
response
"f_status": "success",
"f_status": "success",
"f_status": "success",
"f_status": "success",
"f_status": "success",
"f_status": "success",
"f_status": "success",
flow model deploy --model-id arbiter-10000#guest-9999#host-10000#model --model-version 202209051319590011940
response
{
"data": {
"arbiter": {
"10000": 0
},
"detail": {
"arbiter": {
"10000": {
"retcode": 0,
"retmsg": "deploy model of role arbiter 10000 success"
}
},
"guest": {
"9999": {
"retcode": 0,
"retmsg": "deploy model of role guest 9999 success"
}
},
"host": {
"10000": {
"retcode": 0,
"retmsg": "deploy model of role host 10000 success"
}
}
},
"guest": {
"9999": 0
},
"host": {
"10000": 0
},
"model_id": "arbiter-10000#guest-9999#host-10000#model",
"model_version": "202209051322265771410"
},
"retcode": 0,
"retmsg": "success"
}
Modify publish_load_model.json:
cat > fateflow/examples/model/publish_load_model.json <<EOF
{
"initiator": {
"party_id": "9999",
"role": "guest"
},
"role": {
"guest": [
"9999"
],
"host": [
"10000"
],
"arbiter": [
"10000"
]
},
"job_parameters": {
"model_id": "arbiter-10000#guest-9999#host-10000#model",
"model_version": "202209051322265771410"
}
}
EOF
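Before re-running the load, a quick sanity check that the edited file is still valid JSON (a sketch, assuming python is on the PATH inside the client container):
python -m json.tool fateflow/examples/model/publish_load_model.json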
flow model load -c fateflow/examples/model/publish_load_model.json
response
{
"data": {
"detail": {
"guest": {
"9999": {
"retcode": 100,
"retmsg": "<_InactiveRpcError of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"failed to connect to all addresses\"\n\tdebug_error_string = \"{\"created\":\"@1662384564.572906342\",\"description\":\"Failed to pick subchannel\",\"file\":\"src/core/ext/filters/client_channel/client_channel.cc\",\"file_line\":3217,\"referenced_errors\":[{\"created\":\"@1662384564.572905670\",\"description\":\"failed to connect to all addresses\",\"file\":\"src/core/lib/transport/error_utils.cc\",\"file_line\":165,\"grpc_status\":14}]}\"\n>"
}
},
"host": {
"10000": {
"retcode": 0,
"retmsg": "success"
}
}
},
"guest": {
"9999": 100
},
"host": {
"10000": 0
}
},
"jobId": "202209051329245513300",
"retcode": 101,
"retmsg": "failed"
}
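The guest-side retcode 100 with UNAVAILABLE suggests fateflow on party 9999 cannot reach the serving-server address rendered into its service configuration (the servingIp/servingPort from cluster.yaml). A sketch for inspecting that rendered config inside the python pod, assuming the default FATE conf path /data/projects/fate/conf/service_conf.yaml and that the serving addresses appear under a servings entry (the pod has two containers, so kubectl may warn and pick the default one unless -c is given):
kubectl exec -it python-0 -n fate-9999 -- grep -A 3 servings /data/projects/fate/conf/service_conf.yaml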
What does your cluster.yaml of 9999 look like? Have you also installed fate-serving within the same cluster?
Yes, within the same cluster.
party-9999
kube-fate mariadb-bf9dcd69b-52rdm 1/1 Running 0 19h
kube-fate kubefate-597c44897c-qqr2w 1/1 Running 4 19h
fate-9999 rollsite-68d6d478c9-w6kps 1/1 Running 0 17h
fate-9999 nodemanager-0 2/2 Running 0 17h
fate-9999 clustermanager-687b657b66-l5ghw 1/1 Running 0 17h
fate-9999 mysql-0 1/1 Running 0 17h
fate-9999 nodemanager-1 2/2 Running 0 17h
fate-9999 python-0 2/2 Running 0 17h
fate-9999 client-0 1/1 Running 0 16h
fate-serving-9999 serving-proxy-7686cddc4-sjcgk 1/1 Running 0 15h
fate-serving-9999 serving-redis-549b94cc7-djvr4 1/1 Running 0 15h
fate-serving-9999 serving-admin-57498849b7-jp5v7 1/1 Running 0 15h
fate-serving-9999 serving-server-7ff94bf986-4gnvf 1/1 Running 0 15h
fate-serving-9999 serving-zookeeper-0 1/1 Running 0 15h
party-10000
fate-10000 nodemanager-0 2/2 Running 0 17h
fate-10000 rollsite-554b49fc56-lm5h8 1/1 Running 0 17h
fate-10000 clustermanager-58fbcf745-gjgff 1/1 Running 0 17h
fate-10000 nodemanager-1 2/2 Running 0 17h
fate-10000 mysql-0 1/1 Running 0 17h
fate-10000 python-0 2/2 Running 0 17h
fate-10000 client-0 1/1 Running 0 17h
fate-serving-10000 serving-proxy-666c4974bb-4j5n4 1/1 Running 0 16h
fate-serving-10000 serving-admin-b9585b587-rmmng 1/1 Running 0 16h
fate-serving-10000 serving-server-69b5f5b5c4-zb4tw 1/1 Running 0 16h
fate-serving-10000 serving-redis-bc875b7f7-t2lzw 1/1 Running 0 16h
fate-serving-10000 serving-zookeeper-0 1/1 Running 0 16h
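For reference, the listings above can be reproduced per namespace with something like:
kubectl get pods -n kube-fate
kubectl get pods -n fate-9999
kubectl get pods -n fate-serving-9999
kubectl get pods -n fate-10000
kubectl get pods -n fate-serving-10000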
party-9999-yaml
cluster.yaml
name: fate-9999
namespace: fate-9999
chartName: fate
chartVersion: v1.9.0
partyId: 9999
registry: "10.0.1.200:5000/federatedai"
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
ingressClassName: nginx
modules:
  - rollsite
  - clustermanager
  - nodemanager
  - mysql
  - python
  - fateboard
  - client
computing: Eggroll
federation: Eggroll
storage: Eggroll
algorithm: Basic
device: CPU
ingress:
  fateboard:
    hosts:
    - name: party9999.fateboard.203.pclab
  client:
    hosts:
    - name: party9999.notebook.203.pclab
rollsite:
  type: NodePort
  nodePort: 30091
  partyList:
  - partyId: 10000
    partyIp: 10.0.1.205
    partyPort: 30101
python:
  type: NodePort
  httpNodePort: 30097
  grpcNodePort: 30092
  logLevel: INFO
servingIp: 10.0.1.203
servingPort: 30095
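In this chart, servingIp/servingPort (10.0.1.203:30095) is the serving-server address that fateflow is pointed at when it pushes a loaded model, so it has to line up with the NodePort exposed by the fate-serving deployment below (30095). A sketch for cross-checking which ports the serving services actually expose (service names are whatever the fate-serving chart created):
kubectl get svc -n fate-serving-9999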
cluster-serving.yaml
name: fate-serving-9999
namespace: fate-serving-9999
chartName: fate-serving
chartVersion: v2.1.6
partyId: 9999
registry: "10.0.1.200:5000/federatedai"
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
ingressClassName: nginx
modules:
  - servingProxy
  - servingRedis
  - servingServer
  - servingZookeeper
  - servingAdmin
ingress:
  servingProxy:
    hosts:
    - name: party9999.serving-proxy.203.pclab
      path: /
  servingAdmin:
    hosts:
    - name: party9999.serving-admin.203.pclab
      path: /
servingAdmin:
  username: admin
  password: admin
servingProxy:
  nodePort: 30096
  type: NodePort
  partyList:
  - partyId: 10000
    partyIp: 10.0.1.205
    partyPort: 30106
servingServer:
  type: NodePort
  nodePort: 30095
  fateflow:
    ip: 10.0.1.203
    port: 30097
  cacheSwitch: true
  cacheType: "redis"
  singleAdaptor: com.webank.ai.fate.serving.adaptor.dataaccess.MockAdapter
  batchAdaptor: com.webank.ai.fate.serving.adaptor.dataaccess.MockBatchAdapter
  AdapterURL: http://127.0.0.1:9380/v1/http/adapter/getFeature
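Conversely, the fateflow ip/port in the serving chart (10.0.1.203:30097) is how serving-server reaches fateflow, and it should match the python module's httpNodePort above (30097). A sketch for confirming that port is actually exposed on the fate-9999 side:
kubectl get svc -n fate-9999 | grep 30097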
party-10000-yaml
cluster.yaml
name: fate-10000
namespace: fate-10000
chartName: fate
chartVersion: v1.9.0
partyId: 10000
registry: "10.0.1.200:5000/federatedai"
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
ingressClassName: nginx
modules:
  - rollsite
  - clustermanager
  - nodemanager
  - mysql
  - python
  - fateboard
  - client
computing: Eggroll
federation: Eggroll
storage: Eggroll
algorithm: Basic
device: CPU
ingress:
  fateboard:
    hosts:
    - name: party10000.fateboard.205.pclab
  client:
    hosts:
    - name: party10000.notebook.205.pclab
rollsite:
  type: NodePort
  nodePort: 30101
  partyList:
  - partyId: 9999
    partyIp: 10.0.1.203
    partyPort: 30091
python:
  type: NodePort
  httpNodePort: 30107
  grpcNodePort: 30102
  logLevel: INFO
servingIp: 10.0.1.205
servingPort: 30105
cluster-serving.yaml
name: fate-serving-10000
namespace: fate-serving-10000
chartName: fate-serving
chartVersion: v2.1.6
partyId: 10000
registry: "10.0.1.200:5000/federatedai"
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
ingressClassName: nginx
modules:
  - servingProxy
  - servingRedis
  - servingServer
  - servingZookeeper
  - servingAdmin
ingress:
  servingProxy:
    hosts:
    - name: party10000.serving-proxy.205.pclab
      path: /
  servingAdmin:
    hosts:
    - name: party10000.serving-admin.205.pclab
      path: /
servingAdmin:
  username: admin
  password: admin
servingProxy:
  nodePort: 30106
  type: NodePort
  partyList:
  - partyId: 9999
    partyIp: 10.0.1.203
    partyPort: 30096
servingServer:
  type: NodePort
  nodePort: 30105
  fateflow:
    ip: 10.0.1.205
    port: 30107
  cacheSwitch: true
  cacheType: "redis"
  singleAdaptor: com.webank.ai.fate.serving.adaptor.dataaccess.MockAdapter
  batchAdaptor: com.webank.ai.fate.serving.adaptor.dataaccess.MockBatchAdapter
  AdapterURL: http://127.0.0.1:9380/v1/http/adapter/getFeature
flow model load -c fateflow/examples/model/publish_load_model.json
response
{
"data": {
"detail": {
"guest": {
"9999": {
"retcode": 100,
"retmsg": "<_InactiveRpcError of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"failed to connect to all addresses\"\n\tdebug_error_string = \"{\"created\":\"@1704163647.256664044\",\"description\":\"Failed to pick subchannel\",\"file\":\"src/core/ext/filters/client_channel/client_channel.cc\",\"file_line\":3217,\"referenced_errors\":[{\"created\":\"@1704163647.256661624\",\"description\":\"failed to connect to all addresses\",\"file\":\"src/core/lib/transport/error_utils.cc\",\"file_line\":165,\"grpc_status\":14}]}\"\n>"
}
},
"host": {
"10000": {
"retcode": 100,
"retmsg": "<_InactiveRpcError of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"failed to connect to all addresses\"\n\tdebug_error_string = \"{\"created\":\"@1704163647.290205261\",\"description\":\"Failed to pick subchannel\",\"file\":\"src/core/ext/filters/client_channel/client_channel.cc\",\"file_line\":3217,\"referenced_errors\":[{\"created\":\"@1704163647.290197412\",\"description\":\"failed to connect to all addresses\",\"file\":\"src/core/lib/transport/error_utils.cc\",\"file_line\":165,\"grpc_status\":14}]}\"\n>"
}
}
},
"guest": {
"9999": 100
},
"host": {
"10000": 100
}
},
"jobId": "202401021047272331530",
"retcode": 101,
"retmsg": "failed"
}
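Here both guest and host return UNAVAILABLE, which points to the same serving connectivity problem on each party. The fateflow logs usually show the exact address being dialed; a sketch for locating them inside the python pod (the logs root path is taken from the logs_directory values earlier in the thread, but whether a per-job directory exists for a model load is not guaranteed, so list first):
kubectl exec -it python-0 -n fate-9999 -- ls /data/projects/fate/fateflow/logs/
Then, if a directory named after the jobId (202401021047272331530) is present, inspect the files inside it.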
Same problem on version 1.8
Solved the problem. The key is to watch for any permission errors when deploying the training and serving stacks:
bash docker_deploy.sh all --training
bash docker_deploy.sh all --serving
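If the deploy scripts do hit permission problems, the error can scroll past unnoticed; a minimal sketch for capturing and searching the output (the log file names here are arbitrary):
bash docker_deploy.sh all --training 2>&1 | tee deploy_training.log
bash docker_deploy.sh all --serving 2>&1 | tee deploy_serving.log
grep -i "permission denied" deploy_training.log deploy_serving.log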