
v1.9.0 flow model load error

szshary opened this issue on Sep 05 '22 · 7 comments

**What deployment mode are you using?** Kubernetes.

**What KubeFATE and FATE version are you using?** v1.9.0


To Reproduce

Run:

flow model load -c fateflow/examples/model/publish_load_model.json

What happened? The command fails with this unexpected response:

{
    "data": {
        "detail": {
            "guest": {
                "9999": {
                    "retcode": 100,
                    "retmsg": "<_InactiveRpcError of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"failed to connect to all addresses\"\n\tdebug_error_string = \"{\"created\":\"@1662380004.985108799\",\"description\":\"Failed to pick subchannel\",\"file\":\"src/core/ext/filters/client_channel/client_channel.cc\",\"file_line\":3217,\"referenced_errors\":[{\"created\":\"@1662380004.985108157\",\"description\":\"failed to connect to all addresses\",\"file\":\"src/core/lib/transport/error_utils.cc\",\"file_line\":165,\"grpc_status\":14}]}\"\n>"
                }
            },
            "host": {
                "10000": {
                    "retcode": 0,
                    "retmsg": "success"
                }
            }
        },
        "guest": {
            "9999": 100
        },
        "host": {
            "10000": 0
        }
    },
    "jobId": "202209051213249727500",
    "retcode": 101,
    "retmsg": "failed"
}

szshary avatar Sep 05 '22 12:09 szshary

flow model load -c fateflow/examples/model/publish_load_model.json

Could you list all the steps you have been through before this step?

JingChen23 avatar Sep 05 '22 12:09 JingChen23

flow model load -c fateflow/examples/model/publish_load_model.json

Could you list all the steps you have been through before this step?

party-10000

kubectl exec -it client-0 -n fate-10000 bash
flow data upload -c fateflow/examples/upload/upload_host.json

response

{
    "data": {
        "board_url": "http://fateboard:8080/index.html#/dashboard?job_id=202209051315096029090&role=local&party_id=0",
        "code": 0,
        "dsl_path": "/data/projects/fate/fateflow/jobs/202209051315096029090/job_dsl.json",
        "job_id": "202209051315096029090",
        "logs_directory": "/data/projects/fate/fateflow/logs/202209051315096029090",
        "message": "success",
        "model_info": {
            "model_id": "local-0#model",
            "model_version": "202209051315096029090"
        },
        "namespace": "experiment",
        "pipeline_dsl_path": "/data/projects/fate/fateflow/jobs/202209051315096029090/pipeline_dsl.json",
        "runtime_conf_on_party_path": "/data/projects/fate/fateflow/jobs/202209051315096029090/local/0/job_runtime_on_party_conf.json",
        "runtime_conf_path": "/data/projects/fate/fateflow/jobs/202209051315096029090/job_runtime_conf.json",
        "table_name": "breast_hetero_host",
        "train_runtime_conf_path": "/data/projects/fate/fateflow/jobs/202209051315096029090/train_runtime_conf.json"
    },
    "jobId": "202209051315096029090",
    "retcode": 0,
    "retmsg": "success"
}
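Before moving on to the guest side, a quick sanity check that the upload really created the table can be run inside the same client-0 pod; a sketch, assuming the `flow table info` subcommand is available in this FATE version:

# confirm the uploaded host table exists under the experiment namespace
flow table info -t breast_hetero_host -n experiment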

party-9999

kubectl exec -it client-0 -n fate-9999 bash
flow data upload -c fateflow/examples/upload/upload_guest.json

response

{
    "data": {
        "board_url": "http://fateboard:8080/index.html#/dashboard?job_id=202209051318204094070&role=local&party_id=0",
        "code": 0,
        "dsl_path": "/data/projects/fate/fateflow/jobs/202209051318204094070/job_dsl.json",
        "job_id": "202209051318204094070",
        "logs_directory": "/data/projects/fate/fateflow/logs/202209051318204094070",
        "message": "success",
        "model_info": {
            "model_id": "local-0#model",
            "model_version": "202209051318204094070"
        },
        "namespace": "experiment",
        "pipeline_dsl_path": "/data/projects/fate/fateflow/jobs/202209051318204094070/pipeline_dsl.json",
        "runtime_conf_on_party_path": "/data/projects/fate/fateflow/jobs/202209051318204094070/local/0/job_runtime_on_party_conf.json",
        "runtime_conf_path": "/data/projects/fate/fateflow/jobs/202209051318204094070/job_runtime_conf.json",
        "table_name": "breast_hetero_guest",
        "train_runtime_conf_path": "/data/projects/fate/fateflow/jobs/202209051318204094070/train_runtime_conf.json"
    },
    "jobId": "202209051318204094070",
    "retcode": 0,
    "retmsg": "success"
}
flow job submit -d fateflow/examples/lr/test_hetero_lr_job_dsl.json -c fateflow/examples/lr/test_hetero_lr_job_conf.json

response

{
    "data": {
        "board_url": "http://fateboard:8080/index.html#/dashboard?job_id=202209051319590011940&role=guest&party_id=9999",
        "code": 0,
        "dsl_path": "/data/projects/fate/fateflow/jobs/202209051319590011940/job_dsl.json",
        "job_id": "202209051319590011940",
        "logs_directory": "/data/projects/fate/fateflow/logs/202209051319590011940",
        "message": "success",
        "model_info": {
            "model_id": "arbiter-10000#guest-9999#host-10000#model",
            "model_version": "202209051319590011940"
        },
        "pipeline_dsl_path": "/data/projects/fate/fateflow/jobs/202209051319590011940/pipeline_dsl.json",
        "runtime_conf_on_party_path": "/data/projects/fate/fateflow/jobs/202209051319590011940/guest/9999/job_runtime_on_party_conf.json",
        "runtime_conf_path": "/data/projects/fate/fateflow/jobs/202209051319590011940/job_runtime_conf.json",
        "train_runtime_conf_path": "/data/projects/fate/fateflow/jobs/202209051319590011940/train_runtime_conf.json"
    },
    "jobId": "202209051319590011940",
    "retcode": 0,
    "retmsg": "success"
}
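Before deploying the model it can help to confirm the whole job has finished on the guest side; a small sketch with the flow CLI in the same client pod:

# check the overall status of job 202209051319590011940 as guest 9999
flow job query -j 202209051319590011940 -r guest -p 9999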
flow task query -r guest -j 202209051319590011940 | grep -w f_status

response

            "f_status": "success",
            "f_status": "success",
            "f_status": "success",
            "f_status": "success",
            "f_status": "success",
            "f_status": "success",
            "f_status": "success",
flow model deploy --model-id arbiter-10000#guest-9999#host-10000#model --model-version 202209051319590011940

response

{
    "data": {
        "arbiter": {
            "10000": 0
        },
        "detail": {
            "arbiter": {
                "10000": {
                    "retcode": 0,
                    "retmsg": "deploy model of role arbiter 10000 success"
                }
            },
            "guest": {
                "9999": {
                    "retcode": 0,
                    "retmsg": "deploy model of role guest 9999 success"
                }
            },
            "host": {
                "10000": {
                    "retcode": 0,
                    "retmsg": "deploy model of role host 10000 success"
                }
            }
        },
        "guest": {
            "9999": 0
        },
        "host": {
            "10000": 0
        },
        "model_id": "arbiter-10000#guest-9999#host-10000#model",
        "model_version": "202209051322265771410"
    },
    "retcode": 0,
    "retmsg": "success"
}

modify publish_load_model.json

cat > fateflow/examples/model/publish_load_model.json <<EOF
{
  "initiator": {
    "party_id": "9999",
    "role": "guest"
  },
  "role": {
    "guest": [
      "9999"
    ],
    "host": [
      "10000"
    ],
    "arbiter": [
      "10000"
    ]
  },
  "job_parameters": {
    "model_id": "arbiter-10000#guest-9999#host-10000#model",
    "model_version": "202209051322265771410"
  }
}
EOF
flow model load -c fateflow/examples/model/publish_load_model.json

response

{
    "data": {
        "detail": {
            "guest": {
                "9999": {
                    "retcode": 100,
                    "retmsg": "<_InactiveRpcError of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"failed to connect to all addresses\"\n\tdebug_error_string = \"{\"created\":\"@1662384564.572906342\",\"description\":\"Failed to pick subchannel\",\"file\":\"src/core/ext/filters/client_channel/client_channel.cc\",\"file_line\":3217,\"referenced_errors\":[{\"created\":\"@1662384564.572905670\",\"description\":\"failed to connect to all addresses\",\"file\":\"src/core/lib/transport/error_utils.cc\",\"file_line\":165,\"grpc_status\":14}]}\"\n>"
                }
            },
            "host": {
                "10000": {
                    "retcode": 0,
                    "retmsg": "success"
                }
            }
        },
        "guest": {
            "9999": 100
        },
        "host": {
            "10000": 0
        }
    },
    "jobId": "202209051329245513300",
    "retcode": 101,
    "retmsg": "failed"
}
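Since host 10000 loads successfully while guest 9999 fails with StatusCode.UNAVAILABLE, the problem is local to party 9999: its FATE Flow cannot reach the serving endpoint it tries to push the model to. One hedged way to see the full traceback (and the address being dialed) right after re-running the load is to tail the FATE Flow error log; the log path and the container name python are assumptions for a default KubeFATE install:

# tail the FATE Flow error log on party 9999 (path and container name are assumptions)
kubectl exec -it python-0 -c python -n fate-9999 -- \
  tail -n 50 /data/projects/fate/fateflow/logs/fate_flow/ERROR.log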

szshary avatar Sep 05 '22 13:09 szshary

what does your cluster.yaml of 9999 look like? Have you also installed a fate-serving within the same cluster?

JingChen23 avatar Sep 05 '22 15:09 JingChen23

what does your cluster.yaml of 9999 look like? Have you also installed a fate-serving within the same cluster?

yes, within the same cluster

party-9999

kube-fate           mariadb-bf9dcd69b-52rdm                   1/1     Running     0          19h
kube-fate           kubefate-597c44897c-qqr2w                 1/1     Running     4          19h
fate-9999           rollsite-68d6d478c9-w6kps                 1/1     Running     0          17h
fate-9999           nodemanager-0                             2/2     Running     0          17h
fate-9999           clustermanager-687b657b66-l5ghw           1/1     Running     0          17h
fate-9999           mysql-0                                   1/1     Running     0          17h
fate-9999           nodemanager-1                             2/2     Running     0          17h
fate-9999           python-0                                  2/2     Running     0          17h
fate-9999           client-0                                  1/1     Running     0          16h
fate-serving-9999   serving-proxy-7686cddc4-sjcgk             1/1     Running     0          15h
fate-serving-9999   serving-redis-549b94cc7-djvr4             1/1     Running     0          15h
fate-serving-9999   serving-admin-57498849b7-jp5v7            1/1     Running     0          15h
fate-serving-9999   serving-server-7ff94bf986-4gnvf           1/1     Running     0          15h
fate-serving-9999   serving-zookeeper-0                       1/1     Running     0          15h

party-10000

fate-10000           nodemanager-0                             2/2     Running     0          17h
fate-10000           rollsite-554b49fc56-lm5h8                 1/1     Running     0          17h
fate-10000           clustermanager-58fbcf745-gjgff            1/1     Running     0          17h
fate-10000           nodemanager-1                             2/2     Running     0          17h
fate-10000           mysql-0                                   1/1     Running     0          17h
fate-10000           python-0                                  2/2     Running     0          17h
fate-10000           client-0                                  1/1     Running     0          17h
fate-serving-10000   serving-proxy-666c4974bb-4j5n4            1/1     Running     0          16h
fate-serving-10000   serving-admin-b9585b587-rmmng             1/1     Running     0          16h
fate-serving-10000   serving-server-69b5f5b5c4-zb4tw           1/1     Running     0          16h
fate-serving-10000   serving-redis-bc875b7f7-t2lzw             1/1     Running     0          16h
fate-serving-10000   serving-zookeeper-0                       1/1     Running     0          16h
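The serving pods are all Running in both parties, but the failing call is a network hop from FATE Flow to serving-server, so it is also worth listing the services to confirm the serving-server NodePorts are actually exposed (plain kubectl against the namespaces shown above):

kubectl get svc -n fate-serving-9999
kubectl get svc -n fate-serving-10000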

party-9999-yaml

cluster.yaml

name: fate-9999
namespace: fate-9999
chartName: fate
chartVersion: v1.9.0
partyId: 9999
registry: "10.0.1.200:5000/federatedai"
pullPolicy:
imagePullSecrets: 
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
ingressClassName: nginx
modules:
  - rollsite
  - clustermanager
  - nodemanager
  - mysql
  - python
  - fateboard
  - client

computing: Eggroll
federation: Eggroll
storage: Eggroll
algorithm: Basic
device: CPU

ingress:
  fateboard:
    hosts:
    - name: party9999.fateboard.203.pclab
  client:  
    hosts:
    - name: party9999.notebook.203.pclab

rollsite:
  type: NodePort
  nodePort: 30091
  partyList:
    - partyId: 10000
      partyIp: 10.0.1.205
      partyPort: 30101

python:
  type: NodePort
  httpNodePort: 30097
  grpcNodePort: 30092
  logLevel: INFO

servingIp: 10.0.1.203
servingPort: 30095
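In this chart, servingIp/servingPort (10.0.1.203:30095) is the address party 9999's FATE Flow uses to push models to fate-serving, so that endpoint must be reachable from inside the python-0 pod. A quick reachability sketch, assuming the fateflow container is named python and that bash plus coreutils timeout are available in it:

# test a raw TCP connection from the fateflow container to the configured serving address
kubectl exec -it python-0 -c python -n fate-9999 -- \
  bash -c "timeout 3 bash -c 'cat < /dev/null > /dev/tcp/10.0.1.203/30095' && echo reachable || echo unreachable"

If this prints unreachable, the load error above is expected, and the servingIp/servingPort values (or the serving-server NodePort they point at) need another look.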

cluster-serving.yaml

name: fate-serving-9999
namespace: fate-serving-9999
chartName: fate-serving
chartVersion: v2.1.6
partyId: 9999
registry: "10.0.1.200:5000/federatedai"
pullPolicy:
imagePullSecrets: 
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
ingressClassName: nginx
modules:
  - servingProxy
  - servingRedis
  - servingServer
  - servingZookeeper
  - servingAdmin
  
ingress:
  servingProxy: 
    hosts:
    - name: party9999.serving-proxy.203.pclab
      path: /
  servingAdmin: 
    hosts:
    - name: party9999.serving-admin.203.pclab
      path: /
      
servingAdmin:
  username: admin
  password: admin

servingProxy: 
  nodePort: 30096
  type: NodePort
  partyList:
  - partyId: 10000
    partyIp: 10.0.1.205
    partyPort: 30106

servingServer:
  type: NodePort
  nodePort: 30095
  fateflow:
    ip: 10.0.1.203
    port: 30097
  cacheSwitch: true
  cacheType: "redis"
  singleAdaptor: com.webank.ai.fate.serving.adaptor.dataaccess.MockAdapter
  batchAdaptor: com.webank.ai.fate.serving.adaptor.dataaccess.MockBatchAdapter
  AdapterURL: http://127.0.0.1:9380/v1/http/adapter/getFeature

party-10000-yaml

cluster.yaml

name: fate-10000
namespace: fate-10000
chartName: fate
chartVersion: v1.9.0
partyId: 10000
registry: "10.0.1.200:5000/federatedai"
pullPolicy:
imagePullSecrets: 
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
ingressClassName: nginx
modules:
  - rollsite
  - clustermanager
  - nodemanager
  - mysql
  - python
  - fateboard
  - client

computing: Eggroll
federation: Eggroll
storage: Eggroll
algorithm: Basic
device: CPU

ingress:
  fateboard: 
    hosts:
    - name: party10000.fateboard.205.pclab
  client:  
    hosts:
    - name: party10000.notebook.205.pclab

rollsite: 
  type: NodePort
  nodePort: 30101
  partyList:
    - partyId: 9999
      partyIp: 10.0.1.203
      partyPort: 30091

python:
  type: NodePort
  httpNodePort: 30107
  grpcNodePort: 30102
  logLevel: INFO

servingIp: 10.0.1.205
servingPort: 30105

cluster-serving.yaml

name: fate-serving-10000
namespace: fate-serving-10000
chartName: fate-serving
chartVersion: v2.1.6
partyId: 10000
registry: "10.0.1.200:5000/federatedai"
pullPolicy:
imagePullSecrets: 
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
ingressClassName: nginx
modules:
  - servingProxy
  - servingRedis
  - servingServer
  - servingZookeeper
  - servingAdmin

ingress:
  servingProxy: 
    hosts:
    - name: party10000.serving-proxy.205.pclab
      path: /
  servingAdmin: 
    hosts:
    - name: party10000.serving-admin.205.pclab
      path: /

servingAdmin:
  username: admin
  password: admin

servingProxy: 
  nodePort: 30106
  type: NodePort
  partyList:
  - partyId: 9999
    partyIp: 10.0.1.203
    partyPort: 30096

servingServer:
  type: NodePort
  nodePort: 30105
  fateflow:
    ip: 10.0.1.205
    port: 30107
  cacheSwitch: true
  cacheType: "redis"
  singleAdaptor: com.webank.ai.fate.serving.adaptor.dataaccess.MockAdapter
  batchAdaptor: com.webank.ai.fate.serving.adaptor.dataaccess.MockBatchAdapter
  AdapterURL: http://127.0.0.1:9380/v1/http/adapter/getFeature

szshary avatar Sep 06 '22 03:09 szshary

flow model load -c fateflow/examples/model/publish_load_model.json

{
    "data": {
        "detail": {
            "guest": {
                "9999": {
                    "retcode": 100,
                    "retmsg": "<_InactiveRpcError of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"failed to connect to all addresses\"\n\tdebug_error_string = \"{\"created\":\"@1704163647.256664044\",\"description\":\"Failed to pick subchannel\",\"file\":\"src/core/ext/filters/client_channel/client_channel.cc\",\"file_line\":3217,\"referenced_errors\":[{\"created\":\"@1704163647.256661624\",\"description\":\"failed to connect to all addresses\",\"file\":\"src/core/lib/transport/error_utils.cc\",\"file_line\":165,\"grpc_status\":14}]}\"\n>"
                }
            },
            "host": {
                "10000": {
                    "retcode": 100,
                    "retmsg": "<_InactiveRpcError of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"failed to connect to all addresses\"\n\tdebug_error_string = \"{\"created\":\"@1704163647.290205261\",\"description\":\"Failed to pick subchannel\",\"file\":\"src/core/ext/filters/client_channel/client_channel.cc\",\"file_line\":3217,\"referenced_errors\":[{\"created\":\"@1704163647.290197412\",\"description\":\"failed to connect to all addresses\",\"file\":\"src/core/lib/transport/error_utils.cc\",\"file_line\":165,\"grpc_status\":14}]}\"\n>"
                }
            }
        },
        "guest": {
            "9999": 100
        },
        "host": {
            "10000": 100
        }
    },
    "jobId": "202401021047272331530",
    "retcode": 101,
    "retmsg": "failed"
}

jmj2633500154 avatar Jan 02 '24 06:01 jmj2633500154

Same problem on version 1.8

linzzzzzz avatar Feb 19 '24 14:02 linzzzzzz

Solved the problem. The key is to watch for any permission errors when deploying the training and serving stacks:

bash docker_deploy.sh all --training

bash docker_deploy.sh all --serving
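To check for this on your own deployment, one sketch is to re-run both deploy scripts while capturing their output and then search it for permission failures (the log filenames here are arbitrary):

# re-run the deploy scripts, keep their output, and look for permission errors
bash docker_deploy.sh all --training 2>&1 | tee training_deploy.log
bash docker_deploy.sh all --serving  2>&1 | tee serving_deploy.log
grep -i "permission denied" training_deploy.log serving_deploy.log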

linzzzzzz avatar Feb 19 '24 16:02 linzzzzzz