
Kuscia centralized cluster deployed on K8s in runp mode: privacy computation job fails

Open Meng-xiangkun opened this issue 1 year ago • 35 comments

Issue Type

Feature

Search for existing issues similar to yours

Yes

Kuscia Version

0.10.0b0

Link to Relevant Documentation

No response

Question Details

The following errors occur when running a private set intersection (PSI) job with the kuscia-secretflow:latest image:
Failed to process object: error handling "dppm-qvxgwzap-node-35", failed to process kusciaTask "dppm-qvxgwzap-node-35", failed to build domain bob kit info, failed to get appImage "secretflow-image" from cache, appimage.kuscia.secretflow "secretflow-image" not found, retry
Failed to update kuscia job "dppm" status, Operation cannot be fulfilled on kusciajobs.kuscia.secretflow "dppm": the object has been modified; please apply your changes to the latest version and try again
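
The second message is the standard Kubernetes optimistic-concurrency conflict: the update was computed from a stale copy of the object, so the API server rejected it. Controllers normally re-read the object and retry, so on its own this message is usually harmless. A minimal Python sketch of that retry pattern, using a hypothetical in-memory `FakeStore` rather than any real Kuscia or Kubernetes client:

```python
class Conflict(Exception):
    """Raised when an update carries a stale resource version."""


class FakeStore:
    """Hypothetical in-memory stand-in for an API server (illustration only)."""

    def __init__(self):
        self.version = 1
        self.status = "Pending"

    def get(self):
        return {"version": self.version, "status": self.status}

    def update(self, obj):
        # Reject writes that were computed from an outdated copy.
        if obj["version"] != self.version:
            raise Conflict("the object has been modified")
        self.status = obj["status"]
        self.version += 1


def update_status_with_retry(store, new_status, stale_copy=None, max_attempts=3):
    """Try to write new_status; on conflict, re-read the latest object and retry."""
    obj = stale_copy or store.get()
    for attempt in range(1, max_attempts + 1):
        try:
            store.update({**obj, "status": new_status})
            return attempt
        except Conflict:
            obj = store.get()  # refresh to the latest version, then try again
    raise Conflict(f"gave up after {max_attempts} attempts")


store = FakeStore()
stale = store.get()                           # copy taken at version 1
store.update({**stale, "status": "Running"})  # another writer bumps the version
attempts = update_status_with_retry(store, "Failed", stale_copy=stale)
print(attempts, store.status, store.version)
```

The first attempt fails because it still carries the old version; the second one succeeds after the refresh.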

Meng-xiangkun avatar Sep 13 '24 01:09 Meng-xiangkun

image

2024-09-12 18:30:34.303 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status

2024-09-12 18:30:34.317 INFO resources/kusciajob.go:118 Finish updating kuscia job "dppm" status

2024-09-12 18:30:34.317 INFO kusciajob/controller.go:298 Finished syncing KusciaJob "dppm" (13.420693ms)

2024-09-12 18:30:34.317 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (13.470899ms)

2024-09-12 18:30:34.317 INFO resources/kusciajob.go:82 update kuscia job dppm

2024-09-12 18:30:34.329 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (12.672843ms)

2024-09-12 18:30:34.330 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status

2024-09-12 18:30:34.343 INFO resources/kusciajob.go:118 Finish updating kuscia job "dppm" status

2024-09-12 18:30:34.343 INFO kusciajob/controller.go:298 Finished syncing KusciaJob "dppm" (13.248207ms)

2024-09-12 18:30:34.343 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (13.29884ms)

2024-09-12 18:30:34.345 INFO handler/job_scheduler.go:323 Create kuscia tasks: dppm-qvxgwzap-node-35

2024-09-12 18:30:34.357 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status

2024-09-12 18:30:34.369 WARN kusciatask/controller.go:424 Error handling "dppm-qvxgwzap-node-35", re-queuing

2024-09-12 18:30:34.369 ERROR kusciatask/controller.go:435 Failed to process object: error handling "dppm-qvxgwzap-node-35", failed to process kusciaTask "dppm-qvxgwzap-node-35", failed to build domain bob kit info, failed to get appImage "secretflow-image" from cache, appimage.kuscia.secretflow "secretflow-image" not found, retry

2024-09-12 18:30:34.370 INFO resources/kusciajob.go:118 Finish updating kuscia job "dppm" status

2024-09-12 18:30:34.370 INFO kusciajob/controller.go:298 Finished syncing KusciaJob "dppm" (25.113735ms)

2024-09-12 18:30:34.370 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (25.15742ms)

2024-09-12 18:30:34.370 INFO handler/job_scheduler.go:661 jobStatusPhaseFrom readyTasks={}, tasks={{taskId=dppm-qvxgwzap-node-35, dependencies=[], tolerable=false, phase=}}, kusciaJobId=dppm

2024-09-12 18:30:34.370 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status

2024-09-12 18:30:34.383 WARN kusciatask/controller.go:424 Error handling "dppm-qvxgwzap-node-35", re-queuing

2024-09-12 18:30:34.383 ERROR kusciatask/controller.go:435 Failed to process object: error handling "dppm-qvxgwzap-node-35", failed to process kusciaTask "dppm-qvxgwzap-node-35", failed to build domain bob kit info, failed to get appImage "secretflow-image" from cache, appimage.kuscia.secretflow "secretflow-image" not found, retry

2024-09-12 18:30:34.385 INFO resources/kusciajob.go:118 Finish updating kuscia job "dppm" status

2024-09-12 18:30:34.386 INFO kusciajob/controller.go:298 Finished syncing KusciaJob "dppm" (15.795756ms)

2024-09-12 18:30:34.386 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (15.879731ms)

2024-09-12 18:30:34.388 INFO handler/job_scheduler.go:661 jobStatusPhaseFrom readyTasks={}, tasks={{taskId=dppm-qvxgwzap-node-35, dependencies=[], tolerable=false, phase=}}, kusciaJobId=dppm

2024-09-12 18:30:34.388 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (488.279µs)

2024-09-12 18:30:34.399 WARN kusciatask/controller.go:424 Error handling "dppm-qvxgwzap-node-35", re-queuing

2024-09-12 18:30:34.399 ERROR kusciatask/controller.go:435 Failed to process object: error handling "dppm-qvxgwzap-node-35", failed to process kusciaTask "dppm-qvxgwzap-node-35", failed to build domain bob kit info, failed to get appImage "secretflow-image" from cache, appimage.kuscia.secretflow "secretflow-image" not found, retry

2024-09-12 18:30:34.423 WARN kusciatask/controller.go:424 Error handling "dppm-qvxgwzap-node-35", re-queuing

2024-09-12 18:30:34.424 ERROR kusciatask/controller.go:435 Failed to process object: error handling "dppm-qvxgwzap-node-35", failed to process kusciaTask "dppm-qvxgwzap-node-35", failed to build domain bob kit info, failed to get appImage "secretflow-image" from cache, appimage.kuscia.secretflow "secretflow-image" not found, retry

2024-09-12 18:30:34.472 INFO resources/kusciatask.go:69 Start updating kuscia task "dppm-qvxgwzap-node-35" status

2024-09-12 18:30:34.488 INFO resources/kusciatask.go:71 Finish updating kuscia task "dppm-qvxgwzap-node-35" status

2024-09-12 18:30:34.488 INFO kusciatask/controller.go:521 Finished syncing kusciatask "dppm-qvxgwzap-node-35" (24.193535ms)

2024-09-12 18:30:34.490 INFO handler/job_scheduler.go:661 jobStatusPhaseFrom readyTasks={}, tasks={{taskId=dppm-qvxgwzap-node-35, dependencies=[], tolerable=false, phase=Failed}}, kusciaJobId=dppm

2024-09-12 18:30:34.490 INFO handler/job_scheduler.go:679 jobStatusPhaseFrom failed readyTasks={}, tasks={{taskId=dppm-qvxgwzap-node-35, dependencies=[], tolerable=false, phase=Failed}}, kusciaJobId=dppm

2024-09-12 18:30:34.491 WARN handler/failed_handler.go:62 Get task resource group dppm-qvxgwzap-node-35 failed, skip setting its status to failed, taskresourcegroup.kuscia.secretflow "dppm-qvxgwzap-node-35" not found

2024-09-12 18:30:34.491 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status

2024-09-12 18:30:34.491 INFO resources/kusciatask.go:69 Start updating kuscia task "dppm-qvxgwzap-node-35" status

2024-09-12 18:30:34.505 INFO resources/kusciajob.go:118 Finish updating kuscia job "dppm" status

2024-09-12 18:30:34.505 INFO kusciajob/controller.go:298 Finished syncing KusciaJob "dppm" (14.950352ms)

2024-09-12 18:30:34.505 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (14.972553ms)

2024-09-12 18:30:34.510 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status

2024-09-12 18:30:34.510 INFO resources/kusciatask.go:71 Finish updating kuscia task "dppm-qvxgwzap-node-35" status

2024-09-12 18:30:34.510 INFO kusciatask/controller.go:521 Finished syncing kusciatask "dppm-qvxgwzap-node-35" (19.491329ms)

2024-09-12 18:30:34.510 INFO kusciatask/controller.go:489 KusciaTask "dppm-qvxgwzap-node-35" was finished, skipping

2024-09-12 18:30:34.523 INFO resources/kusciajob.go:118 Finish updating kuscia job "dppm" status

2024-09-12 18:30:34.523 INFO kusciajob/controller.go:298 Finished syncing KusciaJob "dppm" (13.33302ms)

2024-09-12 18:30:34.523 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (13.376915ms)

2024-09-12 18:30:34.523 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status

2024-09-12 18:30:34.534 WARN resources/kusciajob.go:122 Failed to update kuscia job "dppm" status, Operation cannot be fulfilled on kusciajobs.kuscia.secretflow "dppm": the object has been modified; please apply your changes to the latest version and try again

2024-09-12 18:30:34.542 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status

2024-09-12 18:30:34.554 INFO resources/kusciajob.go:118 Finish updating kuscia job "dppm" status

2024-09-12 18:30:34.555 INFO kusciajob/controller.go:298 Finished syncing KusciaJob "dppm" (31.853225ms)

2024-09-12 18:30:34.555 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (31.901265ms)

2024-09-12 18:30:34.555 INFO handler/job_scheduler.go:700 KusciaJob dppm was finished, skipping

2024-09-12 18:30:34.555 INFO kusciajob/controller.go:266 KusciaJob "dppm" should not reconcile again, skipping

2024-09-12 18:30:34.555 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (111.519µs)
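
With this many repeated lines, it can help to collapse the log to its distinct WARN and ERROR messages. A minimal Python sketch, assuming the `date time LEVEL file:line message` layout seen in the log above; the function name `distinct_problems` is hypothetical:

```python
import re
from collections import Counter

# Layout assumed from the log above: "<date> <time> <LEVEL> <file>:<line> <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<level>[A-Z]+) (?P<src>\S+:\d+) (?P<msg>.*)$"
)


def distinct_problems(log_text, levels=("WARN", "ERROR")):
    """Count each distinct (level, source, message) among the selected levels."""
    counts = Counter()
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and m["level"] in levels:
            counts[(m["level"], m["src"], m["msg"])] += 1
    return counts


sample = """\
2024-09-12 18:30:34.357 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status
2024-09-12 18:30:34.369 WARN kusciatask/controller.go:424 Error handling "dppm-qvxgwzap-node-35", re-queuing
2024-09-12 18:30:34.383 WARN kusciatask/controller.go:424 Error handling "dppm-qvxgwzap-node-35", re-queuing
"""

problems = distinct_problems(sample)
for (level, src, msg), n in problems.most_common():
    print(f"{n}x {level} {src} {msg}")
```

Run against the full log, this would reduce the output to a handful of distinct messages, with the missing AppImage error standing out as the one that repeats.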

Meng-xiangkun avatar Sep 13 '24 01:09 Meng-xiangkun

The error log shows that the image "secretflow-image" is missing. You can run kuscia get appimage to check whether it exists; if it does exist, please also provide the pod engine logs.

lanyy9527 avatar Sep 13 '24 01:09 lanyy9527

[image] Is this what you meant?

Meng-xiangkun avatar Sep 13 '24 02:09 Meng-xiangkun

You can add your namespace (-n name), or use -A to list everything.

lanyy9527 avatar Sep 13 '24 02:09 lanyy9527

[image] Still the same result.

Meng-xiangkun avatar Sep 13 '24 02:09 Meng-xiangkun

Details of the job's task:

sh-4.4# kubectl get kt jaqj-qvxgwzap-node-35 -n cross-domain -o yaml
apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaTask
metadata:
  creationTimestamp: "2024-09-12T10:49:29Z"
  generation: 1
  labels:
    kuscia.secretflow/controller: kuscia-job
    kuscia.secretflow/job-id: jaqj
    kuscia.secretflow/self-cluster-as-initiator: "true"
    kuscia.secretflow/task-alias: jaqj-qvxgwzap-node-35
  name: jaqj-qvxgwzap-node-35
  ownerReferences:
  - apiVersion: kuscia.secretflow/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: KusciaJob
    name: jaqj
    uid: 9a2a5920-c23d-409d-afdc-14d82e5e53e4
  resourceVersion: "14340"
  uid: 73a41e0d-4b9d-4d03-b5eb-261efb760b15
spec:
  initiator: bob
  parties:
  - appImageRef: secretflow-image
    domainID: bob
    template:
      spec: {}
  - appImageRef: secretflow-image
    domainID: alice
    template:
      spec: {}
  scheduleConfig: {}
  taskInputConfig: |-
    {
      "sf_datasource_config": {
        "bob": {
          "id": "default-data-source"
        },
        "alice": {
          "id": "default-data-source"
        }
      },
      "sf_cluster_desc": {
        "parties": ["bob", "alice"],
        "devices": [{
          "name": "spu",
          "type": "spu",
          "parties": ["bob", "alice"],
          "config": "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
        }, {
          "name": "heu",
          "type": "heu",
          "parties": ["bob", "alice"],
          "config": "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
        }],
        "ray_fed_config": {
          "cross_silo_comm_backend": "brpc_link"
        }
      },
      "sf_node_eval_param": {
        "domain": "data_prep",
        "name": "psi",
        "version": "0.0.5",
        "attr_paths": ["input/receiver_input/key", "input/sender_input/key", "protocol", "sort_result", "allow_duplicate_keys", "allow_duplicate_keys/no/skip_duplicates_check", "fill_value_int", "ecdh_curve"],
        "attrs": [{
          "is_na": false,
          "ss": ["id1"]
        }, {
          "is_na": false,
          "ss": ["id2"]
        }, {
          "is_na": false,
          "s": "PROTOCOL_RR22"
        }, {
          "b": true,
          "is_na": false
        }, {
          "is_na": false,
          "s": "no"
        }, {
          "is_na": true
        }, {
          "is_na": true
        }, {
          "is_na": false,
          "s": "CURVE_FOURQ"
        }],
        "inputs": [{
          "type": "sf.table.individual",
          "meta": {
            "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
            "line_count": "-1"
          },
          "data_refs": [{
            "uri": "alice.csv",
            "party": "alice",
            "format": "csv"
          }]
        }, {
          "type": "sf.table.individual",
          "meta": {
            "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
            "line_count": "-1"
          },
          "data_refs": [{
            "uri": "bob.csv",
            "party": "bob",
            "format": "csv"
          }]
        }],
        "checkpoint_uri": "ckjaqj-qvxgwzap-node-35-output-0"
      },
      "sf_output_uris": ["jaqj-qvxgwzap-node-35-output-0"],
      "sf_input_ids": ["alice-table", "bob-table"],
      "sf_output_ids": ["jaqj-qvxgwzap-node-35-output-0"]
    }
status:
  completionTime: "2024-09-12T10:49:29Z"
  conditions:
  - lastTransitionTime: "2024-09-12T10:49:29Z"
    message: Failed to create kusciaTask related resources, failed to build domain
      bob kit info, failed to get appImage "secretflow-image" from cache, appimage.kuscia.secretflow
      "secretflow-image" not found
    reason: KusciaTaskCreateFailed
    status: "False"
    type: ResourceCreated
  lastReconcileTime: "2024-09-12T10:49:29Z"
  message: 'KusciaTask failed after 3x retry, last error: failed to build domain bob
    kit info, failed to get appImage "secretflow-image" from cache, appimage.kuscia.secretflow
    "secretflow-image" not found'
  phase: Failed
  startTime: "2024-09-12T10:49:29Z"
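
The status above boils down to one requirement: every `appImageRef` under `spec.parties` must name an AppImage that already exists. A minimal Python sketch of that check, assuming the task has been loaded into a dict and the set of existing AppImage names has been collected separately (both are hard-coded here for illustration; `missing_app_images` is a hypothetical helper, not a Kuscia API):

```python
def missing_app_images(task, existing_app_images):
    """Return {domainID: appImageRef} for parties whose AppImage is not registered."""
    missing = {}
    for party in task.get("spec", {}).get("parties", []):
        ref = party.get("appImageRef")
        if ref not in existing_app_images:
            missing[party.get("domainID")] = ref
    return missing


# Trimmed copy of the KusciaTask spec shown above.
task = {
    "spec": {
        "initiator": "bob",
        "parties": [
            {"appImageRef": "secretflow-image", "domainID": "bob"},
            {"appImageRef": "secretflow-image", "domainID": "alice"},
        ],
    }
}

# Assumption: no AppImage has been created yet, matching the "not found" error.
print(missing_app_images(task, existing_app_images=set()))
# Once the AppImage exists, the check comes back empty.
print(missing_app_images(task, existing_app_images={"secretflow-image"}))
```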

Meng-xiangkun avatar Sep 13 '24 02:09 Meng-xiangkun

Please re-check the node deployment steps; the AppImage has to be created manually: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.11.0b0/deployment/K8s_deployment_kuscia/K8s_master_lite_cn#appimage

yushiqie avatar Sep 13 '24 02:09 yushiqie

[image] The file does not exist. Where should this file be placed?

Meng-xiangkun avatar Sep 13 '24 02:09 Meng-xiangkun

[image] The file does not exist. Where should this file be placed?

Meng-xiangkun avatar Sep 13 '24 03:09 Meng-xiangkun

Check whether the AppImage.yaml file exists in the current directory.

lanyy9527 avatar Sep 13 '24 05:09 lanyy9527

The file is not there. Should I upload a copy, and if so, to which location?

Meng-xiangkun avatar Sep 13 '24 06:09 Meng-xiangkun

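The problem above is resolved. Now the task stays Pending because the image secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8 cannot be obtained. Is there any way to use this image other than pulling it from secretflow-registry.cn-hangzhou.cr.aliyuncs.com? The cluster environment does not allow pulling external images.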

sh-4.4# kubectl get kt -n cross-domain
NAME                    STARTTIME   COMPLETIONTIME   LASTRECONCILETIME   PHASE
fpvu-alice              3m30s       3m30s            3m30s               Failed
gere-bob                3m30s       3m30s            3m30s               Failed
alzf-qvxgwzap-node-35   2m36s                        2m18s               Pending
sh-4.4# kubectl get kt alzf-qvxgwzap-node-35 -n cross-domain -o yaml
apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaTask
metadata:
  creationTimestamp: "2024-09-13T06:27:39Z"
  generation: 1
  labels:
    kuscia.secretflow/controller: kuscia-job
    kuscia.secretflow/job-id: alzf
    kuscia.secretflow/self-cluster-as-initiator: "true"
    kuscia.secretflow/task-alias: alzf-qvxgwzap-node-35
  name: alzf-qvxgwzap-node-35
  ownerReferences:
  - apiVersion: kuscia.secretflow/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: KusciaJob
    name: alzf
    uid: 1c3bf688-1a1d-4ba1-98dc-9239ec113ebd
  resourceVersion: "2736"
  uid: 7a1f8356-82da-44f4-8b10-cab10b0a87be
spec:
  initiator: bob
  parties:
  - appImageRef: secretflow-image
    domainID: bob
    template:
      spec: {}
  - appImageRef: secretflow-image
    domainID: alice
    template:
      spec: {}
  scheduleConfig: {}
  taskInputConfig: |-
    {
      "sf_datasource_config": {
        "bob": {
          "id": "default-data-source"
        },
        "alice": {
          "id": "default-data-source"
        }
      },
      "sf_cluster_desc": {
        "parties": ["bob", "alice"],
        "devices": [{
          "name": "spu",
          "type": "spu",
          "parties": ["bob", "alice"],
          "config": "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
        }, {
          "name": "heu",
          "type": "heu",
          "parties": ["bob", "alice"],
          "config": "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
        }],
        "ray_fed_config": {
          "cross_silo_comm_backend": "brpc_link"
        }
      },
      "sf_node_eval_param": {
        "domain": "data_prep",
        "name": "psi",
        "version": "0.0.5",
        "attr_paths": ["input/receiver_input/key", "input/sender_input/key", "protocol", "sort_result", "allow_duplicate_keys", "allow_duplicate_keys/no/skip_duplicates_check", "fill_value_int", "ecdh_curve"],
        "attrs": [{
          "is_na": false,
          "ss": ["id1"]
        }, {
          "is_na": false,
          "ss": ["id2"]
        }, {
          "is_na": false,
          "s": "PROTOCOL_RR22"
        }, {
          "b": true,
          "is_na": false
        }, {
          "is_na": false,
          "s": "no"
        }, {
          "is_na": true
        }, {
          "is_na": true
        }, {
          "is_na": false,
          "s": "CURVE_FOURQ"
        }],
        "inputs": [{
          "type": "sf.table.individual",
          "meta": {
            "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
            "line_count": "-1"
          },
          "data_refs": [{
            "uri": "alice.csv",
            "party": "alice",
            "format": "csv"
          }]
        }, {
          "type": "sf.table.individual",
          "meta": {
            "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
            "line_count": "-1"
          },
          "data_refs": [{
            "uri": "bob.csv",
            "party": "bob",
            "format": "csv"
          }]
        }],
        "checkpoint_uri": "ckalzf-qvxgwzap-node-35-output-0"
      },
      "sf_output_uris": ["alzf-qvxgwzap-node-35-output-0"],
      "sf_input_ids": ["alice-table", "bob-table"],
      "sf_output_ids": ["alzf-qvxgwzap-node-35-output-0"]
    }
status:
  allocatedPorts:
  - domainID: alice
    namedPort:
      alzf-qvxgwzap-node-35-0/client-server: 31454
      alzf-qvxgwzap-node-35-0/fed: 31450
      alzf-qvxgwzap-node-35-0/global: 31451
      alzf-qvxgwzap-node-35-0/node-manager: 31452
      alzf-qvxgwzap-node-35-0/object-manager: 31453
      alzf-qvxgwzap-node-35-0/spu: 31449
  - domainID: bob
    namedPort:
      alzf-qvxgwzap-node-35-0/client-server: 32739
      alzf-qvxgwzap-node-35-0/fed: 32741
      alzf-qvxgwzap-node-35-0/global: 32742
      alzf-qvxgwzap-node-35-0/node-manager: 32737
      alzf-qvxgwzap-node-35-0/object-manager: 32738
      alzf-qvxgwzap-node-35-0/spu: 32740
  conditions:
  - lastTransitionTime: "2024-09-13T06:27:39Z"
    status: "True"
    type: ResourceCreated
  lastReconcileTime: "2024-09-13T06:27:57Z"
  phase: Pending
  podStatuses:
    alice/alzf-qvxgwzap-node-35-0:
      createTime: "2024-09-13T06:27:39Z"
      message: 'container[secretflow] waiting state reason: "ImageInspectError", message:
        "Failed to inspect image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\":
        failed to get image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\"
        manifest, detail-> image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\"
        not exist in local repository"'
      namespace: alice
      nodeName: kuscia-lite-alice-9b7cdf6fd-l8dt5
      podName: alzf-qvxgwzap-node-35-0
      podPhase: Pending
      reason: ImageInspectError
      startTime: "2024-09-13T06:27:41Z"
    bob/alzf-qvxgwzap-node-35-0:
      createTime: "2024-09-13T06:27:39Z"
      message: 'container[secretflow] waiting state reason: "ImageInspectError", message:
        "Failed to inspect image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\":
        failed to get image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\"
        manifest, detail-> image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\"
        not exist in local repository"'
      namespace: bob
      nodeName: kuscia-lite-bob-7df5b89f5-vcrl9
      podName: alzf-qvxgwzap-node-35-0
      podPhase: Pending
      reason: ImageInspectError
      startTime: "2024-09-13T06:27:41Z"
  serviceStatuses:
    alice/alzf-qvxgwzap-node-35-0-fed:
      createTime: "2024-09-13T06:27:39Z"
      namespace: alice
      portName: fed
      portNumber: 31450
      readyTime: "2024-09-13T06:27:41Z"
      scope: Cluster
      serviceName: alzf-qvxgwzap-node-35-0-fed
    alice/alzf-qvxgwzap-node-35-0-global:
      createTime: "2024-09-13T06:27:39Z"
      namespace: alice
      portName: global
      portNumber: 31451
      readyTime: "2024-09-13T06:27:41Z"
      scope: Domain
      serviceName: alzf-qvxgwzap-node-35-0-global
    alice/alzf-qvxgwzap-node-35-0-spu:
      createTime: "2024-09-13T06:27:39Z"
      namespace: alice
      portName: spu
      portNumber: 31449
      readyTime: "2024-09-13T06:27:41Z"
      scope: Cluster
      serviceName: alzf-qvxgwzap-node-35-0-spu
    bob/alzf-qvxgwzap-node-35-0-fed:
      createTime: "2024-09-13T06:27:39Z"
      namespace: bob
      portName: fed
      portNumber: 32741
      readyTime: "2024-09-13T06:27:41Z"
      scope: Cluster
      serviceName: alzf-qvxgwzap-node-35-0-fed
    bob/alzf-qvxgwzap-node-35-0-global:
      createTime: "2024-09-13T06:27:39Z"
      namespace: bob
      portName: global
      portNumber: 32742
      readyTime: "2024-09-13T06:27:41Z"
      scope: Domain
      serviceName: alzf-qvxgwzap-node-35-0-global
    bob/alzf-qvxgwzap-node-35-0-spu:
      createTime: "2024-09-13T06:27:39Z"
      namespace: bob
      portName: spu
      portNumber: 32740
      readyTime: "2024-09-13T06:27:41Z"
      scope: Cluster
      serviceName: alzf-qvxgwzap-node-35-0-spu
  startTime: "2024-09-13T06:27:39Z"
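
Both pods are stuck for the same reason: the image is not in the node's local repository. A minimal Python sketch that pulls the waiting reason and the image name out of `status.podStatuses`, assuming the status has been loaded into a dict (trimmed and hard-coded here); `images_not_local` and `waiting_message` are hypothetical helpers:

```python
import re

# The image reference appears in (possibly backslash-escaped) quotes
# right after "Failed to inspect image".
IMAGE_RE = re.compile(r'Failed to inspect image \\?"([^"\\]+)\\?"')


def images_not_local(pod_statuses):
    """Map each pending pod to (reason, image) when the image could not be inspected."""
    result = {}
    for pod, st in pod_statuses.items():
        if st.get("podPhase") != "Pending":
            continue
        m = IMAGE_RE.search(st.get("message", ""))
        if m:
            result[pod] = (st.get("reason"), m.group(1))
    return result


def waiting_message(image):
    # Same shape as the message in the status above, shortened after the first clause.
    return ('container[secretflow] waiting state reason: "ImageInspectError", '
            'message: "Failed to inspect image \\"' + image + '\\": not exist in local repository"')


IMAGE = "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0"
pod_statuses = {
    "alice/alzf-qvxgwzap-node-35-0": {
        "podPhase": "Pending", "reason": "ImageInspectError", "message": waiting_message(IMAGE),
    },
    "bob/alzf-qvxgwzap-node-35-0": {
        "podPhase": "Pending", "reason": "ImageInspectError", "message": waiting_message(IMAGE),
    },
}

for pod, (reason, image) in images_not_local(pod_statuses).items():
    print(pod, reason, image)
```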

Meng-xiangkun avatar Sep 13 '24 06:09 Meng-xiangkun

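In kuscia 0.10.x, the runp container runtime does not support pulling task images dynamically. There are two options:

  1. Package kuscia and secretflow into a single image with docker build -f kuscia-secretflow.Dockerfile . (see kuscia-secretflow.Dockerfile)
  2. Upgrade kuscia to 0.11.x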

yushiqie avatar Sep 13 '24 07:09 yushiqie

ERROR: failed to solve: secretflow/anolis8-python:3.10.13: failed to resolve source metadata for docker.io/secretflow/anolis8-python:3.10.13: failed to do request: Head "https://registry-1.docker.io/v2/secretflow/anolis8-python/manifests/3.10.13": dial tcp 108.160.169.185:443: connect: connection refused

Is there another address this image can be pulled from?

Meng-xiangkun avatar Sep 13 '24 07:09 Meng-xiangkun

You can use secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/anolis8-python:3.10.13
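
A minimal Python sketch of this substitution: an image reference that would resolve to Docker Hub is prefixed with the mirror registry host instead. The helper name `to_mirror` is hypothetical, and the rule for detecting an explicit registry host (the first path component contains a `.` or `:`, or is `localhost`) follows Docker's usual convention:

```python
MIRROR = "secretflow-registry.cn-hangzhou.cr.aliyuncs.com"


def to_mirror(image_ref, mirror=MIRROR):
    """Point a Docker Hub style reference at the mirror registry instead."""
    if image_ref.startswith("docker.io/"):
        image_ref = image_ref[len("docker.io/"):]
    first, _, rest = image_ref.partition("/")
    # A first component with "." or ":" (or "localhost") is already a registry host.
    if rest and ("." in first or ":" in first or first == "localhost"):
        return image_ref
    return f"{mirror}/{image_ref}"


print(to_mirror("secretflow/anolis8-python:3.10.13"))
print(to_mirror("docker.io/secretflow/anolis8-python:3.10.13"))
```

Since the failure occurred while docker build was resolving the base image, the rewritten reference would presumably go into the Dockerfile's FROM line.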

yushiqie avatar Sep 13 '24 07:09 yushiqie

image https://github.com/secretflow/secretpad/issues/130

wangzul avatar Sep 13 '24 09:09 wangzul

I used the image that packages kuscia and secretflow together, but I still get this error: "Failed to inspect image "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0": failed to get image "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0" manifest, detail-> image "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0" not exist in local repository". Does some configuration need to be changed so that the image can be found?

Meng-xiangkun avatar Sep 13 '24 09:09 Meng-xiangkun

I configured it as shown in the image above (secretflow/secretpad#130).

Meng-xiangkun avatar Sep 13 '24 09:09 Meng-xiangkun

Check which secretflow version the Dockerfile imports by default: https://github.com/secretflow/kuscia/blob/release/0.10.x/build/dockerfile/kuscia-secretflow.Dockerfile#L15
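
The point of this check is presumably that the secretflow image bundled into the combined image has to match the reference the task asks for (tag 1.7.0b0 in the error above). A minimal Python sketch of that comparison; `split_ref` and `same_image` are hypothetical helpers, and the mismatching tag in the last call is made up for illustration, not the Dockerfile's actual default:

```python
def split_ref(image_ref):
    """Split 'repo[:tag]' into (repo, tag); a ':' in a registry host:port is not a tag."""
    last_slash = image_ref.rfind("/")
    colon = image_ref.rfind(":")
    if colon > last_slash:
        return image_ref[:colon], image_ref[colon + 1:]
    return image_ref, "latest"  # Docker's default tag when none is given


def same_image(required, bundled):
    """True when repository and tag both match exactly."""
    return split_ref(required) == split_ref(bundled)


required = "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0"
print(split_ref(required))
print(same_image(required, required))
# Hypothetical mismatch: same repository, different tag.
print(same_image(required, required.rsplit(":", 1)[0] + ":0.0.0-example"))
```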

yushiqie avatar Sep 13 '24 09:09 yushiqie

The image problem is solved. Now the PSI job fails when it runs on locally uploaded datasets:

apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaTask
metadata:
  annotations:
    kuscia.secretflow/job-id: gsid
    kuscia.secretflow/self-cluster-as-participant: "true"
    kuscia.secretflow/task-alias: gsid-dwdkvwbe-node-35
  creationTimestamp: "2024-09-14T02:37:41Z"
  generation: 1
  labels:
    kuscia.secretflow/controller: kuscia-job
    kuscia.secretflow/job-uid: 25d3045f-2277-41d3-8cb6-eeb23747073b
  name: gsid-dwdkvwbe-node-35
  namespace: cross-domain
  ownerReferences:
  - apiVersion: kuscia.secretflow/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: KusciaJob
    name: gsid
    uid: 25d3045f-2277-41d3-8cb6-eeb23747073b
  resourceVersion: "12285"
  uid: 3f11ec51-7e6c-4928-89f6-b16374ef50b5
spec:
  initiator: bob
  parties:
  - appImageRef: secretflow-image
    domainID: bob
    template:
      spec: {}
  - appImageRef: secretflow-image
    domainID: alice
    template:
      spec: {}
  scheduleConfig: {}
  taskInputConfig: |-
    {
      "sf_datasource_config": {
        "bob": {
          "id": "default-data-source"
        },
        "alice": {
          "id": "default-data-source"
        }
      },
      "sf_cluster_desc": {
        "parties": ["bob", "alice"],
        "devices": [{
          "name": "spu",
          "type": "spu",
          "parties": ["bob", "alice"],
          "config": "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
        }, {
          "name": "heu",
          "type": "heu",
          "parties": ["bob", "alice"],
          "config": "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
        }],
        "ray_fed_config": {
          "cross_silo_comm_backend": "brpc_link"
        }
      },
      "sf_node_eval_param": {
        "domain": "data_prep",
        "name": "psi",
        "version": "0.0.5",
        "attr_paths": ["input/receiver_input/key", "input/sender_input/key", "protocol", "sort_result", "allow_duplicate_keys", "allow_duplicate_keys/no/skip_duplicates_check", "fill_value_int", "ecdh_curve"],
        "attrs": [{
          "is_na": false,
          "ss": ["id"]
        }, {
          "is_na": false,
          "ss": ["id"]
        }, {
          "is_na": false,
          "s": "PROTOCOL_RR22"
        }, {
          "b": true,
          "is_na": false
        }, {
          "is_na": false,
          "s": "no"
        }, {
          "is_na": true
        }, {
          "is_na": true
        }, {
          "is_na": false,
          "s": "CURVE_FOURQ"
        }],
        "inputs": [{
          "type": "sf.table.individual",
          "meta": {
            "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
            "line_count": "-1"
          },
          "data_refs": [{
            "uri": "alice1_1010363635.csv",
            "party": "alice",
            "format": "csv"
          }]
        }, {
          "type": "sf.table.individual",
          "meta": {
            "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
            "line_count": "-1"
          },
          "data_refs": [{
            "uri": "bob1_1907238687.csv",
            "party": "bob",
            "format": "csv"
          }]
        }],
        "checkpoint_uri": "ckgsid-dwdkvwbe-node-35-output-0"
      },
      "sf_output_uris": ["gsid-dwdkvwbe-node-35-output-0"],
      "sf_input_ids": ["astrqxxq", "yxcxhdat"],
      "sf_output_ids": ["gsid-dwdkvwbe-node-35-output-0"]
    }
status:
  allocatedPorts:
  - domainID: bob
    namedPort:
      gsid-dwdkvwbe-node-35-0/client-server: 20393
      gsid-dwdkvwbe-node-35-0/fed: 20395
      gsid-dwdkvwbe-node-35-0/global: 20390
      gsid-dwdkvwbe-node-35-0/node-manager: 20391
      gsid-dwdkvwbe-node-35-0/object-manager: 20392
      gsid-dwdkvwbe-node-35-0/spu: 20394
  - domainID: alice
    namedPort:
      gsid-dwdkvwbe-node-35-0/client-server: 21057
      gsid-dwdkvwbe-node-35-0/fed: 21059
      gsid-dwdkvwbe-node-35-0/global: 21054
      gsid-dwdkvwbe-node-35-0/node-manager: 21055
      gsid-dwdkvwbe-node-35-0/object-manager: 21056
      gsid-dwdkvwbe-node-35-0/spu: 21058
  completionTime: "2024-09-14T02:37:57Z"
  conditions:
  - lastTransitionTime: "2024-09-14T02:37:41Z"
    status: "True"
    type: ResourceCreated
  - lastTransitionTime: "2024-09-14T02:37:43Z"
    status: "True"
    type: Running
  - lastTransitionTime: "2024-09-14T02:37:57Z"
    status: "False"
    type: Success
  lastReconcileTime: "2024-09-14T02:37:57Z"
  message: The remaining no-failed party task counts 1 are less than the threshold
    2 that meets the conditions for task success. pending party[], running party[alice],
    successful party[], failed party[bob]
  partyTaskStatus:
  - domainID: bob
    phase: Failed
  - domainID: alice
    phase: Failed
  phase: Failed
  podStatuses:
    alice/gsid-dwdkvwbe-node-35-0:
      createTime: "2024-09-14T02:37:41Z"
      namespace: alice
      nodeName: kuscia-lite-alice-784b59647f-55mdx
      podName: gsid-dwdkvwbe-node-35-0
      podPhase: Failed
      readyTime: "2024-09-14T02:37:44Z"
      startTime: "2024-09-14T02:37:43Z"
    bob/gsid-dwdkvwbe-node-35-0:
      createTime: "2024-09-14T02:37:41Z"
      namespace: bob
      nodeName: kuscia-lite-bob-6d7d6c998f-zhtll
      podName: gsid-dwdkvwbe-node-35-0
      podPhase: Failed
      readyTime: "2024-09-14T02:37:43Z"
      reason: Error
      startTime: "2024-09-14T02:37:43Z"
      terminationLog: 'container[secretflow] terminated state reason "Error", message:
        "... Ignore 12413 characters at the beginning ...\ning_failure'': True}\n\x1b[36m(SenderReceiverProxyActor
        pid=9199)\x1b[0m I0914 10:37:52.646880  9199 external/com_github_brpc_brpc/src/brpc/server.cpp:1181]
        Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on
        port=20395.\n\x1b[36m(SenderReceiverProxyActor pid=9199)\x1b[0m W0914 10:37:52.646909  9199
        external/com_github_brpc_brpc/src/brpc/server.cpp:1187] Builtin services are
        disabled according to ServerOptions.has_builtin_services\n\x1b[36m(SenderReceiverProxyActor
        pid=9199)\x1b[0m I0914 10:37:53.321158  9421 external/com_github_brpc_brpc/src/brpc/span.cpp:506]
        Opened ./rpc_data/rpcz/20240914.103753.9199/id.db and ./rpc_data/rpcz/20240914.103753.9199/time.db\n2024-09-14
        10:37:53.676 INFO barriers.py:465 [bob] -- [Anonymous_job] Succeeded to create
        receiver proxy actor.\n2024-09-14 10:37:53.676 INFO barriers.py:520 [bob]
        -- [Anonymous_job] Try ping [''alice''] at 0 attemp, up to 3600 attemps.\n2024-09-14
        10:37:53.685 WARNING psi.py:361 [bob] -- [Anonymous_job] {''cluster_def'':
        {''nodes'': [{''party'': ''bob'', ''address'': ''0.0.0.0:20394'', ''listen_address'':
        ''''}, {''party'': ''alice'', ''address'': ''http://gsid-dwdkvwbe-node-35-0-spu.alice.svc:80'',
        ''listen_address'': ''''}], ''runtime_config'': {''protocol'': 2, ''field'':
        3}}, ''link_desc'': {''connect_retry_times'': 60, ''connect_retry_interval_ms'':
        1000, ''brpc_channel_protocol'': ''http'', ''brpc_channel_connection_type'':
        ''pooled'', ''recv_timeout_ms'': 1200000, ''http_timeout_ms'': 1200000}}\n2024-09-14
        10:37:55.340 ERROR component.py:1130 [bob] -- [Anonymous_job] eval on domain:
        \"data_prep\"\nname: \"psi\"\nversion: \"0.0.5\"\nattr_paths: \"input/receiver_input/key\"\nattr_paths:
        \"input/sender_input/key\"\nattr_paths: \"protocol\"\nattr_paths: \"sort_result\"\nattr_paths:
        \"allow_duplicate_keys\"\nattr_paths: \"allow_duplicate_keys/no/skip_duplicates_check\"\nattr_paths:
        \"fill_value_int\"\nattr_paths: \"ecdh_curve\"\nattrs {\n  ss: \"id\"\n}\nattrs
        {\n  ss: \"id\"\n}\nattrs {\n  s: \"PROTOCOL_RR22\"\n}\nattrs {\n  b: true\n}\nattrs
        {\n  s: \"no\"\n}\nattrs {\n  is_na: true\n}\nattrs {\n  is_na: true\n}\nattrs
        {\n  s: \"CURVE_FOURQ\"\n}\ninputs {\n  name: \"alice1\"\n  type: \"sf.table.individual\"\n  meta
        {\n    type_url: \"type.googleapis.com/secretflow.spec.v1.IndividualTable\"\n    value:
        \"\\n\\t\\022\\002id*\\003int\\020\\377\\377\\377\\377\\377\\377\\377\\377\\377\\001\"\n  }\n  data_refs
        {\n    uri: \"alice1_1010363635.csv\"\n    party: \"alice\"\n    format: \"csv\"\n  }\n}\ninputs
        {\n  name: \"bob1\"\n  type: \"sf.table.individual\"\n  meta {\n    type_url:
        \"type.googleapis.com/secretflow.spec.v1.IndividualTable\"\n    value: \"\\n\\t\\022\\002id*\\003int\\020\\377\\377\\377\\377\\377\\377\\377\\377\\377\\001\"\n  }\n  data_refs
        {\n    uri: \"bob1_1907238687.csv\"\n    party: \"bob\"\n    format: \"csv\"\n  }\n}\noutput_uris:
        \"gsid-dwdkvwbe-node-35-output-0\"\ncheckpoint_uri: \"ckgsid-dwdkvwbe-node-35-output-0\"\n
        failed, error <\x1b[36mray::_run()\x1b[39m (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)\n  At
        least one of the input arguments for this task could not be computed:\nray.exceptions.RayTaskError:
        \x1b[36mray::_run()\x1b[39m (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py\",
        line 156, in _run\n    return fn(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py\",
        line 839, in download_file\n    comp_storage.download_file(uri, output_path)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py\",
        line 32, in download_file\n    impl.download_file(remote_fn, local_fn)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py\",
        line 171, in download_file\n    assert os.path.exists(full_remote_fn)\nAssertionError>\n2024-09-14
        10:37:55.341 INFO api.py:342 [bob] -- [Anonymous_job] Shutdowning rayfed intendedly...\n2024-09-14
        10:37:55.341 INFO api.py:356 [bob] -- [Anonymous_job] No wait for data sending.\n2024-09-14
        10:37:55.342 INFO message_queue.py:72 [bob] -- [Anonymous_job] Notify message
        polling thread[DataSendingQueueThread] to exit.\n2024-09-14 10:37:55.342 INFO
        message_queue.py:72 [bob] -- [Anonymous_job] Notify message polling thread[ErrorSendingQueueThread]
        to exit.\n2024-09-14 10:37:55.342 INFO api.py:384 [bob] -- [Anonymous_job]
        Shutdowned rayfed.\n\x1b[33m(raylet)\x1b[0m [2024-09-14 10:37:54,186 I 9422
        9422] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL
        to -1\x1b[32m [repeated 3x across cluster] (Ray deduplicates logs by default.
        Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication
        for more options.)\x1b[0m\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.10/runpy.py\",
        line 196, in _run_module_as_main\n    return _run_code(code, main_globals,
        None,\n  File \"/usr/local/lib/python3.10/runpy.py\", line 86, in _run_code\n    exec(code,
        run_globals)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py\",
        line 547, in <module>\n    main()\n  File \"/usr/local/lib/python3.10/site-packages/click/core.py\",
        line 1157, in __call__\n    return self.main(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/click/core.py\",
        line 1078, in main\n    rv = self.invoke(ctx)\n  File \"/usr/local/lib/python3.10/site-packages/click/core.py\",
        line 1434, in invoke\n    return ctx.invoke(self.callback, **ctx.params)\n  File
        \"/usr/local/lib/python3.10/site-packages/click/core.py\", line 783, in invoke\n    return
        __callback(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py\",
        line 527, in main\n    res = comp_eval(sf_node_eval_param, storage_config,
        sf_cluster_config)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/entry.py\",
        line 176, in comp_eval\n    res = comp.eval(\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/component.py\",
        line 1132, in eval\n    raise e from None\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/component.py\",
        line 1127, in eval\n    ret = self.__eval_callback(ctx=ctx, **kwargs)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/preprocessing/data_prep/psi.py\",
        line 371, in two_party_balanced_psi_eval_fn\n    download_files(ctx, uri,
        input_path)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py\",
        line 847, in download_files\n    wait(waits)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py\",
        line 213, in wait\n    reveal([o.device(lambda o: None)(o) for o in objs])\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py\", line
        162, in reveal\n    all_object = sfd.get(all_object_refs)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/distributed/primitive.py\",
        line 156, in get\n    return fed.get(object_refs)\n  File \"/usr/local/lib/python3.10/site-packages/fed/api.py\",
        line 621, in get\n    values = ray.get(ray_refs)\n  File \"/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py\",
        line 22, in auto_init_wrapper\n    return fn(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py\",
        line 103, in wrapper\n    return func(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/ray/_private/worker.py\",
        line 2624, in get\n    raise value.as_instanceof_cause()\nray.exceptions.RayTaskError(AssertionError):
        \x1b[36mray::_run()\x1b[39m (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)\n  At
        least one of the input arguments for this task could not be computed:\nray.exceptions.RayTaskError:
        \x1b[36mray::_run()\x1b[39m (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py\",
        line 156, in _run\n    return fn(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py\",
        line 839, in download_file\n    comp_storage.download_file(uri, output_path)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py\",
        line 32, in download_file\n    impl.download_file(remote_fn, local_fn)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py\",
        line 171, in download_file\n    assert os.path.exists(full_remote_fn)\nAssertionError\n"'
  serviceStatuses:
    alice/gsid-dwdkvwbe-node-35-0-fed:
      createTime: "2024-09-14T02:37:41Z"
      namespace: alice
      portName: fed
      portNumber: 21059
      readyTime: "2024-09-14T02:37:44Z"
      scope: Cluster
      serviceName: gsid-dwdkvwbe-node-35-0-fed
    alice/gsid-dwdkvwbe-node-35-0-global:
      createTime: "2024-09-14T02:37:41Z"
      namespace: alice
      portName: global
      portNumber: 21054
      readyTime: "2024-09-14T02:37:44Z"
      scope: Domain
      serviceName: gsid-dwdkvwbe-node-35-0-global
    alice/gsid-dwdkvwbe-node-35-0-spu:
      createTime: "2024-09-14T02:37:41Z"
      namespace: alice
      portName: spu
      portNumber: 21058
      readyTime: "2024-09-14T02:37:44Z"
      scope: Cluster
      serviceName: gsid-dwdkvwbe-node-35-0-spu
    bob/gsid-dwdkvwbe-node-35-0-fed:
      createTime: "2024-09-14T02:37:41Z"
      namespace: bob
      portName: fed
      portNumber: 20395
      readyTime: "2024-09-14T02:37:43Z"
      scope: Cluster
      serviceName: gsid-dwdkvwbe-node-35-0-fed
    bob/gsid-dwdkvwbe-node-35-0-global:
      createTime: "2024-09-14T02:37:41Z"
      namespace: bob
      portName: global
      portNumber: 20390
      readyTime: "2024-09-14T02:37:43Z"
      scope: Domain
      serviceName: gsid-dwdkvwbe-node-35-0-global
    bob/gsid-dwdkvwbe-node-35-0-spu:
      createTime: "2024-09-14T02:37:41Z"
      namespace: bob
      portName: spu
      portNumber: 20394
      readyTime: "2024-09-14T02:37:43Z"
      scope: Cluster
      serviceName: gsid-dwdkvwbe-node-35-0-spu
  startTime: "2024-09-14T02:37:41Z"
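
The `terminationLog` above is hard to read because it is stored as one escaped string with literal `\n` and `\x1b[...m` colour sequences. Below is a minimal sketch, not part of Kuscia or SecretFlow, that unescapes such a string and pulls out the last exception line. The helper names `clean_termination_log` and `last_exception_line` are hypothetical, and the unescaping is deliberately naive (it also rewrites doubly escaped sequences such as `\\n`).

```python
import re
from typing import Optional

# Matches ANSI colour/style sequences such as "\x1b[36m" and "\x1b[0m".
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")

# A line that starts with a (possibly dotted) name ending in Error/Exception.
EXC_RE = re.compile(r"^[A-Za-z_][\w.]*(Error|Exception)\b.*$")


def clean_termination_log(raw: str) -> str:
    """Turn literal "\\n" and "\\x1b" escapes into real characters, then drop ANSI colours."""
    text = raw.replace("\\n", "\n").replace("\\x1b", "\x1b")
    return ANSI_RE.sub("", text)


def last_exception_line(text: str) -> Optional[str]:
    """Return the last line that looks like 'SomeError' or 'SomeError: message'."""
    hits = [ln.strip() for ln in text.splitlines() if EXC_RE.match(ln.strip())]
    return hits[-1] if hits else None
```

Applied to the log above, the last matching line is the `AssertionError` raised by `assert os.path.exists(full_remote_fn)` in `storage_impl.py`, which suggests the input file was not found on bob's side.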

Meng-xiangkun avatar Sep 14 '24 02:09 Meng-xiangkun

Please follow this document and provide the pod logs from both parties: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.11.0b0/troubleshoot/run_job_failed#id6

wangzul avatar Sep 14 '24 02:09 wangzul

Please follow this document and provide the pod logs from both parties: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.11.0b0/troubleshoot/run_job_failed#id6

Pod log from the alice node:

WARNING:root:Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.
2024-09-14 10:37:47,052|alice|INFO|secretflow|entry.py:start_ray:59| ray_conf: RayConfig(ray_node_ip_address='gsid-dwdkvwbe-node-35-0-global.alice.svc', ray_node_manager_port=21055, ray_object_manager_port=21056, ray_client_server_port=21057, ray_worker_ports=[], ray_gcs_port=21054)
2024-09-14 10:37:47,058|alice|INFO|secretflow|entry.py:start_ray:67| Trying to start ray head node at gsid-dwdkvwbe-node-35-0-global.alice.svc, start command: ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=gsid-dwdkvwbe-node-35-0-global.alice.svc --port=21054 --node-manager-port=21055 --object-manager-port=21056 --ray-client-server-port=21057
2024-09-14 10:37:51,042|alice|INFO|secretflow|entry.py:start_ray:80| 2024-09-14 10:37:47,713    INFO usage_lib.py:423 -- Usage stats collection is disabled.
2024-09-14 10:37:47,713 INFO scripts.py:744 -- Local node IP: gsid-dwdkvwbe-node-35-0-global.alice.svc
2024-09-14 10:37:50,726 SUCC scripts.py:781 -- --------------------
2024-09-14 10:37:50,727 SUCC scripts.py:782 -- Ray runtime started.
2024-09-14 10:37:50,727 SUCC scripts.py:783 -- --------------------
2024-09-14 10:37:50,727 INFO scripts.py:785 -- Next steps
2024-09-14 10:37:50,727 INFO scripts.py:788 -- To add another node to this Ray cluster, run
2024-09-14 10:37:50,727 INFO scripts.py:791 --   ray start --address='gsid-dwdkvwbe-node-35-0-global.alice.svc:21054'
2024-09-14 10:37:50,727 INFO scripts.py:800 -- To connect to this Ray cluster:
2024-09-14 10:37:50,728 INFO scripts.py:802 -- import ray
2024-09-14 10:37:50,728 INFO scripts.py:803 -- ray.init(_node_ip_address='gsid-dwdkvwbe-node-35-0-global.alice.svc')
2024-09-14 10:37:50,728 INFO scripts.py:834 -- To terminate the Ray runtime, run
2024-09-14 10:37:50,728 INFO scripts.py:835 --   ray stop
2024-09-14 10:37:50,728 INFO scripts.py:838 -- To view the status of the cluster, use
2024-09-14 10:37:50,728 INFO scripts.py:839 --   ray status

2024-09-14 10:37:51,042|alice|INFO|secretflow|entry.py:start_ray:81| Succeeded to start ray head node at gsid-dwdkvwbe-node-35-0-global.alice.svc.
2024-09-14 10:37:51,047|alice|INFO|secretflow|entry.py:main:510| datasource.access_directly True
sf_node_eval_param  {
  "domain": "data_prep",
  "name": "psi",
  "version": "0.0.5",
  "attrPaths": [
    "input/receiver_input/key",
    "input/sender_input/key",
    "protocol",
    "sort_result",
    "allow_duplicate_keys",
    "allow_duplicate_keys/no/skip_duplicates_check",
    "fill_value_int",
    "ecdh_curve"
  ],
  "attrs": [
    {
      "ss": [
        "id"
      ]
    },
    {
      "ss": [
        "id"
      ]
    },
    {
      "s": "PROTOCOL_RR22"
    },
    {
      "b": true
    },
    {
      "s": "no"
    },
    {
      "isNa": true
    },
    {
      "isNa": true
    },
    {
      "s": "CURVE_FOURQ"
    }
  ],
  "inputs": [
    {
      "type": "sf.table.individual",
      "meta": {
        "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
        "lineCount": "-1"
      },
      "dataRefs": [
        {
          "uri": "alice1_1010363635.csv",
          "party": "alice",
          "format": "csv"
        }
      ]
    },
    {
      "type": "sf.table.individual",
      "meta": {
        "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
        "lineCount": "-1"
      },
      "dataRefs": [
        {
          "uri": "bob1_1907238687.csv",
          "party": "bob",
          "format": "csv"
        }
      ]
    }
  ],
  "checkpointUri": "ckgsid-dwdkvwbe-node-35-output-0"
}
2024-09-14 10:37:51,059|alice|WARNING|secretflow|meta_conversion.py:convert_domain_data_to_individual_table:29| kuscia adapter has to deduce dist data from domain data at this moment.
2024-09-14 10:37:51,059|alice|INFO|secretflow|entry.py:domaindata_id_to_dist_data:160| domaindata_id astrqxxq to
...........
name: "alice1"
type: "sf.table.individual"
meta {
  type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
  value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
}
data_refs {
  uri: "alice1_1010363635.csv"
  party: "alice"
  format: "csv"
}

....
2024-09-14 10:37:51,070|alice|WARNING|secretflow|meta_conversion.py:convert_domain_data_to_individual_table:29| kuscia adapter has to deduce dist data from domain data at this moment.
2024-09-14 10:37:51,070|alice|INFO|secretflow|entry.py:domaindata_id_to_dist_data:160| domaindata_id yxcxhdat to
...........
name: "bob1"
type: "sf.table.individual"
meta {
  type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
  value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
}
data_refs {
  uri: "bob1_1907238687.csv"
  party: "bob"
  format: "csv"
}

....
2024-09-14 10:37:51,071|alice|WARNING|secretflow|entry.py:comp_eval:169|
--
Secretflow 1.7.0b0
Build time (Jun 25 2024, 11:25:31) with commit id: d08547cb86d07d5515e8b997236fad81972cdef7
--

2024-09-14 10:37:51,071|alice|WARNING|secretflow|entry.py:comp_eval:170|
--
*param*

domain: "data_prep"
name: "psi"
version: "0.0.5"
attr_paths: "input/receiver_input/key"
attr_paths: "input/sender_input/key"
attr_paths: "protocol"
attr_paths: "sort_result"
attr_paths: "allow_duplicate_keys"
attr_paths: "allow_duplicate_keys/no/skip_duplicates_check"
attr_paths: "fill_value_int"
attr_paths: "ecdh_curve"
attrs {
  ss: "id"
}
attrs {
  ss: "id"
}
attrs {
  s: "PROTOCOL_RR22"
}
attrs {
  b: true
}
attrs {
  s: "no"
}
attrs {
  is_na: true
}
attrs {
  is_na: true
}
attrs {
  s: "CURVE_FOURQ"
}
inputs {
  name: "alice1"
  type: "sf.table.individual"
  meta {
    type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
    value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
  }
  data_refs {
    uri: "alice1_1010363635.csv"
    party: "alice"
    format: "csv"
  }
}
inputs {
  name: "bob1"
  type: "sf.table.individual"
  meta {
    type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
    value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
  }
  data_refs {
    uri: "bob1_1907238687.csv"
    party: "bob"
    format: "csv"
  }
}
output_uris: "gsid-dwdkvwbe-node-35-output-0"
checkpoint_uri: "ckgsid-dwdkvwbe-node-35-output-0"

--

2024-09-14 10:37:51,071|alice|WARNING|secretflow|entry.py:comp_eval:171|
--
*storage_config*

type: "local_fs"
local_fs {
  wd: "/home/kuscia/var/storage/data"
}

--
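
(Editorial note on the `storage_config` entry above.) With `type: "local_fs"`, the input files are expected to be on the pod's local filesystem. The sketch below assumes, without confirmation from this log, that the `local_fs` backend resolves each `data_refs.uri` by joining it onto `wd`. The function name `check_local_inputs` is hypothetical. Each party would run it inside its own pod with its own URI.

```python
import os
from typing import Dict, Iterable


def check_local_inputs(wd: str, uris: Iterable[str]) -> Dict[str, bool]:
    """Map each URI to whether wd/uri exists as a regular file.

    Assumption: the storage backend builds the path as os.path.join(wd, uri).
    """
    return {uri: os.path.isfile(os.path.join(wd, uri)) for uri in uris}


if __name__ == "__main__":
    # Values copied from the log above; adjust for the other party.
    wd = "/home/kuscia/var/storage/data"
    for uri, ok in check_local_inputs(wd, ["alice1_1010363635.csv"]).items():
        status = "OK" if ok else "MISSING"
        print(f"{status:8s}{os.path.join(wd, uri)}")
```

If a file is reported as missing, that would be consistent with the failing `os.path.exists` assertion in the traceback, although the exact path the component checks is not shown in this log.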

2024-09-14 10:37:51,071|alice|WARNING|secretflow|entry.py:comp_eval:172|
--
*cluster_config*

desc {
  parties: "bob"
  parties: "alice"
  devices {
    name: "spu"
    type: "spu"
    parties: "bob"
    parties: "alice"
    config: "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
  }
  devices {
    name: "heu"
    type: "heu"
    parties: "bob"
    parties: "alice"
    config: "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
  }
  ray_fed_config {
    cross_silo_comm_backend: "brpc_link"
  }
}
public_config {
  ray_fed_config {
    parties: "bob"
    parties: "alice"
    addresses: "gsid-dwdkvwbe-node-35-0-fed.bob.svc:80"
    addresses: "0.0.0.0:21059"
  }
  spu_configs {
    name: "spu"
    parties: "bob"
    parties: "alice"
    addresses: "http://gsid-dwdkvwbe-node-35-0-spu.bob.svc:80"
    addresses: "0.0.0.0:21058"
  }
}
private_config {
  self_party: "alice"
  ray_head_addr: "gsid-dwdkvwbe-node-35-0-global.alice.svc:21054"
}

--

2024-09-14 10:37:51,074|alice|WARNING|secretflow|driver.py:init:442| When connecting to an existing cluster, num_cpus must not be provided. Num_cpus is neglected at this moment.
2024-09-14 10:37:51,074 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: gsid-dwdkvwbe-node-35-0-global.alice.svc:21054...
2024-09-14 10:37:51,087|alice|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140509199005728 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/node_ip_address.json.lock
2024-09-14 10:37:51,087|alice|DEBUG|secretflow|_api.py:acquire:334| Lock 140509199005728 acquired on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/node_ip_address.json.lock
2024-09-14 10:37:51,087|alice|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140509199005728 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/node_ip_address.json.lock
2024-09-14 10:37:51,088|alice|DEBUG|secretflow|_api.py:release:367| Lock 140509199005728 released on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/node_ip_address.json.lock
2024-09-14 10:37:51,092|alice|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140509199005824 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,092|alice|DEBUG|secretflow|_api.py:acquire:334| Lock 140509199005824 acquired on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,093|alice|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140509199005824 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,093|alice|DEBUG|secretflow|_api.py:release:367| Lock 140509199005824 released on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,093|alice|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140509199005584 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,093|alice|DEBUG|secretflow|_api.py:acquire:334| Lock 140509199005584 acquired on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,093|alice|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140509199005584 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,094|alice|DEBUG|secretflow|_api.py:release:367| Lock 140509199005584 released on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,094|alice|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140509199005824 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,094|alice|DEBUG|secretflow|_api.py:acquire:334| Lock 140509199005824 acquired on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,094|alice|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140509199005824 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,094|alice|DEBUG|secretflow|_api.py:release:367| Lock 140509199005824 released on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,094|alice|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140509199005584 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,095|alice|DEBUG|secretflow|_api.py:acquire:334| Lock 140509199005584 acquired on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,095|alice|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140509199005584 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,095|alice|DEBUG|secretflow|_api.py:release:367| Lock 140509199005584 released on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,095 INFO worker.py:1724 -- Connected to Ray cluster.
2024-09-14 10:37:51.870 INFO api.py:233 [alice] -- [Anonymous_job] Started rayfed with {'CLUSTER_ADDRESSES': {'bob': 'http://gsid-dwdkvwbe-node-35-0-fed.bob.svc:80', 'alice': '0.0.0.0:21059'}, 'CURRENT_PARTY_NAME': 'alice', 'TLS_CONFIG': {}}
(raylet) [2024-09-14 10:37:52,467 I 9291 9291] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
(SenderReceiverProxyActor pid=9291) 2024-09-14 10:37:53.277 INFO link.py:38 [alice] -- [Anonymous_job] brpc options: {'proxy_max_restarts': 3, 'timeout_in_ms': 300000, 'recv_timeout_ms': 604800000, 'connect_retry_times': 3600, 'connect_retry_interval_ms': 1000,'brpc_channel_protocol': 'http', 'brpc_channel_connection_type': 'pooled', 'exit_on_sending_failure': True}
(SenderReceiverProxyActor pid=9291) I0914 10:37:53.306789  9291 external/com_github_brpc_brpc/src/brpc/server.cpp:1181] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=21059.
(SenderReceiverProxyActor pid=9291) W0914 10:37:53.306837  9291 external/com_github_brpc_brpc/src/brpc/server.cpp:1187] Builtin services are disabled according to ServerOptions.has_builtin_services
2024-09-14 10:37:53.675 INFO barriers.py:465 [alice] -- [Anonymous_job] Succeeded to create receiver proxy actor.
2024-09-14 10:37:53.675 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 0 attemp, up to 3600 attemps.
2024-09-14 10:37:53.683 WARNING psi.py:361 [alice] -- [Anonymous_job] {'cluster_def': {'nodes': [{'party': 'bob', 'address': 'http://gsid-dwdkvwbe-node-35-0-spu.bob.svc:80', 'listen_address': ''}, {'party': 'alice', 'address': '0.0.0.0:21058', 'listen_address':''}], 'runtime_config': {'protocol': 2, 'field': 3}}, 'link_desc': {'connect_retry_times': 60, 'connect_retry_interval_ms': 1000, 'brpc_channel_protocol': 'http', 'brpc_channel_connection_type': 'pooled', 'recv_timeout_ms': 1200000, 'http_timeout_ms': 1200000}}
(SenderReceiverProxyActor pid=9291) I0914 10:37:53.680885  9513 external/com_github_brpc_brpc/src/brpc/span.cpp:506] Opened ./rpc_data/rpcz/20240914.103753.9291/id.db and ./rpc_data/rpcz/20240914.103753.9291/time.db
2024-09-14 10:37:55.665 ERROR component.py:1130 [alice] -- [Anonymous_job] eval on domain: "data_prep"
name: "psi"
version: "0.0.5"
attr_paths: "input/receiver_input/key"
attr_paths: "input/sender_input/key"
attr_paths: "protocol"
attr_paths: "sort_result"
attr_paths: "allow_duplicate_keys"
attr_paths: "allow_duplicate_keys/no/skip_duplicates_check"
attr_paths: "fill_value_int"
attr_paths: "ecdh_curve"
attrs {
  ss: "id"
}
attrs {
  ss: "id"
}
attrs {
  s: "PROTOCOL_RR22"
}
attrs {
  b: true
}
attrs {
  s: "no"
}
attrs {
  is_na: true
}
attrs {
  is_na: true
}
attrs {
  s: "CURVE_FOURQ"
}
inputs {
  name: "alice1"
  type: "sf.table.individual"
  meta {
    type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
    value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
  }
  data_refs {
    uri: "alice1_1010363635.csv"
    party: "alice"
    format: "csv"
  }
}
inputs {
  name: "bob1"
  type: "sf.table.individual"
  meta {
    type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
    value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
  }
  data_refs {
    uri: "bob1_1907238687.csv"
    party: "bob"
    format: "csv"
  }
}
output_uris: "gsid-dwdkvwbe-node-35-output-0"
checkpoint_uri: "ckgsid-dwdkvwbe-node-35-output-0"
 failed, error <ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
  File "/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py", line 156, in _run
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 839, in download_file
    comp_storage.download_file(uri, output_path)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py", line 32, in download_file
    impl.download_file(remote_fn, local_fn)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py", line 171, in download_file
    assert os.path.exists(full_remote_fn)
AssertionError>
2024-09-14 10:37:55.666 INFO api.py:342 [alice] -- [Anonymous_job] Shutdowning rayfed intendedly...
2024-09-14 10:37:55.666 INFO api.py:356 [alice] -- [Anonymous_job] No wait for data sending.
2024-09-14 10:37:55.668 INFO message_queue.py:72 [alice] -- [Anonymous_job] Notify message polling thread[DataSendingQueueThread] to exit.
2024-09-14 10:37:55.669 INFO message_queue.py:72 [alice] -- [Anonymous_job] Notify message polling thread[ErrorSendingQueueThread] to exit.
2024-09-14 10:37:55.669 INFO api.py:384 [alice] -- [Anonymous_job] Shutdowned rayfed.
2024-09-14 10:37:55.670 WARNING cleanup.py:154 [alice] -- [Anonymous_job] Failed to send ObjectRef(82891771158d68c1fcce2f44215c103cf6cd60270100000001000000) to bob, error: ray::SenderReceiverProxyActor.send() (pid=9291, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc, actor_id=fcce2f44215c103cf6cd602701000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fec182ddde0>)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
  File "/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py", line 156, in _run
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 839, in download_file
    comp_storage.download_file(uri, output_path)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py", line 32, in download_file
    impl.download_file(remote_fn, local_fn)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py", line 171, in download_file
    assert os.path.exists(full_remote_fn)
AssertionError,upstream_seq_id: 7#0, downstream_seq_id: 9.
2024-09-14 10:37:55.670 INFO cleanup.py:161 [alice] -- [Anonymous_job] Sending error  to bob.
Exception in thread DataSendingQueueThread:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/fed/cleanup.py", line 152, in _process_data_sending_task_return
    res = ray.get(obj_ref)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::SenderReceiverProxyActor.send() (pid=9291, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc, actor_id=fcce2f44215c103cf6cd602701000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fec182ddde0>)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
  File "/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py", line 156, in _run
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 839, in download_file
    comp_storage.download_file(uri, output_path)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py", line 32, in download_file
    impl.download_file(remote_fn, local_fn)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py", line 171, in download_file
    assert os.path.exists(full_remote_fn)
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/site-packages/fed/_private/message_queue.py", line 51, in _loop
    res = self._msg_handler(message)
  File "/usr/local/lib/python3.10/site-packages/fed/cleanup.py", line 47, in <lambda>
    lambda msg: self._process_data_sending_task_return(msg),
  File "/usr/local/lib/python3.10/site-packages/fed/cleanup.py", line 166, in _process_data_sending_task_return
    send(
  File "/usr/local/lib/python3.10/site-packages/fed/proxy/barriers.py", line 502, in send
    get_global_context().get_cleanup_manager().push_to_sending(
AttributeError: 'NoneType' object has no attribute 'get_cleanup_manager'
(raylet) [2024-09-14 10:37:54,180 I 9514 9514] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 547, in <module>
    main()
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 527, in main
    res = comp_eval(sf_node_eval_param, storage_config, sf_cluster_config)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/entry.py", line 176, in comp_eval
    res = comp.eval(
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/component.py", line 1132, in eval
    raise e from None
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/component.py", line 1127, in eval
    ret = self.__eval_callback(ctx=ctx, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/preprocessing/data_prep/psi.py", line 371, in two_party_balanced_psi_eval_fn
    download_files(ctx, uri, input_path)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 847, in download_files
    wait(waits)
  File "/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py", line 213, in wait
    reveal([o.device(lambda o: None)(o) for o in objs])
  File "/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py", line 162, in reveal
    all_object = sfd.get(all_object_refs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/distributed/primitive.py", line 156, in get
    return fed.get(object_refs)
  File "/usr/local/lib/python3.10/site-packages/fed/api.py", line 621, in get
    values = ray.get(ray_refs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
  File "/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py", line 156, in _run
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 839, in download_file
    comp_storage.download_file(uri, output_path)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py", line 32, in download_file
    impl.download_file(remote_fn, local_fn)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py", line 171, in download_file
    assert os.path.exists(full_remote_fn)
AssertionError

Meng-xiangkun avatar Sep 14 '24 03:09 Meng-xiangkun

Please follow this document and provide the pod logs from both parties: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.11.0b0/troubleshoot/run_job_failed#id6

Pod logs from the bob node:

WARNING:root:Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.
2024-09-14 10:37:46,688|bob|INFO|secretflow|entry.py:start_ray:59| ray_conf: RayConfig(ray_node_ip_address='gsid-dwdkvwbe-node-35-0-global.bob.svc', ray_node_manager_port=20391, ray_object_manager_port=20392, ray_client_server_port=20393, ray_worker_ports=[], ray_gcs_port=20390)
2024-09-14 10:37:46,694|bob|INFO|secretflow|entry.py:start_ray:67| Trying to start ray head node at gsid-dwdkvwbe-node-35-0-global.bob.svc, start command: ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=gsid-dwdkvwbe-node-35-0-global.bob.svc --port=20390 --node-manager-port=20391 --object-manager-port=20392 --ray-client-server-port=20393
2024-09-14 10:37:50,465|bob|INFO|secretflow|entry.py:start_ray:80| 2024-09-14 10:37:47,288      INFO usage_lib.py:423 -- Usage stats collection is disabled.
2024-09-14 10:37:47,288 INFO scripts.py:744 -- Local node IP: gsid-dwdkvwbe-node-35-0-global.bob.svc
2024-09-14 10:37:50,314 SUCC scripts.py:781 -- --------------------
2024-09-14 10:37:50,314 SUCC scripts.py:782 -- Ray runtime started.
2024-09-14 10:37:50,314 SUCC scripts.py:783 -- --------------------
2024-09-14 10:37:50,314 INFO scripts.py:785 -- Next steps
2024-09-14 10:37:50,315 INFO scripts.py:788 -- To add another node to this Ray cluster, run
2024-09-14 10:37:50,315 INFO scripts.py:791 --   ray start --address='gsid-dwdkvwbe-node-35-0-global.bob.svc:20390'
2024-09-14 10:37:50,315 INFO scripts.py:800 -- To connect to this Ray cluster:
2024-09-14 10:37:50,315 INFO scripts.py:802 -- import ray
2024-09-14 10:37:50,315 INFO scripts.py:803 -- ray.init(_node_ip_address='gsid-dwdkvwbe-node-35-0-global.bob.svc')
2024-09-14 10:37:50,315 INFO scripts.py:834 -- To terminate the Ray runtime, run
2024-09-14 10:37:50,315 INFO scripts.py:835 --   ray stop
2024-09-14 10:37:50,315 INFO scripts.py:838 -- To view the status of the cluster, use
2024-09-14 10:37:50,315 INFO scripts.py:839 --   ray status

2024-09-14 10:37:50,465|bob|INFO|secretflow|entry.py:start_ray:81| Succeeded to start ray head node at gsid-dwdkvwbe-node-35-0-global.bob.svc.
2024-09-14 10:37:50,470|bob|INFO|secretflow|entry.py:main:510| datasource.access_directly True
sf_node_eval_param  {
  "domain": "data_prep",
  "name": "psi",
  "version": "0.0.5",
  "attrPaths": [
    "input/receiver_input/key",
    "input/sender_input/key",
    "protocol",
    "sort_result",
    "allow_duplicate_keys",
    "allow_duplicate_keys/no/skip_duplicates_check",
    "fill_value_int",
    "ecdh_curve"
  ],
  "attrs": [
    {
      "ss": [
        "id"
      ]
    },
    {
      "ss": [
        "id"
      ]
    },
    {
      "s": "PROTOCOL_RR22"
    },
    {
      "b": true
    },
    {
      "s": "no"
    },
    {
      "isNa": true
    },
    {
      "isNa": true
    },
    {
      "s": "CURVE_FOURQ"
    }
  ],
  "inputs": [
    {
      "type": "sf.table.individual",
      "meta": {
        "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
        "lineCount": "-1"
      },
      "dataRefs": [
        {
          "uri": "alice1_1010363635.csv",
          "party": "alice",
          "format": "csv"
        }
      ]
    },
    {
      "type": "sf.table.individual",
      "meta": {
        "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
        "lineCount": "-1"
      },
      "dataRefs": [
        {
          "uri": "bob1_1907238687.csv",
          "party": "bob",
          "format": "csv"
        }
      ]
    }
  ],
  "checkpointUri": "ckgsid-dwdkvwbe-node-35-output-0"
}
2024-09-14 10:37:50,482|bob|WARNING|secretflow|meta_conversion.py:convert_domain_data_to_individual_table:29| kuscia adapter has to deduce dist data from domain data at this moment.
2024-09-14 10:37:50,482|bob|INFO|secretflow|entry.py:domaindata_id_to_dist_data:160| domaindata_id astrqxxq to
...........
name: "alice1"
type: "sf.table.individual"
meta {
  type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
  value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
}
data_refs {
  uri: "alice1_1010363635.csv"
  party: "alice"
  format: "csv"
}

....
2024-09-14 10:37:50,492|bob|WARNING|secretflow|meta_conversion.py:convert_domain_data_to_individual_table:29| kuscia adapter has to deduce dist data from domain data at this moment.
2024-09-14 10:37:50,492|bob|INFO|secretflow|entry.py:domaindata_id_to_dist_data:160| domaindata_id yxcxhdat to
...........
name: "bob1"
type: "sf.table.individual"
meta {
  type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
  value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
}
data_refs {
  uri: "bob1_1907238687.csv"
  party: "bob"
  format: "csv"
}

....
2024-09-14 10:37:50,492|bob|WARNING|secretflow|entry.py:comp_eval:169|
--
Secretflow 1.7.0b0
Build time (Jun 25 2024, 11:25:31) with commit id: d08547cb86d07d5515e8b997236fad81972cdef7
--

2024-09-14 10:37:50,493|bob|WARNING|secretflow|entry.py:comp_eval:170|
--
*param*

domain: "data_prep"
name: "psi"
version: "0.0.5"
attr_paths: "input/receiver_input/key"
attr_paths: "input/sender_input/key"
attr_paths: "protocol"
attr_paths: "sort_result"
attr_paths: "allow_duplicate_keys"
attr_paths: "allow_duplicate_keys/no/skip_duplicates_check"
attr_paths: "fill_value_int"
attr_paths: "ecdh_curve"
attrs {
  ss: "id"
}
attrs {
  ss: "id"
}
attrs {
  s: "PROTOCOL_RR22"
}
attrs {
  b: true
}
attrs {
  s: "no"
}
attrs {
  is_na: true
}
attrs {
  is_na: true
}
attrs {
  s: "CURVE_FOURQ"
}
inputs {
  name: "alice1"
  type: "sf.table.individual"
  meta {
    type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
    value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
  }
  data_refs {
    uri: "alice1_1010363635.csv"
    party: "alice"
    format: "csv"
  }
}
inputs {
  name: "bob1"
  type: "sf.table.individual"
  meta {
    type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
    value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
  }
  data_refs {
    uri: "bob1_1907238687.csv"
    party: "bob"
    format: "csv"
  }
}
output_uris: "gsid-dwdkvwbe-node-35-output-0"
checkpoint_uri: "ckgsid-dwdkvwbe-node-35-output-0"

--

2024-09-14 10:37:50,493|bob|WARNING|secretflow|entry.py:comp_eval:171|
--
*storage_config*

type: "local_fs"
local_fs {
  wd: "/home/kuscia/var/storage/data"
}

--

2024-09-14 10:37:50,493|bob|WARNING|secretflow|entry.py:comp_eval:172|
--
*cluster_config*

desc {
  parties: "bob"
  parties: "alice"
  devices {
    name: "spu"
    type: "spu"
    parties: "bob"
    parties: "alice"
    config: "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
  }
  devices {
    name: "heu"
    type: "heu"
    parties: "bob"
    parties: "alice"
    config: "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
  }
  ray_fed_config {
    cross_silo_comm_backend: "brpc_link"
  }
}
public_config {
  ray_fed_config {
    parties: "bob"
    parties: "alice"
    addresses: "0.0.0.0:20395"
    addresses: "gsid-dwdkvwbe-node-35-0-fed.alice.svc:80"
  }
  spu_configs {
    name: "spu"
    parties: "bob"
    parties: "alice"
    addresses: "0.0.0.0:20394"
    addresses: "http://gsid-dwdkvwbe-node-35-0-spu.alice.svc:80"
  }
}
private_config {
  self_party: "bob"
  ray_head_addr: "gsid-dwdkvwbe-node-35-0-global.bob.svc:20390"
}

--

2024-09-14 10:37:50,495|bob|WARNING|secretflow|driver.py:init:442| When connecting to an existing cluster, num_cpus must not be provided. Num_cpus is neglected at this moment.
2024-09-14 10:37:50,496 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: gsid-dwdkvwbe-node-35-0-global.bob.svc:20390...
2024-09-14 10:37:50,508|bob|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140454971734048 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/node_ip_address.json.lock
2024-09-14 10:37:50,509|bob|DEBUG|secretflow|_api.py:acquire:334| Lock 140454971734048 acquired on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/node_ip_address.json.lock
2024-09-14 10:37:50,509|bob|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140454971734048 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/node_ip_address.json.lock
2024-09-14 10:37:50,509|bob|DEBUG|secretflow|_api.py:release:367| Lock 140454971734048 released on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/node_ip_address.json.lock
2024-09-14 10:37:50,513|bob|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140454971734144 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,514|bob|DEBUG|secretflow|_api.py:acquire:334| Lock 140454971734144 acquired on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,514|bob|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140454971734144 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,514|bob|DEBUG|secretflow|_api.py:release:367| Lock 140454971734144 released on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,514|bob|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140454971733904 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,514|bob|DEBUG|secretflow|_api.py:acquire:334| Lock 140454971733904 acquired on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,515|bob|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140454971733904 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,515|bob|DEBUG|secretflow|_api.py:release:367| Lock 140454971733904 released on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,515|bob|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140454971734144 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,515|bob|DEBUG|secretflow|_api.py:acquire:334| Lock 140454971734144 acquired on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,515|bob|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140454971734144 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,515|bob|DEBUG|secretflow|_api.py:release:367| Lock 140454971734144 released on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,516|bob|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140454971733904 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,516|bob|DEBUG|secretflow|_api.py:acquire:334| Lock 140454971733904 acquired on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,516|bob|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140454971733904 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,516|bob|DEBUG|secretflow|_api.py:release:367| Lock 140454971733904 released on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,516 INFO worker.py:1724 -- Connected to Ray cluster.
2024-09-14 10:37:51.327 INFO api.py:233 [bob] -- [Anonymous_job] Started rayfed with {'CLUSTER_ADDRESSES': {'bob': '0.0.0.0:20395', 'alice': 'http://gsid-dwdkvwbe-node-35-0-fed.alice.svc:80'}, 'CURRENT_PARTY_NAME': 'bob', 'TLS_CONFIG': {}}
(raylet) [2024-09-14 10:37:51,273 I 7581 7581] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
(SenderReceiverProxyActor pid=9199) 2024-09-14 10:37:52.620 INFO link.py:38 [bob] -- [Anonymous_job] brpc options: {'proxy_max_restarts': 3, 'timeout_in_ms': 300000, 'recv_timeout_ms': 604800000, 'connect_retry_times': 3600, 'connect_retry_interval_ms': 1000, 'brpc_channel_protocol': 'http', 'brpc_channel_connection_type': 'pooled', 'exit_on_sending_failure': True}
(SenderReceiverProxyActor pid=9199) I0914 10:37:52.646880  9199 external/com_github_brpc_brpc/src/brpc/server.cpp:1181] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=20395.
(SenderReceiverProxyActor pid=9199) W0914 10:37:52.646909  9199 external/com_github_brpc_brpc/src/brpc/server.cpp:1187] Builtin services are disabled according to ServerOptions.has_builtin_services
(SenderReceiverProxyActor pid=9199) I0914 10:37:53.321158  9421 external/com_github_brpc_brpc/src/brpc/span.cpp:506] Opened ./rpc_data/rpcz/20240914.103753.9199/id.db and ./rpc_data/rpcz/20240914.103753.9199/time.db
2024-09-14 10:37:53.676 INFO barriers.py:465 [bob] -- [Anonymous_job] Succeeded to create receiver proxy actor.
2024-09-14 10:37:53.676 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 0 attemp, up to 3600 attemps.
2024-09-14 10:37:53.685 WARNING psi.py:361 [bob] -- [Anonymous_job] {'cluster_def': {'nodes': [{'party': 'bob', 'address': '0.0.0.0:20394', 'listen_address': ''}, {'party': 'alice', 'address': 'http://gsid-dwdkvwbe-node-35-0-spu.alice.svc:80', 'listen_address':''}], 'runtime_config': {'protocol': 2, 'field': 3}}, 'link_desc': {'connect_retry_times': 60, 'connect_retry_interval_ms': 1000, 'brpc_channel_protocol': 'http', 'brpc_channel_connection_type': 'pooled', 'recv_timeout_ms': 1200000, 'http_timeout_ms': 1200000}}
2024-09-14 10:37:55.340 ERROR component.py:1130 [bob] -- [Anonymous_job] eval on domain: "data_prep"
name: "psi"
version: "0.0.5"
attr_paths: "input/receiver_input/key"
attr_paths: "input/sender_input/key"
attr_paths: "protocol"
attr_paths: "sort_result"
attr_paths: "allow_duplicate_keys"
attr_paths: "allow_duplicate_keys/no/skip_duplicates_check"
attr_paths: "fill_value_int"
attr_paths: "ecdh_curve"
attrs {
  ss: "id"
}
attrs {
  ss: "id"
}
attrs {
  s: "PROTOCOL_RR22"
}
attrs {
  b: true
}
attrs {
  s: "no"
}
attrs {
  is_na: true
}
attrs {
  is_na: true
}
attrs {
  s: "CURVE_FOURQ"
}
inputs {
  name: "alice1"
  type: "sf.table.individual"
  meta {
    type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
    value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
  }
  data_refs {
    uri: "alice1_1010363635.csv"
    party: "alice"
    format: "csv"
  }
}
inputs {
  name: "bob1"
  type: "sf.table.individual"
  meta {
    type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
    value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
  }
  data_refs {
    uri: "bob1_1907238687.csv"
    party: "bob"
    format: "csv"
  }
}
output_uris: "gsid-dwdkvwbe-node-35-output-0"
checkpoint_uri: "ckgsid-dwdkvwbe-node-35-output-0"
 failed, error <ray::_run() (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)
  File "/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py", line 156, in _run
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 839, in download_file
    comp_storage.download_file(uri, output_path)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py", line 32, in download_file
    impl.download_file(remote_fn, local_fn)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py", line 171, in download_file
    assert os.path.exists(full_remote_fn)
AssertionError>
2024-09-14 10:37:55.341 INFO api.py:342 [bob] -- [Anonymous_job] Shutdowning rayfed intendedly...
2024-09-14 10:37:55.341 INFO api.py:356 [bob] -- [Anonymous_job] No wait for data sending.
2024-09-14 10:37:55.342 INFO message_queue.py:72 [bob] -- [Anonymous_job] Notify message polling thread[DataSendingQueueThread] to exit.
2024-09-14 10:37:55.342 INFO message_queue.py:72 [bob] -- [Anonymous_job] Notify message polling thread[ErrorSendingQueueThread] to exit.
2024-09-14 10:37:55.342 INFO api.py:384 [bob] -- [Anonymous_job] Shutdowned rayfed.
(raylet) [2024-09-14 10:37:54,186 I 9422 9422] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1 [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 547, in <module>
    main()
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 527, in main
    res = comp_eval(sf_node_eval_param, storage_config, sf_cluster_config)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/entry.py", line 176, in comp_eval
    res = comp.eval(
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/component.py", line 1132, in eval
    raise e from None
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/component.py", line 1127, in eval
    ret = self.__eval_callback(ctx=ctx, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/preprocessing/data_prep/psi.py", line 371, in two_party_balanced_psi_eval_fn
    download_files(ctx, uri, input_path)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 847, in download_files
    wait(waits)
  File "/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py", line 213, in wait
    reveal([o.device(lambda o: None)(o) for o in objs])
  File "/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py", line 162, in reveal
    all_object = sfd.get(all_object_refs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/distributed/primitive.py", line 156, in get
    return fed.get(object_refs)
  File "/usr/local/lib/python3.10/site-packages/fed/api.py", line 621, in get
    values = ray.get(ray_refs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::_run() (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)
  File "/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py", line 156, in _run
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 839, in download_file
    comp_storage.download_file(uri, output_path)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py", line 32, in download_file
    impl.download_file(remote_fn, local_fn)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py", line 171, in download_file
    assert os.path.exists(full_remote_fn)
AssertionError

Meng-xiangkun avatar Sep 14 '24 03:09 Meng-xiangkun

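The log shows that the actual physical file cannot be found. If this is custom user data, the physical files need to be placed in the /home/kuscia/var/storage/data directory on the alice and bob nodes respectively: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.9.0b0/deployment/K8s_deployment_kuscia/K8s_master_lite_cn#id11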
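
The check that fails in the traceback is `assert os.path.exists(full_remote_fn)` in `storage_impl.py`. A minimal sketch of the same check, runnable inside each party's pod, is shown below. It assumes that the path is formed by joining the `local_fs` working directory from the logged `storage_config` with the `uri` of the data reference; the helper name `find_missing_inputs` is hypothetical and not part of SecretFlow.

```python
import os


def find_missing_inputs(storage_wd, uris):
    """Return the URIs that have no file under storage_wd.

    Assumption: the full path is storage_wd joined with the uri,
    which is how the failing assertion appears to build it.
    """
    missing = []
    for uri in uris:
        full_path = os.path.join(storage_wd, uri)
        if not os.path.exists(full_path):
            missing.append(uri)
    return missing


# Values taken from the logs in this thread; run this in the pod of the
# party that owns the file (bob owns bob1_1907238687.csv).
missing = find_missing_inputs("/home/kuscia/var/storage/data",
                              ["bob1_1907238687.csv"])
print("missing files:", missing)
```

An empty list means the file is visible to the task at the expected location; a non-empty list reproduces the condition behind the `AssertionError`.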

yushiqie avatar Sep 14 '24 03:09 yushiqie

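> The log shows that the actual physical file cannot be found. If this is custom user data, the physical files need to be placed in the /home/kuscia/var/storage/data directory on the alice and bob nodes respectively: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.9.0b0/deployment/K8s_deployment_kuscia/K8s_master_lite_cn#id11

image

I uploaded the data source through the SecretPad front-end page, so why is a separate data preparation step still needed?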

Meng-xiangkun avatar Sep 14 '24 06:09 Meng-xiangkun

Is Kuscia deployed on K8s? With your current deployment, how does SecretPad interact with the K8s-deployed Kuscia?

yushiqie avatar Sep 14 '24 06:09 yushiqie

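> Is Kuscia deployed on K8s? With your current deployment, how does SecretPad interact with the K8s-deployed Kuscia?

Kuscia is deployed on K8s. SecretPad is an image built from source and is also deployed on K8s, in the same environment. Below is the SecretPad configuration file: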

server:
  tomcat:
    accesslog:
      enabled: true
      directory: /var/log/secretpad
  servlet:
    session:
      timeout: 30m
  http-port: 8080
  http-port-inner: 9001
  port: 443
  ssl:
    enabled: true
    key-store: "file:./config/server.jks"
    key-store-password: ${KEY_PASSWORD:secretpad}
    key-alias: secretpad-server
    key-password: ${KEY_PASSWORD:secretpad}
    key-store-type: JKS
  compression:
    enabled: true
    mime-types:
      - application/javascript
      - text/css
    min-response-size: 1024
spring:
  task:
    scheduling:
      pool:
        size: 10
  application:
    name: secretpad
  jpa:
    database-platform: org.hibernate.community.dialect.SQLiteDialect
    show-sql: false
    properties:
      hibernate:
        format_sql: false
    open-in-view: false
  datasource:
    driver-class-name: org.sqlite.JDBC
    url: jdbc:sqlite:./db/secretpad.sqlite
    hikari:
      idle-timeout: 60000
      maximum-pool-size: 1
      connection-timeout: 6000
  flyway:
    baseline-on-migrate: true
    locations:
      - filesystem:./config/schema/center

  #datasource used for mysql
  #spring:
  #  task:
  #    scheduling:
  #      pool:
  #        size: 10
  #  application:
  #    name: secretpad
  #  jpa:
  #    database-platform: org.hibernate.dialect.MySQLDialect
  #    show-sql: false
  #    properties:
  #      hibernate:
  #        format_sql: false
  #  datasource:
  #    driver-class-name: com.mysql.cj.jdbc.Driver
  #    url: your mysql url
  #    username:
  #    password:
  #    hikari:
  #      idle-timeout: 60000
  #      maximum-pool-size: 10
  #      connection-timeout: 5000
  jackson:
    deserialization:
      fail-on-missing-external-type-id-property: false
      fail-on-ignored-properties: false
      fail-on-unknown-properties: false
    serialization:
      fail-on-empty-beans: false
  web:
    locale: zh_CN # default locale, overridden by request "Accept-Language" header.
  cache:
    jcache:
      config:
        classpath:ehcache.xml
springdoc:
  api-docs:
    enabled: true
management:
  endpoints:
    web:
      exposure:
        include: health,info,readiness,prometheus
    enabled-by-default: false
kusciaapi:
  protocol: ${KUSCIA_PROTOCOL:notls}

kuscia:
  nodes:
    - domainId: kuscia-system
      mode: master
      host: ${KUSCIA_API_ADDRESS:kuscia-master.data-develop-operate-dev.svc.cluster.local}
      port: ${KUSCIA_API_PORT:8083}
      protocol: ${KUSCIA_PROTOCOL:notls}
      cert-file: config/certs/client.crt
      key-file: config/certs/client.pem
      token: config/certs/token

    - domainId: alice
      mode: lite
      host: ${KUSCIA_API_LITE_ALICE_ADDRESS:kuscia-lite-alice.data-develop-operate-dev.svc.cluster.local}
      port: ${KUSCIA_API_PORT:8083}
      protocol: ${KUSCIA_PROTOCOL:notls}
      cert-file: config/certs/alice/client.crt
      key-file: config/certs/alice/client.pem
      token: config/certs/alice/token

    - domainId: bob
      mode: lite
      host: ${KUSCIA_API_LITE_BOB_ADDRESS:kuscia-lite-bob.data-develop-operate-dev.svc.cluster.local}
      port: ${KUSCIA_API_PORT:8083}
      protocol: ${KUSCIA_PROTOCOL:notls}
      cert-file: config/certs/bob/client.crt
      key-file: config/certs/bob/client.pem
      token: config/certs/bob/token


job:
  max-parallelism: 1

secretpad:
  logs:
    path: ${SECRETPAD_LOG_PATH:../log}
  deploy-mode: ${DEPLOY_MODE:ALL-IN-ONE} # MPC TEE ALL-IN-ONE
  platform-type: CENTER
  node-id: kuscia-system
  center-platform-service: secretpad.master.svc
  gateway: ${KUSCIA_GW_ADDRESS:127.0.0.1:80}
  auth:
    enabled: true
    pad_name: ${SECRETPAD_USER_NAME}
    pad_pwd: ${SECRETPAD_PASSWORD}
  response:
    extra-headers:
      Content-Security-Policy: "base-uri 'self';frame-src 'self';worker-src blob: 'self' data:;object-src 'self';"
  upload-file:
    max-file-size: -1    # -1 means not limit, e.g.  200MB, 1GB
    max-request-size: -1 # -1 means not limit, e.g.  200MB, 1GB
  data:
    dir-path: /app/data/
  datasync:
    center: true
    p2p: false
  version:
    secretpad-image: ${SECRETPAD_IMAGE:0.5.0b0}
    kuscia-image: ${KUSCIA_IMAGE:0.6.0b0}
    secretflow-image: ${SECRETFLOW_IMAGE:1.4.0b0}
    secretflow-serving-image: ${SECRETFLOW_SERVING_IMAGE:0.2.0b0}
    tee-app-image: ${TEE_APP_IMAGE:0.1.0b0}
    tee-dm-image: ${TEE_DM_IMAGE:0.1.0b0}
    capsule-manager-sim-image: ${CAPSULE_MANAGER_SIM_IMAGE:0.1.2b0}

  component:
    hide:
      - secretflow/io/read_data:0.0.1
      - secretflow/io/write_data:0.0.1
      - secretflow/io/identity:0.0.1
      - secretflow/model/model_export:0.0.1
      - secretflow/ml.train/slnn_train:0.0.1
      - secretflow/ml.predict/slnn_predict:0.0.2

sfclusterDesc:
  deviceConfig:
    spu: "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
    heu: "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
  rayFedConfig:
    crossSiloCommBackend: "brpc_link"

tee:
  capsule-manager: capsule-manager.#.svc

data:
  sync:
    - org.secretflow.secretpad.persistence.entity.ProjectDO
    - org.secretflow.secretpad.persistence.entity.ProjectNodeDO
    - org.secretflow.secretpad.persistence.entity.NodeDO
    - org.secretflow.secretpad.persistence.entity.NodeRouteDO
    - org.secretflow.secretpad.persistence.entity.ProjectJobDO
    - org.secretflow.secretpad.persistence.entity.ProjectTaskDO
    - org.secretflow.secretpad.persistence.entity.ProjectDatatableDO
    - org.secretflow.secretpad.persistence.entity.VoteRequestDO
    - org.secretflow.secretpad.persistence.entity.VoteInviteDO
    - org.secretflow.secretpad.persistence.entity.TeeDownLoadAuditConfigDO
    - org.secretflow.secretpad.persistence.entity.NodeRouteApprovalConfigDO
    - org.secretflow.secretpad.persistence.entity.TeeNodeDatatableManagementDO
    - org.secretflow.secretpad.persistence.entity.ProjectModelServingDO
    - org.secretflow.secretpad.persistence.entity.ProjectGraphNodeKusciaParamsDO
    - org.secretflow.secretpad.persistence.entity.ProjectModelPackDO
    - org.secretflow.secretpad.persistence.entity.FeatureTableDO
    - org.secretflow.secretpad.persistence.entity.ProjectFeatureTableDO
    - org.secretflow.secretpad.persistence.entity.ProjectGraphDomainDatasourceDO

inner-port:
  path:
    - /api/v1alpha1/vote_sync/create
    - /api/v1alpha1/user/node/resetPassword
    - /sync
    - /api/v1alpha1/data/sync
# ip block config (None of them are allowed in the configured IP list)
ip:
  block:
    enable: true
    list:
      - 0.0.0.0/32
      - 127.0.0.1/8
      - 10.0.0.0/8
      - 11.0.0.0/8
      - 30.0.0.0/8
      - 100.64.0.0/10
      - 172.16.0.0/12
      - 192.168.0.0/16
      - 33.0.0.0/8

Meng-xiangkun avatar Sep 14 '24 06:09 Meng-xiangkun

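In a Docker deployment, SecretPad and Kuscia share data by mounting the same data directory. A K8s deployment currently needs the same approach: both must mount the same volume. For K8s, an OSS data source is recommended.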
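
As a rough illustration of the shared-volume idea, the sketch below builds two Kubernetes patch fragments that mount one PersistentVolumeClaim into both a Kuscia lite Deployment and the SecretPad Deployment. Everything named here is an assumption rather than documented Kuscia or SecretPad setup: the claim name `kuscia-alice-data`, the container names `kuscia` and `secretpad`, and the helper `shared_volume_patch` are hypothetical. The mount paths come from this thread (`/home/kuscia/var/storage/data` from the logged `storage_config`, `/app/data/` from `secretpad.data.dir-path`), but the sub-directory layout SecretPad uses for each node's uploads is not shown here and would need to be checked.

```python
import json


def shared_volume_patch(container_name, claim_name, mount_path,
                        volume_name="shared-data"):
    """Build a Deployment patch that mounts an existing PVC into one container."""
    return {
        "spec": {
            "template": {
                "spec": {
                    # Pod-level volume backed by the shared claim.
                    "volumes": [{
                        "name": volume_name,
                        "persistentVolumeClaim": {"claimName": claim_name},
                    }],
                    # Container-level mount of that volume.
                    "containers": [{
                        "name": container_name,
                        "volumeMounts": [{
                            "name": volume_name,
                            "mountPath": mount_path,
                        }],
                    }],
                }
            }
        }
    }


# Hypothetical names; one claim per party (alice shown).
kuscia_patch = shared_volume_patch(
    "kuscia", "kuscia-alice-data", "/home/kuscia/var/storage/data")
secretpad_patch = shared_volume_patch(
    "secretpad", "kuscia-alice-data", "/app/data/")

print(json.dumps(kuscia_patch, indent=2))
print(json.dumps(secretpad_patch, indent=2))
```

If the two pods can be scheduled onto different cluster nodes, the claim would need an access mode that allows that, such as `ReadWriteMany`, which depends on the storage class available in the cluster.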

yushiqie avatar Sep 14 '24 06:09 yushiqie

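> In a Docker deployment, SecretPad and Kuscia share data by mounting the same data directory. A K8s deployment currently needs the same approach: both must mount the same volume. For K8s, an OSS data source is recommended.

OK. Is there any documentation for mounting the same volume in a K8s deployment to achieve this?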

Meng-xiangkun avatar Sep 14 '24 06:09 Meng-xiangkun

Not at the moment. You can look through the K8s deployment documentation.

yushiqie avatar Sep 14 '24 06:09 yushiqie