K8s deployment of a Kuscia centralized cluster in RunP mode: running privacy-preserving computation fails
Issue Type
Feature
Search for existing issues similar to yours
Yes
Kuscia Version
0.10.0b0
Link to Relevant Documentation
No response
Question Details
The following error occurs when running a private set intersection (PSI) job with the kuscia-secretflow:latest image:
Failed to process object: error handling "dppm-qvxgwzap-node-35", failed to process kusciaTask "dppm-qvxgwzap-node-35", failed to build domain bob kit info, failed to get appImage "secretflow-image" from cache, appimage.kuscia.secretflow "secretflow-image" not found, retry
Failed to update kuscia job "dppm" status, Operation cannot be fulfilled on kusciajobs.kuscia.secretflow "dppm": the object has been modified; please apply your changes to the latest version and try again
2024-09-12 18:30:34.303 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status
2024-09-12 18:30:34.317 INFO resources/kusciajob.go:118 Finish updating kuscia job "dppm" status
2024-09-12 18:30:34.317 INFO kusciajob/controller.go:298 Finished syncing KusciaJob "dppm" (13.420693ms)
2024-09-12 18:30:34.317 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (13.470899ms)
2024-09-12 18:30:34.317 INFO resources/kusciajob.go:82 update kuscia job dppm
2024-09-12 18:30:34.329 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (12.672843ms)
2024-09-12 18:30:34.330 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status
2024-09-12 18:30:34.343 INFO resources/kusciajob.go:118 Finish updating kuscia job "dppm" status
2024-09-12 18:30:34.343 INFO kusciajob/controller.go:298 Finished syncing KusciaJob "dppm" (13.248207ms)
2024-09-12 18:30:34.343 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (13.29884ms)
2024-09-12 18:30:34.345 INFO handler/job_scheduler.go:323 Create kuscia tasks: dppm-qvxgwzap-node-35
2024-09-12 18:30:34.357 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status
2024-09-12 18:30:34.369 WARN kusciatask/controller.go:424 Error handling "dppm-qvxgwzap-node-35", re-queuing
2024-09-12 18:30:34.369 ERROR kusciatask/controller.go:435 Failed to process object: error handling "dppm-qvxgwzap-node-35", failed to process kusciaTask "dppm-qvxgwzap-node-35", failed to build domain bob kit info, failed to get appImage "secretflow-image" from cache, appimage.kuscia.secretflow "secretflow-image" not found, retry
2024-09-12 18:30:34.370 INFO resources/kusciajob.go:118 Finish updating kuscia job "dppm" status
2024-09-12 18:30:34.370 INFO kusciajob/controller.go:298 Finished syncing KusciaJob "dppm" (25.113735ms)
2024-09-12 18:30:34.370 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (25.15742ms)
2024-09-12 18:30:34.370 INFO handler/job_scheduler.go:661 jobStatusPhaseFrom readyTasks={}, tasks={{taskId=dppm-qvxgwzap-node-35, dependencies=[], tolerable=false, phase=}}, kusciaJobId=dppm
2024-09-12 18:30:34.370 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status
2024-09-12 18:30:34.383 WARN kusciatask/controller.go:424 Error handling "dppm-qvxgwzap-node-35", re-queuing
2024-09-12 18:30:34.383 ERROR kusciatask/controller.go:435 Failed to process object: error handling "dppm-qvxgwzap-node-35", failed to process kusciaTask "dppm-qvxgwzap-node-35", failed to build domain bob kit info, failed to get appImage "secretflow-image" from cache, appimage.kuscia.secretflow "secretflow-image" not found, retry
2024-09-12 18:30:34.385 INFO resources/kusciajob.go:118 Finish updating kuscia job "dppm" status
2024-09-12 18:30:34.386 INFO kusciajob/controller.go:298 Finished syncing KusciaJob "dppm" (15.795756ms)
2024-09-12 18:30:34.386 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (15.879731ms)
2024-09-12 18:30:34.388 INFO handler/job_scheduler.go:661 jobStatusPhaseFrom readyTasks={}, tasks={{taskId=dppm-qvxgwzap-node-35, dependencies=[], tolerable=false, phase=}}, kusciaJobId=dppm
2024-09-12 18:30:34.388 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (488.279µs)
2024-09-12 18:30:34.399 WARN kusciatask/controller.go:424 Error handling "dppm-qvxgwzap-node-35", re-queuing
2024-09-12 18:30:34.399 ERROR kusciatask/controller.go:435 Failed to process object: error handling "dppm-qvxgwzap-node-35", failed to process kusciaTask "dppm-qvxgwzap-node-35", failed to build domain bob kit info, failed to get appImage "secretflow-image" from cache, appimage.kuscia.secretflow "secretflow-image" not found, retry
2024-09-12 18:30:34.423 WARN kusciatask/controller.go:424 Error handling "dppm-qvxgwzap-node-35", re-queuing
2024-09-12 18:30:34.424 ERROR kusciatask/controller.go:435 Failed to process object: error handling "dppm-qvxgwzap-node-35", failed to process kusciaTask "dppm-qvxgwzap-node-35", failed to build domain bob kit info, failed to get appImage "secretflow-image" from cache, appimage.kuscia.secretflow "secretflow-image" not found, retry
2024-09-12 18:30:34.472 INFO resources/kusciatask.go:69 Start updating kuscia task "dppm-qvxgwzap-node-35" status
2024-09-12 18:30:34.488 INFO resources/kusciatask.go:71 Finish updating kuscia task "dppm-qvxgwzap-node-35" status
2024-09-12 18:30:34.488 INFO kusciatask/controller.go:521 Finished syncing kusciatask "dppm-qvxgwzap-node-35" (24.193535ms)
2024-09-12 18:30:34.490 INFO handler/job_scheduler.go:661 jobStatusPhaseFrom readyTasks={}, tasks={{taskId=dppm-qvxgwzap-node-35, dependencies=[], tolerable=false, phase=Failed}}, kusciaJobId=dppm
2024-09-12 18:30:34.490 INFO handler/job_scheduler.go:679 jobStatusPhaseFrom failed readyTasks={}, tasks={{taskId=dppm-qvxgwzap-node-35, dependencies=[], tolerable=false, phase=Failed}}, kusciaJobId=dppm
2024-09-12 18:30:34.491 WARN handler/failed_handler.go:62 Get task resource group dppm-qvxgwzap-node-35 failed, skip setting its status to failed, taskresourcegroup.kuscia.secretflow "dppm-qvxgwzap-node-35" not found
2024-09-12 18:30:34.491 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status
2024-09-12 18:30:34.491 INFO resources/kusciatask.go:69 Start updating kuscia task "dppm-qvxgwzap-node-35" status
2024-09-12 18:30:34.505 INFO resources/kusciajob.go:118 Finish updating kuscia job "dppm" status
2024-09-12 18:30:34.505 INFO kusciajob/controller.go:298 Finished syncing KusciaJob "dppm" (14.950352ms)
2024-09-12 18:30:34.505 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (14.972553ms)
2024-09-12 18:30:34.510 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status
2024-09-12 18:30:34.510 INFO resources/kusciatask.go:71 Finish updating kuscia task "dppm-qvxgwzap-node-35" status
2024-09-12 18:30:34.510 INFO kusciatask/controller.go:521 Finished syncing kusciatask "dppm-qvxgwzap-node-35" (19.491329ms)
2024-09-12 18:30:34.510 INFO kusciatask/controller.go:489 KusciaTask "dppm-qvxgwzap-node-35" was finished, skipping
2024-09-12 18:30:34.523 INFO resources/kusciajob.go:118 Finish updating kuscia job "dppm" status
2024-09-12 18:30:34.523 INFO kusciajob/controller.go:298 Finished syncing KusciaJob "dppm" (13.33302ms)
2024-09-12 18:30:34.523 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (13.376915ms)
2024-09-12 18:30:34.523 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status
2024-09-12 18:30:34.534 WARN resources/kusciajob.go:122 Failed to update kuscia job "dppm" status, Operation cannot be fulfilled on kusciajobs.kuscia.secretflow "dppm": the object has been modified; please apply your changes to the latest version and try again
2024-09-12 18:30:34.542 INFO resources/kusciajob.go:116 Start updating kuscia job "dppm" status
2024-09-12 18:30:34.554 INFO resources/kusciajob.go:118 Finish updating kuscia job "dppm" status
2024-09-12 18:30:34.555 INFO kusciajob/controller.go:298 Finished syncing KusciaJob "dppm" (31.853225ms)
2024-09-12 18:30:34.555 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (31.901265ms)
2024-09-12 18:30:34.555 INFO handler/job_scheduler.go:700 KusciaJob dppm was finished, skipping
2024-09-12 18:30:34.555 INFO kusciajob/controller.go:266 KusciaJob "dppm" should not reconcile again, skipping
2024-09-12 18:30:34.555 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-job-controller], key[dppm] (111.519µs)
The error log shows that the "secretflow-image" AppImage is missing. You can run kuscia get appimage to check whether it exists; if it does exist, please also provide the pod's engine logs.
Is this what you mean?
You can add your namespace (-n name), or use -A to view all namespaces.
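For example, inside the master container (a minimal sketch following the suggestion above; the resource name comes from this thread):

kubectl get appimage -A                         # list AppImage resources across all namespaces
kubectl get appimage secretflow-image -o yaml   # inspect the one referenced by appImageRef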
Still the same.
Job task details:
sh-4.4# kubectl get kt jaqj-qvxgwzap-node-35 -n cross-domain -o yaml
apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaTask
metadata:
  creationTimestamp: "2024-09-12T10:49:29Z"
  generation: 1
  labels:
    kuscia.secretflow/controller: kuscia-job
    kuscia.secretflow/job-id: jaqj
    kuscia.secretflow/self-cluster-as-initiator: "true"
    kuscia.secretflow/task-alias: jaqj-qvxgwzap-node-35
  name: jaqj-qvxgwzap-node-35
  ownerReferences:
  - apiVersion: kuscia.secretflow/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: KusciaJob
    name: jaqj
    uid: 9a2a5920-c23d-409d-afdc-14d82e5e53e4
  resourceVersion: "14340"
  uid: 73a41e0d-4b9d-4d03-b5eb-261efb760b15
spec:
  initiator: bob
  parties:
  - appImageRef: secretflow-image
    domainID: bob
    template:
      spec: {}
  - appImageRef: secretflow-image
    domainID: alice
    template:
      spec: {}
  scheduleConfig: {}
  taskInputConfig: |-
    {
      "sf_datasource_config": {
        "bob": {
          "id": "default-data-source"
        },
        "alice": {
          "id": "default-data-source"
        }
      },
      "sf_cluster_desc": {
        "parties": ["bob", "alice"],
        "devices": [{
          "name": "spu",
          "type": "spu",
          "parties": ["bob", "alice"],
          "config": "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
        }, {
          "name": "heu",
          "type": "heu",
          "parties": ["bob", "alice"],
          "config": "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
        }],
        "ray_fed_config": {
          "cross_silo_comm_backend": "brpc_link"
        }
      },
      "sf_node_eval_param": {
        "domain": "data_prep",
        "name": "psi",
        "version": "0.0.5",
        "attr_paths": ["input/receiver_input/key", "input/sender_input/key", "protocol", "sort_result", "allow_duplicate_keys", "allow_duplicate_keys/no/skip_duplicates_check", "fill_value_int", "ecdh_curve"],
        "attrs": [{
          "is_na": false,
          "ss": ["id1"]
        }, {
          "is_na": false,
          "ss": ["id2"]
        }, {
          "is_na": false,
          "s": "PROTOCOL_RR22"
        }, {
          "b": true,
          "is_na": false
        }, {
          "is_na": false,
          "s": "no"
        }, {
          "is_na": true
        }, {
          "is_na": true
        }, {
          "is_na": false,
          "s": "CURVE_FOURQ"
        }],
        "inputs": [{
          "type": "sf.table.individual",
          "meta": {
            "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
            "line_count": "-1"
          },
          "data_refs": [{
            "uri": "alice.csv",
            "party": "alice",
            "format": "csv"
          }]
        }, {
          "type": "sf.table.individual",
          "meta": {
            "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
            "line_count": "-1"
          },
          "data_refs": [{
            "uri": "bob.csv",
            "party": "bob",
            "format": "csv"
          }]
        }],
        "checkpoint_uri": "ckjaqj-qvxgwzap-node-35-output-0"
      },
      "sf_output_uris": ["jaqj-qvxgwzap-node-35-output-0"],
      "sf_input_ids": ["alice-table", "bob-table"],
      "sf_output_ids": ["jaqj-qvxgwzap-node-35-output-0"]
    }
status:
  completionTime: "2024-09-12T10:49:29Z"
  conditions:
  - lastTransitionTime: "2024-09-12T10:49:29Z"
    message: Failed to create kusciaTask related resources, failed to build domain
      bob kit info, failed to get appImage "secretflow-image" from cache, appimage.kuscia.secretflow
      "secretflow-image" not found
    reason: KusciaTaskCreateFailed
    status: "False"
    type: ResourceCreated
  lastReconcileTime: "2024-09-12T10:49:29Z"
  message: 'KusciaTask failed after 3x retry, last error: failed to build domain bob
    kit info, failed to get appImage "secretflow-image" from cache, appimage.kuscia.secretflow
    "secretflow-image" not found'
  phase: Failed
  startTime: "2024-09-12T10:49:29Z"
Please re-check the node deployment steps: the AppImage needs to be created manually, see https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.11.0b0/deployment/K8s_deployment_kuscia/K8s_master_lite_cn#appimage
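For orientation, the AppImage from that guide has roughly this shape (a trimmed, illustrative sketch; the required deployTemplates/configTemplates sections are omitted, so copy the full AppImage.yaml from the linked doc rather than applying this fragment as-is):

kubectl apply -f - <<'EOF'
apiVersion: kuscia.secretflow/v1alpha1
kind: AppImage
metadata:
  name: secretflow-image        # must match the appImageRef in the KusciaTask above
spec:
  image:
    name: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8
    tag: 1.7.0b0
  # deployTemplates / configTemplates from the guide go here
EOF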
The file doesn't exist. Where is this file supposed to go?
You can check whether the AppImage.yaml file exists in your current directory.
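Something like the following, assuming the working directory used during deployment (a sketch, not verified against your environment):

ls -l AppImage.yaml              # the guide's manifest file, saved during deployment
kubectl apply -f AppImage.yaml   # creates the secretflow-image AppImage on the master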
There's no such file. Should I upload one, and to which location?
The issue above is resolved. Now the task stays Pending: it cannot fetch the secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8 image. Is there another way to provide this image besides pulling it from secretflow-registry.cn-hangzhou.cr.aliyuncs.com? Our cluster environment does not allow pulling external images.
sh-4.4# kubectl get kt -n cross-domain
NAME                    STARTTIME   COMPLETIONTIME   LASTRECONCILETIME   PHASE
fpvu-alice              3m30s       3m30s            3m30s               Failed
gere-bob                3m30s       3m30s            3m30s               Failed
alzf-qvxgwzap-node-35   2m36s                        2m18s               Pending
sh-4.4# kubectl get kt alzf-qvxgwzap-node-35 -n cross-domain -o yaml
apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaTask
metadata:
  creationTimestamp: "2024-09-13T06:27:39Z"
  generation: 1
  labels:
    kuscia.secretflow/controller: kuscia-job
    kuscia.secretflow/job-id: alzf
    kuscia.secretflow/self-cluster-as-initiator: "true"
    kuscia.secretflow/task-alias: alzf-qvxgwzap-node-35
  name: alzf-qvxgwzap-node-35
  ownerReferences:
  - apiVersion: kuscia.secretflow/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: KusciaJob
    name: alzf
    uid: 1c3bf688-1a1d-4ba1-98dc-9239ec113ebd
  resourceVersion: "2736"
  uid: 7a1f8356-82da-44f4-8b10-cab10b0a87be
spec:
  initiator: bob
  parties:
  - appImageRef: secretflow-image
    domainID: bob
    template:
      spec: {}
  - appImageRef: secretflow-image
    domainID: alice
    template:
      spec: {}
  scheduleConfig: {}
  taskInputConfig: |-
    {
      "sf_datasource_config": {
        "bob": {
          "id": "default-data-source"
        },
        "alice": {
          "id": "default-data-source"
        }
      },
      "sf_cluster_desc": {
        "parties": ["bob", "alice"],
        "devices": [{
          "name": "spu",
          "type": "spu",
          "parties": ["bob", "alice"],
          "config": "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
        }, {
          "name": "heu",
          "type": "heu",
          "parties": ["bob", "alice"],
          "config": "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
        }],
        "ray_fed_config": {
          "cross_silo_comm_backend": "brpc_link"
        }
      },
      "sf_node_eval_param": {
        "domain": "data_prep",
        "name": "psi",
        "version": "0.0.5",
        "attr_paths": ["input/receiver_input/key", "input/sender_input/key", "protocol", "sort_result", "allow_duplicate_keys", "allow_duplicate_keys/no/skip_duplicates_check", "fill_value_int", "ecdh_curve"],
        "attrs": [{
          "is_na": false,
          "ss": ["id1"]
        }, {
          "is_na": false,
          "ss": ["id2"]
        }, {
          "is_na": false,
          "s": "PROTOCOL_RR22"
        }, {
          "b": true,
          "is_na": false
        }, {
          "is_na": false,
          "s": "no"
        }, {
          "is_na": true
        }, {
          "is_na": true
        }, {
          "is_na": false,
          "s": "CURVE_FOURQ"
        }],
        "inputs": [{
          "type": "sf.table.individual",
          "meta": {
            "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
            "line_count": "-1"
          },
          "data_refs": [{
            "uri": "alice.csv",
            "party": "alice",
            "format": "csv"
          }]
        }, {
          "type": "sf.table.individual",
          "meta": {
            "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
            "line_count": "-1"
          },
          "data_refs": [{
            "uri": "bob.csv",
            "party": "bob",
            "format": "csv"
          }]
        }],
        "checkpoint_uri": "ckalzf-qvxgwzap-node-35-output-0"
      },
      "sf_output_uris": ["alzf-qvxgwzap-node-35-output-0"],
      "sf_input_ids": ["alice-table", "bob-table"],
      "sf_output_ids": ["alzf-qvxgwzap-node-35-output-0"]
    }
status:
  allocatedPorts:
  - domainID: alice
    namedPort:
      alzf-qvxgwzap-node-35-0/client-server: 31454
      alzf-qvxgwzap-node-35-0/fed: 31450
      alzf-qvxgwzap-node-35-0/global: 31451
      alzf-qvxgwzap-node-35-0/node-manager: 31452
      alzf-qvxgwzap-node-35-0/object-manager: 31453
      alzf-qvxgwzap-node-35-0/spu: 31449
  - domainID: bob
    namedPort:
      alzf-qvxgwzap-node-35-0/client-server: 32739
      alzf-qvxgwzap-node-35-0/fed: 32741
      alzf-qvxgwzap-node-35-0/global: 32742
      alzf-qvxgwzap-node-35-0/node-manager: 32737
      alzf-qvxgwzap-node-35-0/object-manager: 32738
      alzf-qvxgwzap-node-35-0/spu: 32740
  conditions:
  - lastTransitionTime: "2024-09-13T06:27:39Z"
    status: "True"
    type: ResourceCreated
  lastReconcileTime: "2024-09-13T06:27:57Z"
  phase: Pending
  podStatuses:
    alice/alzf-qvxgwzap-node-35-0:
      createTime: "2024-09-13T06:27:39Z"
      message: 'container[secretflow] waiting state reason: "ImageInspectError", message:
        "Failed to inspect image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\":
        failed to get image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\"
        manifest, detail-> image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\"
        not exist in local repository"'
      namespace: alice
      nodeName: kuscia-lite-alice-9b7cdf6fd-l8dt5
      podName: alzf-qvxgwzap-node-35-0
      podPhase: Pending
      reason: ImageInspectError
      startTime: "2024-09-13T06:27:41Z"
    bob/alzf-qvxgwzap-node-35-0:
      createTime: "2024-09-13T06:27:39Z"
      message: 'container[secretflow] waiting state reason: "ImageInspectError", message:
        "Failed to inspect image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\":
        failed to get image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\"
        manifest, detail-> image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\"
        not exist in local repository"'
      namespace: bob
      nodeName: kuscia-lite-bob-7df5b89f5-vcrl9
      podName: alzf-qvxgwzap-node-35-0
      podPhase: Pending
      reason: ImageInspectError
      startTime: "2024-09-13T06:27:41Z"
  serviceStatuses:
    alice/alzf-qvxgwzap-node-35-0-fed:
      createTime: "2024-09-13T06:27:39Z"
      namespace: alice
      portName: fed
      portNumber: 31450
      readyTime: "2024-09-13T06:27:41Z"
      scope: Cluster
      serviceName: alzf-qvxgwzap-node-35-0-fed
    alice/alzf-qvxgwzap-node-35-0-global:
      createTime: "2024-09-13T06:27:39Z"
      namespace: alice
      portName: global
      portNumber: 31451
      readyTime: "2024-09-13T06:27:41Z"
      scope: Domain
      serviceName: alzf-qvxgwzap-node-35-0-global
    alice/alzf-qvxgwzap-node-35-0-spu:
      createTime: "2024-09-13T06:27:39Z"
      namespace: alice
      portName: spu
      portNumber: 31449
      readyTime: "2024-09-13T06:27:41Z"
      scope: Cluster
      serviceName: alzf-qvxgwzap-node-35-0-spu
    bob/alzf-qvxgwzap-node-35-0-fed:
      createTime: "2024-09-13T06:27:39Z"
      namespace: bob
      portName: fed
      portNumber: 32741
      readyTime: "2024-09-13T06:27:41Z"
      scope: Cluster
      serviceName: alzf-qvxgwzap-node-35-0-fed
    bob/alzf-qvxgwzap-node-35-0-global:
      createTime: "2024-09-13T06:27:39Z"
      namespace: bob
      portName: global
      portNumber: 32742
      readyTime: "2024-09-13T06:27:41Z"
      scope: Domain
      serviceName: alzf-qvxgwzap-node-35-0-global
    bob/alzf-qvxgwzap-node-35-0-spu:
      createTime: "2024-09-13T06:27:39Z"
      namespace: bob
      portName: spu
      portNumber: 32740
      readyTime: "2024-09-13T06:27:41Z"
      scope: Cluster
      serviceName: alzf-qvxgwzap-node-35-0-spu
  startTime: "2024-09-13T06:27:39Z"
In kuscia 0.10.x, the RunP runtime does not support pulling task images dynamically. You can take either of the following measures (a build sketch follows this list):
- Package kuscia and secretflow into a single image via docker build -f kuscia-secretflow.Dockerfile . (kuscia-secretflow.Dockerfile)
- Upgrade kuscia to 0.11.x
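A minimal sketch of the first option (the repo branch and Dockerfile path are taken from the link later in this thread; the output tag is illustrative):

# Build a combined kuscia + secretflow image from the kuscia 0.10.x source
git clone -b release/0.10.x https://github.com/secretflow/kuscia.git
cd kuscia/build/dockerfile
docker build -f kuscia-secretflow.Dockerfile -t kuscia-secretflow:0.10.0b0 .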
ERROR: failed to solve: secretflow/anolis8-python:3.10.13: failed to resolve source metadata for docker.io/secretflow/anolis8-python:3.10.13: failed to do request: Head "https://registry-1.docker.io/v2/secretflow/anolis8-python/manifests/3.10.13": dial tcp 108.160.169.185:443: connect: connection refused — is there another address this image can be pulled from?
You can use secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/anolis8-python:3.10.13
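For an offline build, one workaround (a hedged sketch) is to pull from the mirror and retag it to the docker.io name the Dockerfile references, so the build can resolve it from the local image store:

docker pull secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/anolis8-python:3.10.13
docker tag secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/anolis8-python:3.10.13 secretflow/anolis8-python:3.10.13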
https://github.com/secretflow/secretpad/issues/130
I used the combined kuscia + secretflow image, but I still get this error: "Failed to inspect image "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0": failed to get image "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0" manifest, detail-> image "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0" not exist in local repository"' — do I need to change some configuration so the image can be found?
Check the secretflow version that the Dockerfile imports by default: https://github.com/secretflow/kuscia/blob/release/0.10.x/build/dockerfile/kuscia-secretflow.Dockerfile#L15
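One way to check (an illustrative sketch; the point is that the engine image name and tag baked into your combined image must match spec.image in the secretflow-image AppImage):

# Show the head of the Dockerfile, including the default secretflow version around line 15
curl -s https://raw.githubusercontent.com/secretflow/kuscia/release/0.10.x/build/dockerfile/kuscia-secretflow.Dockerfile | head -n 20
# If they differ, rebuild with a matching secretflow or align the AppImage's image name/tag
kubectl edit appimage secretflow-image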
The image problem is solved. Now the PSI job fails when run against a locally uploaded dataset.
apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaTask
metadata:
  annotations:
    kuscia.secretflow/job-id: gsid
    kuscia.secretflow/self-cluster-as-participant: "true"
    kuscia.secretflow/task-alias: gsid-dwdkvwbe-node-35
  creationTimestamp: "2024-09-14T02:37:41Z"
  generation: 1
  labels:
    kuscia.secretflow/controller: kuscia-job
    kuscia.secretflow/job-uid: 25d3045f-2277-41d3-8cb6-eeb23747073b
  name: gsid-dwdkvwbe-node-35
  namespace: cross-domain
  ownerReferences:
  - apiVersion: kuscia.secretflow/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: KusciaJob
    name: gsid
    uid: 25d3045f-2277-41d3-8cb6-eeb23747073b
  resourceVersion: "12285"
  uid: 3f11ec51-7e6c-4928-89f6-b16374ef50b5
spec:
  initiator: bob
  parties:
  - appImageRef: secretflow-image
    domainID: bob
    template:
      spec: {}
  - appImageRef: secretflow-image
    domainID: alice
    template:
      spec: {}
  scheduleConfig: {}
  taskInputConfig: |-
    {
      "sf_datasource_config": {
        "bob": {
          "id": "default-data-source"
        },
        "alice": {
          "id": "default-data-source"
        }
      },
      "sf_cluster_desc": {
        "parties": ["bob", "alice"],
        "devices": [{
          "name": "spu",
          "type": "spu",
          "parties": ["bob", "alice"],
          "config": "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
        }, {
          "name": "heu",
          "type": "heu",
          "parties": ["bob", "alice"],
          "config": "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
        }],
        "ray_fed_config": {
          "cross_silo_comm_backend": "brpc_link"
        }
      },
      "sf_node_eval_param": {
        "domain": "data_prep",
        "name": "psi",
        "version": "0.0.5",
        "attr_paths": ["input/receiver_input/key", "input/sender_input/key", "protocol", "sort_result", "allow_duplicate_keys", "allow_duplicate_keys/no/skip_duplicates_check", "fill_value_int", "ecdh_curve"],
        "attrs": [{
          "is_na": false,
          "ss": ["id"]
        }, {
          "is_na": false,
          "ss": ["id"]
        }, {
          "is_na": false,
          "s": "PROTOCOL_RR22"
        }, {
          "b": true,
          "is_na": false
        }, {
          "is_na": false,
          "s": "no"
        }, {
          "is_na": true
        }, {
          "is_na": true
        }, {
          "is_na": false,
          "s": "CURVE_FOURQ"
        }],
        "inputs": [{
          "type": "sf.table.individual",
          "meta": {
            "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
            "line_count": "-1"
          },
          "data_refs": [{
            "uri": "alice1_1010363635.csv",
            "party": "alice",
            "format": "csv"
          }]
        }, {
          "type": "sf.table.individual",
          "meta": {
            "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
            "line_count": "-1"
          },
          "data_refs": [{
            "uri": "bob1_1907238687.csv",
            "party": "bob",
            "format": "csv"
          }]
        }],
        "checkpoint_uri": "ckgsid-dwdkvwbe-node-35-output-0"
      },
      "sf_output_uris": ["gsid-dwdkvwbe-node-35-output-0"],
      "sf_input_ids": ["astrqxxq", "yxcxhdat"],
      "sf_output_ids": ["gsid-dwdkvwbe-node-35-output-0"]
    }
status:
  allocatedPorts:
  - domainID: bob
    namedPort:
      gsid-dwdkvwbe-node-35-0/client-server: 20393
      gsid-dwdkvwbe-node-35-0/fed: 20395
      gsid-dwdkvwbe-node-35-0/global: 20390
      gsid-dwdkvwbe-node-35-0/node-manager: 20391
      gsid-dwdkvwbe-node-35-0/object-manager: 20392
      gsid-dwdkvwbe-node-35-0/spu: 20394
  - domainID: alice
    namedPort:
      gsid-dwdkvwbe-node-35-0/client-server: 21057
      gsid-dwdkvwbe-node-35-0/fed: 21059
      gsid-dwdkvwbe-node-35-0/global: 21054
      gsid-dwdkvwbe-node-35-0/node-manager: 21055
      gsid-dwdkvwbe-node-35-0/object-manager: 21056
      gsid-dwdkvwbe-node-35-0/spu: 21058
  completionTime: "2024-09-14T02:37:57Z"
  conditions:
  - lastTransitionTime: "2024-09-14T02:37:41Z"
    status: "True"
    type: ResourceCreated
  - lastTransitionTime: "2024-09-14T02:37:43Z"
    status: "True"
    type: Running
  - lastTransitionTime: "2024-09-14T02:37:57Z"
    status: "False"
    type: Success
  lastReconcileTime: "2024-09-14T02:37:57Z"
  message: The remaining no-failed party task counts 1 are less than the threshold
    2 that meets the conditions for task success. pending party[], running party[alice],
    successful party[], failed party[bob]
  partyTaskStatus:
  - domainID: bob
    phase: Failed
  - domainID: alice
    phase: Failed
  phase: Failed
  podStatuses:
    alice/gsid-dwdkvwbe-node-35-0:
      createTime: "2024-09-14T02:37:41Z"
      namespace: alice
      nodeName: kuscia-lite-alice-784b59647f-55mdx
      podName: gsid-dwdkvwbe-node-35-0
      podPhase: Failed
      readyTime: "2024-09-14T02:37:44Z"
      startTime: "2024-09-14T02:37:43Z"
    bob/gsid-dwdkvwbe-node-35-0:
      createTime: "2024-09-14T02:37:41Z"
      namespace: bob
      nodeName: kuscia-lite-bob-6d7d6c998f-zhtll
      podName: gsid-dwdkvwbe-node-35-0
      podPhase: Failed
      readyTime: "2024-09-14T02:37:43Z"
      reason: Error
      startTime: "2024-09-14T02:37:43Z"
      terminationLog: 'container[secretflow] terminated state reason "Error", message:
        "... Ignore 12413 characters at the beginning ...\ning_failure'': True}\n\x1b[36m(SenderReceiverProxyActor
        pid=9199)\x1b[0m I0914 10:37:52.646880 9199 external/com_github_brpc_brpc/src/brpc/server.cpp:1181]
        Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on
        port=20395.\n\x1b[36m(SenderReceiverProxyActor pid=9199)\x1b[0m W0914 10:37:52.646909 9199
        external/com_github_brpc_brpc/src/brpc/server.cpp:1187] Builtin services are
        disabled according to ServerOptions.has_builtin_services\n\x1b[36m(SenderReceiverProxyActor
        pid=9199)\x1b[0m I0914 10:37:53.321158 9421 external/com_github_brpc_brpc/src/brpc/span.cpp:506]
        Opened ./rpc_data/rpcz/20240914.103753.9199/id.db and ./rpc_data/rpcz/20240914.103753.9199/time.db\n2024-09-14
        10:37:53.676 INFO barriers.py:465 [bob] -- [Anonymous_job] Succeeded to create
        receiver proxy actor.\n2024-09-14 10:37:53.676 INFO barriers.py:520 [bob]
        -- [Anonymous_job] Try ping [''alice''] at 0 attemp, up to 3600 attemps.\n2024-09-14
        10:37:53.685 WARNING psi.py:361 [bob] -- [Anonymous_job] {''cluster_def'':
        {''nodes'': [{''party'': ''bob'', ''address'': ''0.0.0.0:20394'', ''listen_address'':
        ''''}, {''party'': ''alice'', ''address'': ''http://gsid-dwdkvwbe-node-35-0-spu.alice.svc:80'',
        ''listen_address'': ''''}], ''runtime_config'': {''protocol'': 2, ''field'':
        3}}, ''link_desc'': {''connect_retry_times'': 60, ''connect_retry_interval_ms'':
        1000, ''brpc_channel_protocol'': ''http'', ''brpc_channel_connection_type'':
        ''pooled'', ''recv_timeout_ms'': 1200000, ''http_timeout_ms'': 1200000}}\n2024-09-14
        10:37:55.340 ERROR component.py:1130 [bob] -- [Anonymous_job] eval on domain:
        \"data_prep\"\nname: \"psi\"\nversion: \"0.0.5\"\nattr_paths: \"input/receiver_input/key\"\nattr_paths:
        \"input/sender_input/key\"\nattr_paths: \"protocol\"\nattr_paths: \"sort_result\"\nattr_paths:
        \"allow_duplicate_keys\"\nattr_paths: \"allow_duplicate_keys/no/skip_duplicates_check\"\nattr_paths:
        \"fill_value_int\"\nattr_paths: \"ecdh_curve\"\nattrs {\n ss: \"id\"\n}\nattrs
        {\n ss: \"id\"\n}\nattrs {\n s: \"PROTOCOL_RR22\"\n}\nattrs {\n b: true\n}\nattrs
        {\n s: \"no\"\n}\nattrs {\n is_na: true\n}\nattrs {\n is_na: true\n}\nattrs
        {\n s: \"CURVE_FOURQ\"\n}\ninputs {\n name: \"alice1\"\n type: \"sf.table.individual\"\n meta
        {\n type_url: \"type.googleapis.com/secretflow.spec.v1.IndividualTable\"\n value:
        \"\\n\\t\\022\\002id*\\003int\\020\\377\\377\\377\\377\\377\\377\\377\\377\\377\\001\"\n }\n data_refs
        {\n uri: \"alice1_1010363635.csv\"\n party: \"alice\"\n format: \"csv\"\n }\n}\ninputs
        {\n name: \"bob1\"\n type: \"sf.table.individual\"\n meta {\n type_url:
        \"type.googleapis.com/secretflow.spec.v1.IndividualTable\"\n value: \"\\n\\t\\022\\002id*\\003int\\020\\377\\377\\377\\377\\377\\377\\377\\377\\377\\001\"\n }\n data_refs
        {\n uri: \"bob1_1907238687.csv\"\n party: \"bob\"\n format: \"csv\"\n }\n}\noutput_uris:
        \"gsid-dwdkvwbe-node-35-output-0\"\ncheckpoint_uri: \"ckgsid-dwdkvwbe-node-35-output-0\"\n
        failed, error <\x1b[36mray::_run()\x1b[39m (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)\n At
        least one of the input arguments for this task could not be computed:\nray.exceptions.RayTaskError:
        \x1b[36mray::_run()\x1b[39m (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)\n File
        \"/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py\",
        line 156, in _run\n return fn(*args, **kwargs)\n File \"/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py\",
        line 839, in download_file\n comp_storage.download_file(uri, output_path)\n File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py\",
        line 32, in download_file\n impl.download_file(remote_fn, local_fn)\n File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py\",
        line 171, in download_file\n assert os.path.exists(full_remote_fn)\nAssertionError>\n2024-09-14
        10:37:55.341 INFO api.py:342 [bob] -- [Anonymous_job] Shutdowning rayfed intendedly...\n2024-09-14
        10:37:55.341 INFO api.py:356 [bob] -- [Anonymous_job] No wait for data sending.\n2024-09-14
        10:37:55.342 INFO message_queue.py:72 [bob] -- [Anonymous_job] Notify message
        polling thread[DataSendingQueueThread] to exit.\n2024-09-14 10:37:55.342 INFO
        message_queue.py:72 [bob] -- [Anonymous_job] Notify message polling thread[ErrorSendingQueueThread]
        to exit.\n2024-09-14 10:37:55.342 INFO api.py:384 [bob] -- [Anonymous_job]
        Shutdowned rayfed.\n\x1b[33m(raylet)\x1b[0m [2024-09-14 10:37:54,186 I 9422
        9422] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL
        to -1\x1b[32m [repeated 3x across cluster] (Ray deduplicates logs by default.
        Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication
        for more options.)\x1b[0m\nTraceback (most recent call last):\n File \"/usr/local/lib/python3.10/runpy.py\",
        line 196, in _run_module_as_main\n return _run_code(code, main_globals,
        None,\n File \"/usr/local/lib/python3.10/runpy.py\", line 86, in _run_code\n exec(code,
        run_globals)\n File \"/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py\",
        line 547, in <module>\n main()\n File \"/usr/local/lib/python3.10/site-packages/click/core.py\",
        line 1157, in __call__\n return self.main(*args, **kwargs)\n File \"/usr/local/lib/python3.10/site-packages/click/core.py\",
        line 1078, in main\n rv = self.invoke(ctx)\n File \"/usr/local/lib/python3.10/site-packages/click/core.py\",
        line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File
        \"/usr/local/lib/python3.10/site-packages/click/core.py\", line 783, in invoke\n return
        __callback(*args, **kwargs)\n File \"/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py\",
        line 527, in main\n res = comp_eval(sf_node_eval_param, storage_config,
        sf_cluster_config)\n File \"/usr/local/lib/python3.10/site-packages/secretflow/component/entry.py\",
        line 176, in comp_eval\n res = comp.eval(\n File \"/usr/local/lib/python3.10/site-packages/secretflow/component/component.py\",
        line 1132, in eval\n raise e from None\n File \"/usr/local/lib/python3.10/site-packages/secretflow/component/component.py\",
        line 1127, in eval\n ret = self.__eval_callback(ctx=ctx, **kwargs)\n File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/preprocessing/data_prep/psi.py\",
        line 371, in two_party_balanced_psi_eval_fn\n download_files(ctx, uri,
        input_path)\n File \"/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py\",
        line 847, in download_files\n wait(waits)\n File \"/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py\",
        line 213, in wait\n reveal([o.device(lambda o: None)(o) for o in objs])\n File
        \"/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py\", line
        162, in reveal\n all_object = sfd.get(all_object_refs)\n File \"/usr/local/lib/python3.10/site-packages/secretflow/distributed/primitive.py\",
        line 156, in get\n return fed.get(object_refs)\n File \"/usr/local/lib/python3.10/site-packages/fed/api.py\",
        line 621, in get\n values = ray.get(ray_refs)\n File \"/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py\",
        line 22, in auto_init_wrapper\n return fn(*args, **kwargs)\n File \"/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py\",
        line 103, in wrapper\n return func(*args, **kwargs)\n File \"/usr/local/lib/python3.10/site-packages/ray/_private/worker.py\",
        line 2624, in get\n raise value.as_instanceof_cause()\nray.exceptions.RayTaskError(AssertionError):
        \x1b[36mray::_run()\x1b[39m (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)\n At
        least one of the input arguments for this task could not be computed:\nray.exceptions.RayTaskError:
        \x1b[36mray::_run()\x1b[39m (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)\n File
        \"/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py\",
        line 156, in _run\n return fn(*args, **kwargs)\n File \"/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py\",
        line 839, in download_file\n comp_storage.download_file(uri, output_path)\n File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py\",
        line 32, in download_file\n impl.download_file(remote_fn, local_fn)\n File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py\",
        line 171, in download_file\n assert os.path.exists(full_remote_fn)\nAssertionError\n"'
  serviceStatuses:
    alice/gsid-dwdkvwbe-node-35-0-fed:
      createTime: "2024-09-14T02:37:41Z"
      namespace: alice
      portName: fed
      portNumber: 21059
      readyTime: "2024-09-14T02:37:44Z"
      scope: Cluster
      serviceName: gsid-dwdkvwbe-node-35-0-fed
    alice/gsid-dwdkvwbe-node-35-0-global:
      createTime: "2024-09-14T02:37:41Z"
      namespace: alice
      portName: global
      portNumber: 21054
      readyTime: "2024-09-14T02:37:44Z"
      scope: Domain
      serviceName: gsid-dwdkvwbe-node-35-0-global
    alice/gsid-dwdkvwbe-node-35-0-spu:
      createTime: "2024-09-14T02:37:41Z"
      namespace: alice
      portName: spu
      portNumber: 21058
      readyTime: "2024-09-14T02:37:44Z"
      scope: Cluster
      serviceName: gsid-dwdkvwbe-node-35-0-spu
    bob/gsid-dwdkvwbe-node-35-0-fed:
      createTime: "2024-09-14T02:37:41Z"
      namespace: bob
      portName: fed
      portNumber: 20395
      readyTime: "2024-09-14T02:37:43Z"
      scope: Cluster
      serviceName: gsid-dwdkvwbe-node-35-0-fed
    bob/gsid-dwdkvwbe-node-35-0-global:
      createTime: "2024-09-14T02:37:41Z"
      namespace: bob
      portName: global
      portNumber: 20390
      readyTime: "2024-09-14T02:37:43Z"
      scope: Domain
      serviceName: gsid-dwdkvwbe-node-35-0-global
    bob/gsid-dwdkvwbe-node-35-0-spu:
      createTime: "2024-09-14T02:37:41Z"
      namespace: bob
      portName: spu
      portNumber: 20394
      readyTime: "2024-09-14T02:37:43Z"
      scope: Cluster
      serviceName: gsid-dwdkvwbe-node-35-0-spu
  startTime: "2024-09-14T02:37:41Z"
Please provide the pod logs from both parties, following this doc: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.11.0b0/troubleshoot/run_job_failed#id6
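For example (a sketch assuming the lite deployment names shown in the task status above; the authoritative log paths are in the linked doc):

# Open a shell in the alice lite container, then look at the engine stdout logs
kubectl exec -it deploy/kuscia-lite-alice -- bash   # add -n <your-namespace> if needed
ls /home/kuscia/var/stdout/pods/                    # per-task engine (pod) logs in RunP mode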
Pod logs on the alice node:
WARNING:root:Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.
2024-09-14 10:37:47,052|alice|INFO|secretflow|entry.py:start_ray:59| ray_conf: RayConfig(ray_node_ip_address='gsid-dwdkvwbe-node-35-0-global.alice.svc', ray_node_manager_port=21055, ray_object_manager_port=21056, ray_client_server_port=21057, ray_worker_ports=[], ray_gcs_port=21054)
2024-09-14 10:37:47,058|alice|INFO|secretflow|entry.py:start_ray:67| Trying to start ray head node at gsid-dwdkvwbe-node-35-0-global.alice.svc, start command: ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=gsid-dwdkvwbe-node-35-0-global.alice.svc --port=21054 --node-manager-port=21055 --object-manager-port=21056 --ray-client-server-port=21057
2024-09-14 10:37:51,042|alice|INFO|secretflow|entry.py:start_ray:80| 2024-09-14 10:37:47,713 INFO usage_lib.py:423 -- Usage stats collection is disabled.
2024-09-14 10:37:47,713 INFO scripts.py:744 -- Local node IP: gsid-dwdkvwbe-node-35-0-global.alice.svc
2024-09-14 10:37:50,726 SUCC scripts.py:781 -- --------------------
2024-09-14 10:37:50,727 SUCC scripts.py:782 -- Ray runtime started.
2024-09-14 10:37:50,727 SUCC scripts.py:783 -- --------------------
2024-09-14 10:37:50,727 INFO scripts.py:785 -- Next steps
2024-09-14 10:37:50,727 INFO scripts.py:788 -- To add another node to this Ray cluster, run
2024-09-14 10:37:50,727 INFO scripts.py:791 -- ray start --address='gsid-dwdkvwbe-node-35-0-global.alice.svc:21054'
2024-09-14 10:37:50,727 INFO scripts.py:800 -- To connect to this Ray cluster:
2024-09-14 10:37:50,728 INFO scripts.py:802 -- import ray
2024-09-14 10:37:50,728 INFO scripts.py:803 -- ray.init(_node_ip_address='gsid-dwdkvwbe-node-35-0-global.alice.svc')
2024-09-14 10:37:50,728 INFO scripts.py:834 -- To terminate the Ray runtime, run
2024-09-14 10:37:50,728 INFO scripts.py:835 -- ray stop
2024-09-14 10:37:50,728 INFO scripts.py:838 -- To view the status of the cluster, use
2024-09-14 10:37:50,728 INFO scripts.py:839 -- ray status
2024-09-14 10:37:51,042|alice|INFO|secretflow|entry.py:start_ray:81| Succeeded to start ray head node at gsid-dwdkvwbe-node-35-0-global.alice.svc.
2024-09-14 10:37:51,047|alice|INFO|secretflow|entry.py:main:510| datasource.access_directly True
sf_node_eval_param {
"domain": "data_prep",
"name": "psi",
"version": "0.0.5",
"attrPaths": [
"input/receiver_input/key",
"input/sender_input/key",
"protocol",
"sort_result",
"allow_duplicate_keys",
"allow_duplicate_keys/no/skip_duplicates_check",
"fill_value_int",
"ecdh_curve"
],
"attrs": [
{
"ss": [
"id"
]
},
{
"ss": [
"id"
]
},
{
"s": "PROTOCOL_RR22"
},
{
"b": true
},
{
"s": "no"
},
{
"isNa": true
},
{
"isNa": true
},
{
"s": "CURVE_FOURQ"
}
],
"inputs": [
{
"type": "sf.table.individual",
"meta": {
"@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
"lineCount": "-1"
},
"dataRefs": [
{
"uri": "alice1_1010363635.csv",
"party": "alice",
"format": "csv"
}
]
},
{
"type": "sf.table.individual",
"meta": {
"@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
"lineCount": "-1"
},
"dataRefs": [
{
"uri": "bob1_1907238687.csv",
"party": "bob",
"format": "csv"
}
]
}
],
"checkpointUri": "ckgsid-dwdkvwbe-node-35-output-0"
}
2024-09-14 10:37:51,059|alice|WARNING|secretflow|meta_conversion.py:convert_domain_data_to_individual_table:29| kuscia adapter has to deduce dist data from domain data at this moment.
2024-09-14 10:37:51,059|alice|INFO|secretflow|entry.py:domaindata_id_to_dist_data:160| domaindata_id astrqxxq to
...........
name: "alice1"
type: "sf.table.individual"
meta {
type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
}
data_refs {
uri: "alice1_1010363635.csv"
party: "alice"
format: "csv"
}
....
2024-09-14 10:37:51,070|alice|WARNING|secretflow|meta_conversion.py:convert_domain_data_to_individual_table:29| kuscia adapter has to deduce dist data from domain data at this moment.
2024-09-14 10:37:51,070|alice|INFO|secretflow|entry.py:domaindata_id_to_dist_data:160| domaindata_id yxcxhdat to
...........
name: "bob1"
type: "sf.table.individual"
meta {
type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
}
data_refs {
uri: "bob1_1907238687.csv"
party: "bob"
format: "csv"
}
....
2024-09-14 10:37:51,071|alice|WARNING|secretflow|entry.py:comp_eval:169|
--
Secretflow 1.7.0b0
Build time (Jun 25 2024, 11:25:31) with commit id: d08547cb86d07d5515e8b997236fad81972cdef7
--
2024-09-14 10:37:51,071|alice|WARNING|secretflow|entry.py:comp_eval:170|
--
*param*
domain: "data_prep"
name: "psi"
version: "0.0.5"
attr_paths: "input/receiver_input/key"
attr_paths: "input/sender_input/key"
attr_paths: "protocol"
attr_paths: "sort_result"
attr_paths: "allow_duplicate_keys"
attr_paths: "allow_duplicate_keys/no/skip_duplicates_check"
attr_paths: "fill_value_int"
attr_paths: "ecdh_curve"
attrs {
ss: "id"
}
attrs {
ss: "id"
}
attrs {
s: "PROTOCOL_RR22"
}
attrs {
b: true
}
attrs {
s: "no"
}
attrs {
is_na: true
}
attrs {
is_na: true
}
attrs {
s: "CURVE_FOURQ"
}
inputs {
name: "alice1"
type: "sf.table.individual"
meta {
type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
}
data_refs {
uri: "alice1_1010363635.csv"
party: "alice"
format: "csv"
}
}
inputs {
name: "bob1"
type: "sf.table.individual"
meta {
type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
}
data_refs {
uri: "bob1_1907238687.csv"
party: "bob"
format: "csv"
}
}
output_uris: "gsid-dwdkvwbe-node-35-output-0"
checkpoint_uri: "ckgsid-dwdkvwbe-node-35-output-0"
--
2024-09-14 10:37:51,071|alice|WARNING|secretflow|entry.py:comp_eval:171|
--
*storage_config*
type: "local_fs"
local_fs {
wd: "/home/kuscia/var/storage/data"
}
--
2024-09-14 10:37:51,071|alice|WARNING|secretflow|entry.py:comp_eval:172|
--
*cluster_config*
desc {
parties: "bob"
parties: "alice"
devices {
name: "spu"
type: "spu"
parties: "bob"
parties: "alice"
config: "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
}
devices {
name: "heu"
type: "heu"
parties: "bob"
parties: "alice"
config: "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
}
ray_fed_config {
cross_silo_comm_backend: "brpc_link"
}
}
public_config {
ray_fed_config {
parties: "bob"
parties: "alice"
addresses: "gsid-dwdkvwbe-node-35-0-fed.bob.svc:80"
addresses: "0.0.0.0:21059"
}
spu_configs {
name: "spu"
parties: "bob"
parties: "alice"
addresses: "http://gsid-dwdkvwbe-node-35-0-spu.bob.svc:80"
addresses: "0.0.0.0:21058"
}
}
private_config {
self_party: "alice"
ray_head_addr: "gsid-dwdkvwbe-node-35-0-global.alice.svc:21054"
}
--
2024-09-14 10:37:51,074|alice|WARNING|secretflow|driver.py:init:442| When connecting to an existing cluster, num_cpus must not be provided. Num_cpus is neglected at this moment.
2024-09-14 10:37:51,074 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: gsid-dwdkvwbe-node-35-0-global.alice.svc:21054...
2024-09-14 10:37:51,087|alice|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140509199005728 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/node_ip_address.json.lock
2024-09-14 10:37:51,087|alice|DEBUG|secretflow|_api.py:acquire:334| Lock 140509199005728 acquired on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/node_ip_address.json.lock
2024-09-14 10:37:51,087|alice|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140509199005728 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/node_ip_address.json.lock
2024-09-14 10:37:51,088|alice|DEBUG|secretflow|_api.py:release:367| Lock 140509199005728 released on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/node_ip_address.json.lock
2024-09-14 10:37:51,092|alice|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140509199005824 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,092|alice|DEBUG|secretflow|_api.py:acquire:334| Lock 140509199005824 acquired on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,093|alice|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140509199005824 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,093|alice|DEBUG|secretflow|_api.py:release:367| Lock 140509199005824 released on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,093|alice|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140509199005584 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,093|alice|DEBUG|secretflow|_api.py:acquire:334| Lock 140509199005584 acquired on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,093|alice|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140509199005584 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,094|alice|DEBUG|secretflow|_api.py:release:367| Lock 140509199005584 released on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,094|alice|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140509199005824 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,094|alice|DEBUG|secretflow|_api.py:acquire:334| Lock 140509199005824 acquired on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,094|alice|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140509199005824 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,094|alice|DEBUG|secretflow|_api.py:release:367| Lock 140509199005824 released on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,094|alice|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140509199005584 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,095|alice|DEBUG|secretflow|_api.py:acquire:334| Lock 140509199005584 acquired on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,095|alice|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140509199005584 on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,095|alice|DEBUG|secretflow|_api.py:release:367| Lock 140509199005584 released on /tmp/ray/session_2024-09-14_10-37-47_714211_7252/ports_by_node.json.lock
2024-09-14 10:37:51,095 INFO worker.py:1724 -- Connected to Ray cluster.
2024-09-14 10:37:51.870 INFO api.py:233 [alice] -- [Anonymous_job] Started rayfed with {'CLUSTER_ADDRESSES': {'bob': 'http://gsid-dwdkvwbe-node-35-0-fed.bob.svc:80', 'alice': '0.0.0.0:21059'}, 'CURRENT_PARTY_NAME': 'alice', 'TLS_CONFIG': {}}
(raylet) [2024-09-14 10:37:52,467 I 9291 9291] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
(SenderReceiverProxyActor pid=9291) 2024-09-14 10:37:53.277 INFO link.py:38 [alice] -- [Anonymous_job] brpc options: {'proxy_max_restarts': 3, 'timeout_in_ms': 300000, 'recv_timeout_ms': 604800000, 'connect_retry_times': 3600, 'connect_retry_interval_ms': 1000, 'brpc_channel_protocol': 'http', 'brpc_channel_connection_type': 'pooled', 'exit_on_sending_failure': True}
(SenderReceiverProxyActor pid=9291) I0914 10:37:53.306789 9291 external/com_github_brpc_brpc/src/brpc/server.cpp:1181] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=21059.
(SenderReceiverProxyActor pid=9291) W0914 10:37:53.306837 9291 external/com_github_brpc_brpc/src/brpc/server.cpp:1187] Builtin services are disabled according to ServerOptions.has_builtin_services
2024-09-14 10:37:53.675 INFO barriers.py:465 [alice] -- [Anonymous_job] Succeeded to create receiver proxy actor.
2024-09-14 10:37:53.675 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 0 attemp, up to 3600 attemps.
2024-09-14 10:37:53.683 WARNING psi.py:361 [alice] -- [Anonymous_job] {'cluster_def': {'nodes': [{'party': 'bob', 'address': 'http://gsid-dwdkvwbe-node-35-0-spu.bob.svc:80', 'listen_address': ''}, {'party': 'alice', 'address': '0.0.0.0:21058', 'listen_address': ''}], 'runtime_config': {'protocol': 2, 'field': 3}}, 'link_desc': {'connect_retry_times': 60, 'connect_retry_interval_ms': 1000, 'brpc_channel_protocol': 'http', 'brpc_channel_connection_type': 'pooled', 'recv_timeout_ms': 1200000, 'http_timeout_ms': 1200000}}
(SenderReceiverProxyActor pid=9291) I0914 10:37:53.680885 9513 external/com_github_brpc_brpc/src/brpc/span.cpp:506] Opened ./rpc_data/rpcz/20240914.103753.9291/id.db and ./rpc_data/rpcz/20240914.103753.9291/time.db
2024-09-14 10:37:55.665 ERROR component.py:1130 [alice] -- [Anonymous_job] eval on domain: "data_prep"
name: "psi"
version: "0.0.5"
attr_paths: "input/receiver_input/key"
attr_paths: "input/sender_input/key"
attr_paths: "protocol"
attr_paths: "sort_result"
attr_paths: "allow_duplicate_keys"
attr_paths: "allow_duplicate_keys/no/skip_duplicates_check"
attr_paths: "fill_value_int"
attr_paths: "ecdh_curve"
attrs {
ss: "id"
}
attrs {
ss: "id"
}
attrs {
s: "PROTOCOL_RR22"
}
attrs {
b: true
}
attrs {
s: "no"
}
attrs {
is_na: true
}
attrs {
is_na: true
}
attrs {
s: "CURVE_FOURQ"
}
inputs {
name: "alice1"
type: "sf.table.individual"
meta {
type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
}
data_refs {
uri: "alice1_1010363635.csv"
party: "alice"
format: "csv"
}
}
inputs {
name: "bob1"
type: "sf.table.individual"
meta {
type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
}
data_refs {
uri: "bob1_1907238687.csv"
party: "bob"
format: "csv"
}
}
output_uris: "gsid-dwdkvwbe-node-35-output-0"
checkpoint_uri: "ckgsid-dwdkvwbe-node-35-output-0"
failed, error <ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
File "/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py", line 156, in _run
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 839, in download_file
comp_storage.download_file(uri, output_path)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py", line 32, in download_file
impl.download_file(remote_fn, local_fn)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py", line 171, in download_file
assert os.path.exists(full_remote_fn)
AssertionError>
2024-09-14 10:37:55.666 INFO api.py:342 [alice] -- [Anonymous_job] Shutdowning rayfed intendedly...
2024-09-14 10:37:55.666 INFO api.py:356 [alice] -- [Anonymous_job] No wait for data sending.
2024-09-14 10:37:55.668 INFO message_queue.py:72 [alice] -- [Anonymous_job] Notify message polling thread[DataSendingQueueThread] to exit.
2024-09-14 10:37:55.669 INFO message_queue.py:72 [alice] -- [Anonymous_job] Notify message polling thread[ErrorSendingQueueThread] to exit.
2024-09-14 10:37:55.669 INFO api.py:384 [alice] -- [Anonymous_job] Shutdowned rayfed.
2024-09-14 10:37:55.670 WARNING cleanup.py:154 [alice] -- [Anonymous_job] Failed to send ObjectRef(82891771158d68c1fcce2f44215c103cf6cd60270100000001000000) to bob, error: ray::SenderReceiverProxyActor.send() (pid=9291, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc, actor_id=fcce2f44215c103cf6cd602701000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fec182ddde0>)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
File "/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py", line 156, in _run
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 839, in download_file
comp_storage.download_file(uri, output_path)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py", line 32, in download_file
impl.download_file(remote_fn, local_fn)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py", line 171, in download_file
assert os.path.exists(full_remote_fn)
AssertionError,upstream_seq_id: 7#0, downstream_seq_id: 9.
2024-09-14 10:37:55.670 INFO cleanup.py:161 [alice] -- [Anonymous_job] Sending error to bob.
Exception in thread DataSendingQueueThread:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/fed/cleanup.py", line 152, in _process_data_sending_task_return
res = ray.get(obj_ref)
File "/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::SenderReceiverProxyActor.send() (pid=9291, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc, actor_id=fcce2f44215c103cf6cd602701000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fec182ddde0>)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
File "/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py", line 156, in _run
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 839, in download_file
comp_storage.download_file(uri, output_path)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py", line 32, in download_file
impl.download_file(remote_fn, local_fn)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py", line 171, in download_file
assert os.path.exists(full_remote_fn)
AssertionError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/site-packages/fed/_private/message_queue.py", line 51, in _loop
res = self._msg_handler(message)
File "/usr/local/lib/python3.10/site-packages/fed/cleanup.py", line 47, in <lambda>
lambda msg: self._process_data_sending_task_return(msg),
File "/usr/local/lib/python3.10/site-packages/fed/cleanup.py", line 166, in _process_data_sending_task_return
send(
File "/usr/local/lib/python3.10/site-packages/fed/proxy/barriers.py", line 502, in send
get_global_context().get_cleanup_manager().push_to_sending(
AttributeError: 'NoneType' object has no attribute 'get_cleanup_manager'
(raylet) [2024-09-14 10:37:54,180 I 9514 9514] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
Traceback (most recent call last):
File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 547, in <module>
main()
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 527, in main
res = comp_eval(sf_node_eval_param, storage_config, sf_cluster_config)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/entry.py", line 176, in comp_eval
res = comp.eval(
File "/usr/local/lib/python3.10/site-packages/secretflow/component/component.py", line 1132, in eval
raise e from None
File "/usr/local/lib/python3.10/site-packages/secretflow/component/component.py", line 1127, in eval
ret = self.__eval_callback(ctx=ctx, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/preprocessing/data_prep/psi.py", line 371, in two_party_balanced_psi_eval_fn
download_files(ctx, uri, input_path)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 847, in download_files
wait(waits)
File "/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py", line 213, in wait
reveal([o.device(lambda o: None)(o) for o in objs])
File "/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py", line 162, in reveal
all_object = sfd.get(all_object_refs)
File "/usr/local/lib/python3.10/site-packages/secretflow/distributed/primitive.py", line 156, in get
return fed.get(object_refs)
File "/usr/local/lib/python3.10/site-packages/fed/api.py", line 621, in get
values = ray.get(ray_refs)
File "/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7678, ip=gsid-dwdkvwbe-node-35-0-global.alice.svc)
File "/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py", line 156, in _run
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 839, in download_file
comp_storage.download_file(uri, output_path)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py", line 32, in download_file
impl.download_file(remote_fn, local_fn)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py", line 171, in download_file
assert os.path.exists(full_remote_fn)
AssertionError
Please refer to this document and provide the pod logs from both parties: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.11.0b0/troubleshoot/run_job_failed#id6
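For reference, a minimal sketch of collecting those logs with kubectl (the pod names are placeholders; substitute the actual task pod names from your cluster):

# List the task pods on each side, then dump their logs to files.
kubectl get pods -n alice
kubectl logs -n alice <alice-task-pod> > alice-pod.log
kubectl get pods -n bob
kubectl logs -n bob <bob-task-pod> > bob-pod.log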
Pod logs from the bob node:
WARNING:root:Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.
2024-09-14 10:37:46,688|bob|INFO|secretflow|entry.py:start_ray:59| ray_conf: RayConfig(ray_node_ip_address='gsid-dwdkvwbe-node-35-0-global.bob.svc', ray_node_manager_port=20391, ray_object_manager_port=20392, ray_client_server_port=20393, ray_worker_ports=[], ray_gcs_port=20390)
2024-09-14 10:37:46,694|bob|INFO|secretflow|entry.py:start_ray:67| Trying to start ray head node at gsid-dwdkvwbe-node-35-0-global.bob.svc, start command: ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=gsid-dwdkvwbe-node-35-0-global.bob.svc --port=20390 --node-manager-port=20391 --object-manager-port=20392 --ray-client-server-port=20393
2024-09-14 10:37:50,465|bob|INFO|secretflow|entry.py:start_ray:80| 2024-09-14 10:37:47,288 INFO usage_lib.py:423 -- Usage stats collection is disabled.
2024-09-14 10:37:47,288 INFO scripts.py:744 -- Local node IP: gsid-dwdkvwbe-node-35-0-global.bob.svc
2024-09-14 10:37:50,314 SUCC scripts.py:781 -- --------------------
2024-09-14 10:37:50,314 SUCC scripts.py:782 -- Ray runtime started.
2024-09-14 10:37:50,314 SUCC scripts.py:783 -- --------------------
2024-09-14 10:37:50,314 INFO scripts.py:785 -- Next steps
2024-09-14 10:37:50,315 INFO scripts.py:788 -- To add another node to this Ray cluster, run
2024-09-14 10:37:50,315 INFO scripts.py:791 -- ray start --address='gsid-dwdkvwbe-node-35-0-global.bob.svc:20390'
2024-09-14 10:37:50,315 INFO scripts.py:800 -- To connect to this Ray cluster:
2024-09-14 10:37:50,315 INFO scripts.py:802 -- import ray
2024-09-14 10:37:50,315 INFO scripts.py:803 -- ray.init(_node_ip_address='gsid-dwdkvwbe-node-35-0-global.bob.svc')
2024-09-14 10:37:50,315 INFO scripts.py:834 -- To terminate the Ray runtime, run
2024-09-14 10:37:50,315 INFO scripts.py:835 -- ray stop
2024-09-14 10:37:50,315 INFO scripts.py:838 -- To view the status of the cluster, use
2024-09-14 10:37:50,315 INFO scripts.py:839 -- ray status
2024-09-14 10:37:50,465|bob|INFO|secretflow|entry.py:start_ray:81| Succeeded to start ray head node at gsid-dwdkvwbe-node-35-0-global.bob.svc.
2024-09-14 10:37:50,470|bob|INFO|secretflow|entry.py:main:510| datasource.access_directly True
sf_node_eval_param {
"domain": "data_prep",
"name": "psi",
"version": "0.0.5",
"attrPaths": [
"input/receiver_input/key",
"input/sender_input/key",
"protocol",
"sort_result",
"allow_duplicate_keys",
"allow_duplicate_keys/no/skip_duplicates_check",
"fill_value_int",
"ecdh_curve"
],
"attrs": [
{
"ss": [
"id"
]
},
{
"ss": [
"id"
]
},
{
"s": "PROTOCOL_RR22"
},
{
"b": true
},
{
"s": "no"
},
{
"isNa": true
},
{
"isNa": true
},
{
"s": "CURVE_FOURQ"
}
],
"inputs": [
{
"type": "sf.table.individual",
"meta": {
"@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
"lineCount": "-1"
},
"dataRefs": [
{
"uri": "alice1_1010363635.csv",
"party": "alice",
"format": "csv"
}
]
},
{
"type": "sf.table.individual",
"meta": {
"@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
"lineCount": "-1"
},
"dataRefs": [
{
"uri": "bob1_1907238687.csv",
"party": "bob",
"format": "csv"
}
]
}
],
"checkpointUri": "ckgsid-dwdkvwbe-node-35-output-0"
}
2024-09-14 10:37:50,482|bob|WARNING|secretflow|meta_conversion.py:convert_domain_data_to_individual_table:29| kuscia adapter has to deduce dist data from domain data at this moment.
2024-09-14 10:37:50,482|bob|INFO|secretflow|entry.py:domaindata_id_to_dist_data:160| domaindata_id astrqxxq to
...........
name: "alice1"
type: "sf.table.individual"
meta {
type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
}
data_refs {
uri: "alice1_1010363635.csv"
party: "alice"
format: "csv"
}
....
2024-09-14 10:37:50,492|bob|WARNING|secretflow|meta_conversion.py:convert_domain_data_to_individual_table:29| kuscia adapter has to deduce dist data from domain data at this moment.
2024-09-14 10:37:50,492|bob|INFO|secretflow|entry.py:domaindata_id_to_dist_data:160| domaindata_id yxcxhdat to
...........
name: "bob1"
type: "sf.table.individual"
meta {
type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
}
data_refs {
uri: "bob1_1907238687.csv"
party: "bob"
format: "csv"
}
....
2024-09-14 10:37:50,492|bob|WARNING|secretflow|entry.py:comp_eval:169|
--
Secretflow 1.7.0b0
Build time (Jun 25 2024, 11:25:31) with commit id: d08547cb86d07d5515e8b997236fad81972cdef7
--
2024-09-14 10:37:50,493|bob|WARNING|secretflow|entry.py:comp_eval:170|
--
*param*
domain: "data_prep"
name: "psi"
version: "0.0.5"
attr_paths: "input/receiver_input/key"
attr_paths: "input/sender_input/key"
attr_paths: "protocol"
attr_paths: "sort_result"
attr_paths: "allow_duplicate_keys"
attr_paths: "allow_duplicate_keys/no/skip_duplicates_check"
attr_paths: "fill_value_int"
attr_paths: "ecdh_curve"
attrs {
ss: "id"
}
attrs {
ss: "id"
}
attrs {
s: "PROTOCOL_RR22"
}
attrs {
b: true
}
attrs {
s: "no"
}
attrs {
is_na: true
}
attrs {
is_na: true
}
attrs {
s: "CURVE_FOURQ"
}
inputs {
name: "alice1"
type: "sf.table.individual"
meta {
type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
}
data_refs {
uri: "alice1_1010363635.csv"
party: "alice"
format: "csv"
}
}
inputs {
name: "bob1"
type: "sf.table.individual"
meta {
type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
}
data_refs {
uri: "bob1_1907238687.csv"
party: "bob"
format: "csv"
}
}
output_uris: "gsid-dwdkvwbe-node-35-output-0"
checkpoint_uri: "ckgsid-dwdkvwbe-node-35-output-0"
--
2024-09-14 10:37:50,493|bob|WARNING|secretflow|entry.py:comp_eval:171|
--
*storage_config*
type: "local_fs"
local_fs {
wd: "/home/kuscia/var/storage/data"
}
--
2024-09-14 10:37:50,493|bob|WARNING|secretflow|entry.py:comp_eval:172|
--
*cluster_config*
desc {
parties: "bob"
parties: "alice"
devices {
name: "spu"
type: "spu"
parties: "bob"
parties: "alice"
config: "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
}
devices {
name: "heu"
type: "heu"
parties: "bob"
parties: "alice"
config: "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
}
ray_fed_config {
cross_silo_comm_backend: "brpc_link"
}
}
public_config {
ray_fed_config {
parties: "bob"
parties: "alice"
addresses: "0.0.0.0:20395"
addresses: "gsid-dwdkvwbe-node-35-0-fed.alice.svc:80"
}
spu_configs {
name: "spu"
parties: "bob"
parties: "alice"
addresses: "0.0.0.0:20394"
addresses: "http://gsid-dwdkvwbe-node-35-0-spu.alice.svc:80"
}
}
private_config {
self_party: "bob"
ray_head_addr: "gsid-dwdkvwbe-node-35-0-global.bob.svc:20390"
}
--
2024-09-14 10:37:50,495|bob|WARNING|secretflow|driver.py:init:442| When connecting to an existing cluster, num_cpus must not be provided. Num_cpus is neglected at this moment.
2024-09-14 10:37:50,496 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: gsid-dwdkvwbe-node-35-0-global.bob.svc:20390...
2024-09-14 10:37:50,508|bob|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140454971734048 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/node_ip_address.json.lock
2024-09-14 10:37:50,509|bob|DEBUG|secretflow|_api.py:acquire:334| Lock 140454971734048 acquired on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/node_ip_address.json.lock
2024-09-14 10:37:50,509|bob|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140454971734048 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/node_ip_address.json.lock
2024-09-14 10:37:50,509|bob|DEBUG|secretflow|_api.py:release:367| Lock 140454971734048 released on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/node_ip_address.json.lock
2024-09-14 10:37:50,513|bob|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140454971734144 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,514|bob|DEBUG|secretflow|_api.py:acquire:334| Lock 140454971734144 acquired on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,514|bob|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140454971734144 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,514|bob|DEBUG|secretflow|_api.py:release:367| Lock 140454971734144 released on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,514|bob|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140454971733904 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,514|bob|DEBUG|secretflow|_api.py:acquire:334| Lock 140454971733904 acquired on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,515|bob|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140454971733904 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,515|bob|DEBUG|secretflow|_api.py:release:367| Lock 140454971733904 released on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,515|bob|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140454971734144 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,515|bob|DEBUG|secretflow|_api.py:acquire:334| Lock 140454971734144 acquired on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,515|bob|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140454971734144 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,515|bob|DEBUG|secretflow|_api.py:release:367| Lock 140454971734144 released on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,516|bob|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140454971733904 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,516|bob|DEBUG|secretflow|_api.py:acquire:334| Lock 140454971733904 acquired on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,516|bob|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140454971733904 on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,516|bob|DEBUG|secretflow|_api.py:release:367| Lock 140454971733904 released on /tmp/ray/session_2024-09-14_10-37-47_289284_7158/ports_by_node.json.lock
2024-09-14 10:37:50,516 INFO worker.py:1724 -- Connected to Ray cluster.
2024-09-14 10:37:51.327 INFO api.py:233 [bob] -- [Anonymous_job] Started rayfed with {'CLUSTER_ADDRESSES': {'bob': '0.0.0.0:20395', 'alice': 'http://gsid-dwdkvwbe-node-35-0-fed.alice.svc:80'}, 'CURRENT_PARTY_NAME': 'bob', 'TLS_CONFIG': {}}
(raylet) [2024-09-14 10:37:51,273 I 7581 7581] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
(SenderReceiverProxyActor pid=9199) 2024-09-14 10:37:52.620 INFO link.py:38 [bob] -- [Anonymous_job] brpc options: {'proxy_max_restarts': 3, 'timeout_in_ms': 300000, 'recv_timeout_ms': 604800000, 'connect_retry_times': 3600, 'connect_retry_interval_ms': 1000, 'brpc_channel_protocol': 'http', 'brpc_channel_connection_type': 'pooled', 'exit_on_sending_failure': True}
(SenderReceiverProxyActor pid=9199) I0914 10:37:52.646880 9199 external/com_github_brpc_brpc/src/brpc/server.cpp:1181] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=20395.
(SenderReceiverProxyActor pid=9199) W0914 10:37:52.646909 9199 external/com_github_brpc_brpc/src/brpc/server.cpp:1187] Builtin services are disabled according to ServerOptions.has_builtin_services
(SenderReceiverProxyActor pid=9199) I0914 10:37:53.321158 9421 external/com_github_brpc_brpc/src/brpc/span.cpp:506] Opened ./rpc_data/rpcz/20240914.103753.9199/id.db and ./rpc_data/rpcz/20240914.103753.9199/time.db
2024-09-14 10:37:53.676 INFO barriers.py:465 [bob] -- [Anonymous_job] Succeeded to create receiver proxy actor.
2024-09-14 10:37:53.676 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 0 attemp, up to 3600 attemps.
2024-09-14 10:37:53.685 WARNING psi.py:361 [bob] -- [Anonymous_job] {'cluster_def': {'nodes': [{'party': 'bob', 'address': '0.0.0.0:20394', 'listen_address': ''}, {'party': 'alice', 'address': 'http://gsid-dwdkvwbe-node-35-0-spu.alice.svc:80', 'listen_address':''}], 'runtime_config': {'protocol': 2, 'field': 3}}, 'link_desc': {'connect_retry_times': 60, 'connect_retry_interval_ms': 1000, 'brpc_channel_protocol': 'http', 'brpc_channel_connection_type': 'pooled', 'recv_timeout_ms': 1200000, 'http_timeout_ms': 1200000}}
2024-09-14 10:37:55.340 ERROR component.py:1130 [bob] -- [Anonymous_job] eval on domain: "data_prep"
name: "psi"
version: "0.0.5"
attr_paths: "input/receiver_input/key"
attr_paths: "input/sender_input/key"
attr_paths: "protocol"
attr_paths: "sort_result"
attr_paths: "allow_duplicate_keys"
attr_paths: "allow_duplicate_keys/no/skip_duplicates_check"
attr_paths: "fill_value_int"
attr_paths: "ecdh_curve"
attrs {
ss: "id"
}
attrs {
ss: "id"
}
attrs {
s: "PROTOCOL_RR22"
}
attrs {
b: true
}
attrs {
s: "no"
}
attrs {
is_na: true
}
attrs {
is_na: true
}
attrs {
s: "CURVE_FOURQ"
}
inputs {
name: "alice1"
type: "sf.table.individual"
meta {
type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
}
data_refs {
uri: "alice1_1010363635.csv"
party: "alice"
format: "csv"
}
}
inputs {
name: "bob1"
type: "sf.table.individual"
meta {
type_url: "type.googleapis.com/secretflow.spec.v1.IndividualTable"
value: "\n\t\022\002id*\003int\020\377\377\377\377\377\377\377\377\377\001"
}
data_refs {
uri: "bob1_1907238687.csv"
party: "bob"
format: "csv"
}
}
output_uris: "gsid-dwdkvwbe-node-35-output-0"
checkpoint_uri: "ckgsid-dwdkvwbe-node-35-output-0"
failed, error <ray::_run() (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)
File "/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py", line 156, in _run
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 839, in download_file
comp_storage.download_file(uri, output_path)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py", line 32, in download_file
impl.download_file(remote_fn, local_fn)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py", line 171, in download_file
assert os.path.exists(full_remote_fn)
AssertionError>
2024-09-14 10:37:55.341 INFO api.py:342 [bob] -- [Anonymous_job] Shutdowning rayfed intendedly...
2024-09-14 10:37:55.341 INFO api.py:356 [bob] -- [Anonymous_job] No wait for data sending.
2024-09-14 10:37:55.342 INFO message_queue.py:72 [bob] -- [Anonymous_job] Notify message polling thread[DataSendingQueueThread] to exit.
2024-09-14 10:37:55.342 INFO message_queue.py:72 [bob] -- [Anonymous_job] Notify message polling thread[ErrorSendingQueueThread] to exit.
2024-09-14 10:37:55.342 INFO api.py:384 [bob] -- [Anonymous_job] Shutdowned rayfed.
(raylet) [2024-09-14 10:37:54,186 I 9422 9422] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1 [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
Traceback (most recent call last):
File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 547, in <module>
main()
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 527, in main
res = comp_eval(sf_node_eval_param, storage_config, sf_cluster_config)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/entry.py", line 176, in comp_eval
res = comp.eval(
File "/usr/local/lib/python3.10/site-packages/secretflow/component/component.py", line 1132, in eval
raise e from None
File "/usr/local/lib/python3.10/site-packages/secretflow/component/component.py", line 1127, in eval
ret = self.__eval_callback(ctx=ctx, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/preprocessing/data_prep/psi.py", line 371, in two_party_balanced_psi_eval_fn
download_files(ctx, uri, input_path)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 847, in download_files
wait(waits)
File "/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py", line 213, in wait
reveal([o.device(lambda o: None)(o) for o in objs])
File "/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py", line 162, in reveal
all_object = sfd.get(all_object_refs)
File "/usr/local/lib/python3.10/site-packages/secretflow/distributed/primitive.py", line 156, in get
return fed.get(object_refs)
File "/usr/local/lib/python3.10/site-packages/fed/api.py", line 621, in get
values = ray.get(ray_refs)
File "/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::_run() (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=7577, ip=gsid-dwdkvwbe-node-35-0-global.bob.svc)
File "/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py", line 156, in _run
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 839, in download_file
comp_storage.download_file(uri, output_path)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py", line 32, in download_file
impl.download_file(remote_fn, local_fn)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py", line 171, in download_file
assert os.path.exists(full_remote_fn)
AssertionError
The error shows that the actual physical file cannot be found. If this is user-supplied data, the physical files need to be placed under /home/kuscia/var/storage/data on the alice and bob nodes respectively: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.9.0b0/deployment/K8s_deployment_kuscia/K8s_master_lite_cn#id11
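For example, a sketch using kubectl cp (the pod names are placeholders; the CSV names are taken from the job logs above, and under RunP the local_fs working directory lives inside each party's kuscia lite pod or on its mounted volume):

# Copy each party's physical file into that party's data directory.
# Replace <namespace>/<pod> with real values from `kubectl get pods`.
kubectl cp alice1_1010363635.csv <namespace>/<kuscia-lite-alice-pod>:/home/kuscia/var/storage/data/alice1_1010363635.csv
kubectl cp bob1_1907238687.csv <namespace>/<kuscia-lite-bob-pod>:/home/kuscia/var/storage/data/bob1_1907238687.csv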
I uploaded the data source through the secretpad front-end page. Why is a separate data-preparation step still needed?
Is kuscia deployed on k8s? With your current deployment method, how does secretpad interact with the k8s-deployed kuscia?
kuscia is deployed on k8s, and secretpad is deployed on k8s using an image built from source; both are in the same environment. Below is the secretpad configuration file:
server:
tomcat:
accesslog:
enabled: true
directory: /var/log/secretpad
servlet:
session:
timeout: 30m
http-port: 8080
http-port-inner: 9001
port: 443
ssl:
enabled: true
key-store: "file:./config/server.jks"
key-store-password: ${KEY_PASSWORD:secretpad}
key-alias: secretpad-server
key-password: ${KEY_PASSWORD:secretpad}
key-store-type: JKS
compression:
enabled: true
mime-types:
- application/javascript
- text/css
min-response-size: 1024
spring:
task:
scheduling:
pool:
size: 10
application:
name: secretpad
jpa:
database-platform: org.hibernate.community.dialect.SQLiteDialect
show-sql: false
properties:
hibernate:
format_sql: false
open-in-view: false
datasource:
driver-class-name: org.sqlite.JDBC
url: jdbc:sqlite:./db/secretpad.sqlite
hikari:
idle-timeout: 60000
maximum-pool-size: 1
connection-timeout: 6000
flyway:
baseline-on-migrate: true
locations:
- filesystem:./config/schema/center
#datasource used for mysql
#spring:
# task:
# scheduling:
# pool:
# size: 10
# application:
# name: secretpad
# jpa:
# database-platform: org.hibernate.dialect.MySQLDialect
# show-sql: false
# properties:
# hibernate:
# format_sql: false
# datasource:
# driver-class-name: com.mysql.cj.jdbc.Driver
# url: your mysql url
# username:
# password:
# hikari:
# idle-timeout: 60000
# maximum-pool-size: 10
# connection-timeout: 5000
jackson:
deserialization:
fail-on-missing-external-type-id-property: false
fail-on-ignored-properties: false
fail-on-unknown-properties: false
serialization:
fail-on-empty-beans: false
web:
locale: zh_CN # default locale, overridden by request "Accept-Language" header.
cache:
jcache:
config:
classpath:ehcache.xml
springdoc:
api-docs:
enabled: true
management:
endpoints:
web:
exposure:
include: health,info,readiness,prometheus
enabled-by-default: false
kusciaapi:
protocol: ${KUSCIA_PROTOCOL:notls}
kuscia:
nodes:
- domainId: kuscia-system
mode: master
host: ${KUSCIA_API_ADDRESS:kuscia-master.data-develop-operate-dev.svc.cluster.local}
port: ${KUSCIA_API_PORT:8083}
protocol: ${KUSCIA_PROTOCOL:notls}
cert-file: config/certs/client.crt
key-file: config/certs/client.pem
token: config/certs/token
- domainId: alice
mode: lite
host: ${KUSCIA_API_LITE_ALICE_ADDRESS:kuscia-lite-alice.data-develop-operate-dev.svc.cluster.local}
port: ${KUSCIA_API_PORT:8083}
protocol: ${KUSCIA_PROTOCOL:notls}
cert-file: config/certs/alice/client.crt
key-file: config/certs/alice/client.pem
token: config/certs/alice/token
- domainId: bob
mode: lite
host: ${KUSCIA_API_LITE_BOB_ADDRESS:kuscia-lite-bob.data-develop-operate-dev.svc.cluster.local}
port: ${KUSCIA_API_PORT:8083}
protocol: ${KUSCIA_PROTOCOL:notls}
cert-file: config/certs/bob/client.crt
key-file: config/certs/bob/client.pem
token: config/certs/bob/token
job:
max-parallelism: 1
secretpad:
logs:
path: ${SECRETPAD_LOG_PATH:../log}
deploy-mode: ${DEPLOY_MODE:ALL-IN-ONE} # MPC TEE ALL-IN-ONE
platform-type: CENTER
node-id: kuscia-system
center-platform-service: secretpad.master.svc
gateway: ${KUSCIA_GW_ADDRESS:127.0.0.1:80}
auth:
enabled: true
pad_name: ${SECRETPAD_USER_NAME}
pad_pwd: ${SECRETPAD_PASSWORD}
response:
extra-headers:
Content-Security-Policy: "base-uri 'self';frame-src 'self';worker-src blob: 'self' data:;object-src 'self';"
upload-file:
max-file-size: -1 # -1 means not limit, e.g. 200MB, 1GB
max-request-size: -1 # -1 means not limit, e.g. 200MB, 1GB
data:
dir-path: /app/data/
datasync:
center: true
p2p: false
version:
secretpad-image: ${SECRETPAD_IMAGE:0.5.0b0}
kuscia-image: ${KUSCIA_IMAGE:0.6.0b0}
secretflow-image: ${SECRETFLOW_IMAGE:1.4.0b0}
secretflow-serving-image: ${SECRETFLOW_SERVING_IMAGE:0.2.0b0}
tee-app-image: ${TEE_APP_IMAGE:0.1.0b0}
tee-dm-image: ${TEE_DM_IMAGE:0.1.0b0}
capsule-manager-sim-image: ${CAPSULE_MANAGER_SIM_IMAGE:0.1.2b0}
component:
hide:
- secretflow/io/read_data:0.0.1
- secretflow/io/write_data:0.0.1
- secretflow/io/identity:0.0.1
- secretflow/model/model_export:0.0.1
- secretflow/ml.train/slnn_train:0.0.1
- secretflow/ml.predict/slnn_predict:0.0.2
sfclusterDesc:
deviceConfig:
spu: "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
heu: "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
rayFedConfig:
crossSiloCommBackend: "brpc_link"
tee:
capsule-manager: capsule-manager.#.svc
data:
sync:
- org.secretflow.secretpad.persistence.entity.ProjectDO
- org.secretflow.secretpad.persistence.entity.ProjectNodeDO
- org.secretflow.secretpad.persistence.entity.NodeDO
- org.secretflow.secretpad.persistence.entity.NodeRouteDO
- org.secretflow.secretpad.persistence.entity.ProjectJobDO
- org.secretflow.secretpad.persistence.entity.ProjectTaskDO
- org.secretflow.secretpad.persistence.entity.ProjectDatatableDO
- org.secretflow.secretpad.persistence.entity.VoteRequestDO
- org.secretflow.secretpad.persistence.entity.VoteInviteDO
- org.secretflow.secretpad.persistence.entity.TeeDownLoadAuditConfigDO
- org.secretflow.secretpad.persistence.entity.NodeRouteApprovalConfigDO
- org.secretflow.secretpad.persistence.entity.TeeNodeDatatableManagementDO
- org.secretflow.secretpad.persistence.entity.ProjectModelServingDO
- org.secretflow.secretpad.persistence.entity.ProjectGraphNodeKusciaParamsDO
- org.secretflow.secretpad.persistence.entity.ProjectModelPackDO
- org.secretflow.secretpad.persistence.entity.FeatureTableDO
- org.secretflow.secretpad.persistence.entity.ProjectFeatureTableDO
- org.secretflow.secretpad.persistence.entity.ProjectGraphDomainDatasourceDO
inner-port:
path:
- /api/v1alpha1/vote_sync/create
- /api/v1alpha1/user/node/resetPassword
- /sync
- /api/v1alpha1/data/sync
# ip block config (None of them are allowed in the configured IP list)
ip:
block:
enable: true
list:
- 0.0.0.0/32
- 127.0.0.1/8
- 10.0.0.0/8
- 11.0.0.0/8
- 30.0.0.0/8
- 100.64.0.0/10
- 172.16.0.0/12
- 192.168.0.0/16
- 33.0.0.0/8
With the docker deployment, secretpad + kuscia share data by mounting the same data directory. A k8s deployment currently needs to mount the same volume in the same way to get that behavior; for k8s, an OSS data source is recommended.
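A minimal sketch of the shared-volume idea (the PVC name, size, and access mode are illustrative assumptions, not an official recipe; the two mount paths come from the secretpad config above and the job's storage_config):

# One ReadWriteMany PVC mounted by both the secretpad pod and the
# kuscia lite pod, so both see the same files. All names are assumed.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: secretpad-kuscia-shared-data
  namespace: data-develop-operate-dev
spec:
  accessModes:
    - ReadWriteMany          # both pods must mount the same claim
  resources:
    requests:
      storage: 10Gi
# In the two Deployments, reference the claim and mount it at each
# side's data directory:
#   volumes:
#     - name: shared-data
#       persistentVolumeClaim:
#         claimName: secretpad-kuscia-shared-data
#   secretpad container   -> mountPath: /app/data                      (dir-path above)
#   kuscia lite container -> mountPath: /home/kuscia/var/storage/data  (local_fs wd)

With both mounts pointing at the same claim, a file uploaded through secretpad lands exactly where the kuscia local_fs datasource reads it.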
OK. For a k8s deployment, mounting the same volume to get the same behavior: is there documentation for that approach?
Not at the moment; you can look through the k8s deployment docs.
The file does not exist. Where exactly should this file be placed?