Error when running a job on a P2P cluster deployed with Docker across multiple machines
Issue Type
Running
Search for existing issues similar to yours
Yes
OS Platform and Distribution
Linux Ubuntu 22.04
Kuscia Version
kuscia v0.11.0b0
Deployment
docker
deployment Version
docker 24.0.5
App Running type
secretflow
App Running version
secretflow-lite-anolis8:1.7.0b0
Configuration file used to run kuscia.
# alice-autonomy (default configuration, generated by following the documented procedure)
mode: autonomy
domainID: alice
domainKeyData: LS0tLS1CRUdJTiBSU0EgUFJJVkFURSBLRVktLS0tLQpNSUlFb2dJQkFBS0NBUUVBcWRSUkpRK1JkRUxRa2ppazBiVllhWWZ4T1l6a3I4WXJ4SE42NDhHNW9nNnlVSDFsClNzVm8vaGgwRmdxSmROZFB5Mk9qMXNGS1V4YmFqT1BFblUzaWNQV2w5RVZoTFJEMWJFajM4Z1gva1N6UUtQeC8KNkFaNFYwaXdaK2llRnA5d0xTMXBnMmxsOVZwK2pKK0pTNUkyOG9hNVQycW90SXVLVit6WkNFVnpTWFc0MHpaSwpicStmeGZUdzFTUEZpcWtRM052d2lhcGpmZDFpY1VqdFNMVW9FVm5KQ1ZmdGpHTlhpM0p5QngvdXAwTzZHTmRFCjd1VTU3OXFIeTJZbXZtNDUwUXRLNG40TEhjS2lMdzhBS2hEaWJMbGtVRWhpOVhUYXFMVWpNbVpIczVWM3pXOE4KUElpaGhkbGN5b1ZoVzZrdFNSRE10YU90UklvY3pHU21jdWlwNndJREFRQUJBb0lCQUhadFpVeUh4N0dnS2h2ZApQaW95NEgxdTIrdDY4Ym9WWWszekRYNG5pSkNXMlFmQitkR2pTZXp2Rm55TVNvQmM2UHIyOTdoNVA2QWpicklTCjN2ZW02VUpHT3J6VmFNZHBiUXRlOHZBbCtLcSs2a1c2bG1NeHA5ZU9DOTNaMit3QXNOUUFOL1Q0bWEzM3RnblAKOG9qdFpEM0pidzRQWGFmUkt0N1hmaDBEZVRwK3BvTllSSnF0UXlXZVQxU2g5cnMvUEZ6djVuUnYyOGU3VFcycApvTUhmb3BySFZJQ3p4YlprbnlsSmtqUEUydVloYXlTNE5oaUxhTXpzYlpZekwzUjA3VndrMnpIb1U4dVJEWHNxCjNmYWJ2OGpLWEJwcUFHNG9SS2ZXWS83blk1ejdDcGhHWklsSk1xOW5ZalQreXdMR3NIU2N1ZEcweUJlNU5rL1AKYXFjc1dra0NnWUVBd1JlZ2NRaWNwaXRiYlBPVnNnSC9zem43ajc5RTJYSjVqcFNIalFIWUhub2xlRzBqbDYxOApsUVcxbWhkazlLbjZkV3MzSFd5enp4V0VsME9IMlY2aENBczBqbCt6eFpOVHNGNTdIajVueDc1RWFMYWJRdlFxCnA2dXJOTldHS25rdnFkNWREdWlzbk93VkVYM01DSzlIbE5rT2dzY1hGNlc5NXdPOGgvWUhDdDBDZ1lFQTRTaUIKTDJKUGdkZCtEbWR3RnJoUWwrWFdQS2lCU2RvbFJJaWVYUkhpK21XT3Q4MmRaeURlc2kxb040c0tjVzFwaHNteApnT3hSa3d4YzBNV0cyZi9mSEFWekh1ZWlJRjV6UTErTHl3Tkl3K1hsUFhMZUcwVlJWTUtUQ0NtSXpIZWwxc25NCkVGVkFDUUxvc0tvbDRMMUVKdW5LL0xNaithVlBpTko5TlVGUVIyY0NnWUJWSTdiUndFdGFGYW9GVzA0NUpCcDgKQzJmNWxRdWxtWTB4cWhvdXVZNXl1Y2NGMTVHbkVvN3BJcEJWZGxWRWNDS0lYWkw2dlhCM01mUzV3Y1FIdTJyagpvaFUxWmN0ZHBiMXorZVR0aS9TMHBSZUMyR21UVnhmcndJMElDZEpUcmdXdkwrWDJhZSthYlpwSWtTQkRBQTVlCittb2tqZWFIdmNRRE5hbU9oWlBMWFFLQmdFdmc0SmhkWXpuNHEweWpZMHprMUpROEtwVEtuTGVNd3A1MEJCcU4KV3BiVC91TEdjbE04Nm8vVmFaZStUY2luL0xZbDVxSHlBaE95U04wNmxCV0hlMkx3R3puQkNnd3FpR0dlSTNoSgpKUTZQdlUrV0ZHL1FUblpvRkREZC9uSVpxRlBZTWVNWE43dFJ0YVZEMGZ3SkRKeW9rWFhUMFQzaWpna29GbllLCkNzbmxBb0dBVDJOZUJYRHNXTmRUelpmR2xyR
mJoRmtaNTBuaStjOThLb2JIV0JtZTBsZDdjMFNsdXhQTzN0YmgKaXhPUW1JTWc3YmVPZ0wwV1dIQkEzb1dWaWRRMkRvWEpkYUpHMDc1UUNrRE10MzZLWHdWZHVFTkZ2Qmg2WlY4egp1cjA2cmZWZ3pYR2lJamZkT29BS3BHQ2xudjIrTWdWbk1PMHFqeUcxYTJ6T1d2cVI3Um89Ci0tLS0tRU5EIFJTQSBQUklWQVRFIEtFWS0tLS0tCg==
logLevel: INFO
runtime: runc
runk:
  namespace: ""
  dnsServers: []
  kubeconfigFile: ""
capacity:
  cpu: ""
  memory: ""
  pods: ""
  storage: ""
reservedResources:
  cpu: ""
  memory: ""
image:
  pullPolicy: ""
  defaultRegistry: ""
  registries: []
datastoreEndpoint: ""
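The `domainKeyData` value in this configuration is a Base64-encoded PEM private key (its prefix decodes to `-----BEGIN RSA PRIVATE KEY-----`). As a minimal sketch for comparing the key between two installs without pasting it anywhere, the value can be decoded and fingerprinted locally. The helper name `describe_domain_key` and the sample value are hypothetical and not part of Kuscia.

```python
import base64
import hashlib


def describe_domain_key(domain_key_data: str) -> dict:
    """Decode a Base64 domainKeyData value and summarize it without exposing the key body."""
    # b64decode ignores characters outside the Base64 alphabet by default,
    # so a value that was wrapped across lines still decodes.
    pem = base64.b64decode(domain_key_data)
    lines = pem.decode("ascii").strip().splitlines()
    return {
        "pem_header": lines[0],
        "pem_footer": lines[-1],
        # The fingerprint changes whenever the key material changes.
        "sha256": hashlib.sha256(pem).hexdigest(),
    }


if __name__ == "__main__":
    # Hypothetical placeholder, NOT a real key: it only wraps the PEM markers.
    sample = base64.b64encode(
        b"-----BEGIN RSA PRIVATE KEY-----\nAAAA\n-----END RSA PRIVATE KEY-----\n"
    ).decode("ascii")
    info = describe_domain_key(sample)
    print(info["pem_header"])
    print(info["sha256"])
```

If the fingerprint differs between the configuration that created the stored data and the configuration currently in use, data encrypted under the earlier key would not be expected to decrypt.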
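What happened and What you expected to happen.
Documentation: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.11.0b0/deployment/Docker_deployment_kuscia/deploy_p2p_cn
I followed the document through to the last step (docker exec -it ${USER}-kuscia-autonomy-alice scripts/user/create_example_job.sh). When I checked the job status, the job had failed. I verified that network communication between the two machines works, and I also tried other Kuscia versions (v0.10.0b0 and v0.9.0b0).
In addition, the same steps work without any problem on two Ubuntu 18.04 machines.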
Kuscia log output.
state:
terminated:
containerID: containerd://c82832003fc68eb90814d57580589a95fa93107887116cb6fe1271ad7bc62075
exitCode: 1
finishedAt: "2024-08-27T01:54:37Z"
message: |+
WARNING:root:Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 547, in <module>
main()
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 502, in main
datasource = get_domain_data_source(datasource_stub, datasource_id)
File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/datamesh.py", line 115, in get_domain_data_source
raise RuntimeError(f"get_domain_data_source failed for {id}: ret = {ret}")
RuntimeError: get_domain_data_source failed for default-data-source: ret = status {
code: 12302
message: "decrypt data source info failed, crypto/rsa: decryption error"
}
reason: Error
startedAt: "2024-08-27T01:54:32Z"
hostIP: 172.18.0.2
phase: Failed
startTime: "2024-08-27T01:54:31Z"
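The useful part of the pod status above is the status block at the end of the traceback (code 12302, "decrypt data source info failed, crypto/rsa: decryption error"). As a rough sketch, that block can be pulled out of a saved copy of the termination message with a regular expression. The function name `extract_status` is an assumption made for illustration, and the pattern only covers the layout shown in this log.

```python
import re

# Matches the `code: <int>` line followed by the `message: "<text>"` line.
STATUS_RE = re.compile(r'code:\s*(\d+)\s+message:\s*"([^"]*)"')


def extract_status(message: str):
    """Return (code, message) from a termination message, or None if absent."""
    match = STATUS_RE.search(message)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


# Excerpt copied from the log output in this issue.
SAMPLE = """RuntimeError: get_domain_data_source failed for default-data-source: ret = status {
code: 12302
message: "decrypt data source info failed, crypto/rsa: decryption error"
}
"""

if __name__ == "__main__":
    print(extract_status(SAMPLE))
```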
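Hello, please check your configuration against the troubleshooting steps in the documentation. If the problem persists, please provide the container logs from both parties. https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.11.0b0/troubleshoot/run_job_failed
Hello, here are the logs from the alice container: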
2024-08-27T09:54:36.054900253+08:00 stderr F WARNING:root:Since the GPL-licensed package unidecode is not installed, using Python's unicodedata package which yields worse results.
2024-08-27T09:54:37.123971248+08:00 stderr F Traceback (most recent call last):
2024-08-27T09:54:37.124006465+08:00 stderr F File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2024-08-27T09:54:37.124329442+08:00 stderr F return _run_code(code, main_globals, None,
2024-08-27T09:54:37.124343848+08:00 stderr F File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
2024-08-27T09:54:37.124549198+08:00 stderr F exec(code, run_globals)
2024-08-27T09:54:37.124582165+08:00 stderr F File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 547, in
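Since it works on other machines, leftover cached data may be the cause. Try deleting the data under /${USER}/kuscia (you can inspect it with ls /${USER}/kuscia) and then reinstalling.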
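One plausible reading of this suggestion, offered as an assumption rather than a confirmed diagnosis: if data left over from an earlier install was encrypted under a previous domain key, the key generated by the new install cannot decrypt it. The toy below uses textbook RSA with tiny hard-coded numbers (no padding, not secure, not Kuscia code) only to show that a ciphertext made under one key pair does not decrypt correctly under another.

```python
def rsa_encrypt(message: int, e: int, n: int) -> int:
    """Textbook RSA encryption: message^e mod n."""
    return pow(message, e, n)


def rsa_decrypt(ciphertext: int, d: int, n: int) -> int:
    """Textbook RSA decryption: ciphertext^d mod n."""
    return pow(ciphertext, d, n)


# Two unrelated toy key pairs (illustrative values only).
OLD_KEY = {"n": 3233, "e": 17, "d": 2753}  # p=61, q=53
NEW_KEY = {"n": 3127, "e": 3, "d": 2011}   # p=53, q=59

if __name__ == "__main__":
    secret = 65
    # Data written by the "old install" under the old public key.
    stored = rsa_encrypt(secret, OLD_KEY["e"], OLD_KEY["n"])

    # The matching private key recovers the original value.
    print(rsa_decrypt(stored, OLD_KEY["d"], OLD_KEY["n"]) == secret)

    # A different private key does not.
    print(rsa_decrypt(stored, NEW_KEY["d"], NEW_KEY["n"]) == secret)
```

With real padded RSA the mismatch is detected during the padding check and reported as an error instead of returning a wrong value; Go's `crypto/rsa` package reports it as `crypto/rsa: decryption error`, the message seen in the log.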