FATE
A question about training hetero_secure_boost on a 1.7.2 cluster deployment
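Describe the bug
On a 1.7.2 cluster, I trained hetero_secure_boost with the following four configurations:
- guest: 150,000 samples, 2,000 features; host: 150,000 samples, 2,000 features
- guest: 150,000 samples, 3,000 features; host: 150,000 samples, 3,000 features
- guest: 250,000 samples, 1,000 features; host: 250,000 samples, 4,000 features
- guest: 250,000 samples, 10 features; host: 250,000 samples, 5,000 features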
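To put the four configurations side by side with the two that trained successfully (described further down), here is a rough sizing sketch. It takes the sample counts as 150,000 and 250,000 rows and assumes, purely for illustration, a dense table of 8-byte floats; FATE's actual in-memory representation and the histogram and encryption overhead of hetero_secure_boost are not modeled. The names `BYTES_PER_VALUE`, `CONFIGS`, and `dense_gib` are hypothetical.

```python
from typing import Dict, Tuple

# Assumption: every feature value is stored as a dense 8-byte float.
BYTES_PER_VALUE = 8

Shape = Tuple[int, int]  # (samples, features)

# (guest shape, host shape) for each run mentioned in this report.
CONFIGS: Dict[str, Tuple[Shape, Shape]] = {
    "failed 1": ((150_000, 2_000), (150_000, 2_000)),
    "failed 2": ((150_000, 3_000), (150_000, 3_000)),
    "failed 3": ((250_000, 1_000), (250_000, 4_000)),
    "failed 4": ((250_000, 10), (250_000, 5_000)),
    "succeeded 1": ((250_000, 10), (250_000, 3_000)),
    "succeeded 2": ((150_000, 1_000), (150_000, 1_000)),
}


def dense_gib(samples: int, features: int) -> float:
    """Size in GiB of a dense samples x features table of 8-byte values."""
    return samples * features * BYTES_PER_VALUE / 2**30


if __name__ == "__main__":
    for name, (guest, host) in CONFIGS.items():
        print(
            f"{name}: guest {dense_gib(*guest):.2f} GiB, "
            f"host {dense_gib(*host):.2f} GiB"
        )
```

These figures only describe the raw input scale on each side; they do not by themselves establish why the training process disappears.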
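The preceding components (data reading, transformation, and intersection) all succeeded, but model training always fails.
A fate_flow_detect.log file is generated under the logs folder.
It reports that a subprocess of the training task does not exist, which causes the training task to be killed. The details are as follows: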
[INFO] [2022-08-09 15:21:50,924] [202208091448139457690] [20:140055652448000] - [detector.detect_running_task] [line:57]: task 202208091448139457690_hetero_secure_boost_0 1 on guest 9999 with running process 25545 does not exist
[INFO] [2022-08-09 15:21:53,932] [202208091448139457690] [20:140055652448000] - [detector.detect_running_task] [line:65]: task 202208091448139457690_hetero_secure_boost_0 1 on guest 9999 party status has changed to failed, may be stopped by task_controller.stop_task, pass stop job again
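For anyone triaging similar logs, the task ID, party, and process ID can be pulled out of the "does not exist" message with a small script. This is a minimal sketch that matches only the message format shown above; the regular expression, the `MissingProcess` type, and the reading of the number after the task ID as a task version are my assumptions, not something taken from the FATE source.

```python
import re
from typing import NamedTuple, Optional


class MissingProcess(NamedTuple):
    task_id: str
    task_version: int  # assumption: the number following the task ID
    role: str
    party_id: str
    pid: int


# Matches the detector message exactly as it appears in the log excerpt above.
PATTERN = re.compile(
    r"task (\S+) (\d+) on (\w+) (\d+) with running process (\d+) does not exist"
)


def parse_missing_process(line: str) -> Optional[MissingProcess]:
    """Return the fields of a 'process does not exist' line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    task_id, version, role, party_id, pid = match.groups()
    return MissingProcess(task_id, int(version), role, party_id, int(pid))


SAMPLE = (
    "[INFO] [2022-08-09 15:21:50,924] [202208091448139457690] "
    "[20:140055652448000] - [detector.detect_running_task] [line:57]: "
    "task 202208091448139457690_hetero_secure_boost_0 1 on guest 9999 "
    "with running process 25545 does not exist"
)

if __name__ == "__main__":
    print(parse_missing_process(SAMPLE))
```

The extracted process ID and timestamp can then be compared against system-level logs on the guest machine to see whether the process was terminated from outside FATE.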
Because of this, the job always fails, but I do not know what causes the training subprocess to disappear. Before this, I trained models with guest and host data of 250,000 × 10 and 250,000 × 3,000, and of 150,000 × 1,000 and 150,000 × 1,000, and those runs completed without problems.
In summary, when I increase the number of features for hetero_secure_boost, the job always fails because the subprocess does not exist. What could be the cause?
Desktop:
- OS: Ubuntu 18.04