
A question about training hetero_secure_boost on the 1.7.2 cluster version

Open Sword-CS opened this issue 3 years ago • 0 comments

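Describe the bug: On a 1.7.2 cluster, I tried to train hetero_secure_boost with the following four data configurations (sample count and feature count for each party):

  1. guest: 150,000 samples, 2,000 features; host: 150,000 samples, 2,000 features
  2. guest: 150,000 samples, 3,000 features; host: 150,000 samples, 3,000 features
  3. guest: 250,000 samples, 1,000 features; host: 250,000 samples, 4,000 features
  4. guest: 250,000 samples, 10 features; host: 250,000 samples, 5,000 features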

All of the preceding components (data reading, transformation, intersection) succeeded, but the model training step always fails. A fate_flow_detect.log file is generated under the logs folder, as shown in the attached screenshot. It reports that a subprocess of the model training task does not exist, which caused the training task to be killed. The details are as follows:

[INFO] [2022-08-09 15:21:50,924] [202208091448139457690] [20:140055652448000] - [detector.detect_running_task] [line:57]: task 202208091448139457690_hetero_secure_boost_0 1 on guest 9999 with running process 25545 does not exist

[INFO] [2022-08-09 15:21:53,932] [202208091448139457690] [20:140055652448000] - [detector.detect_running_task] [line:65]: task 202208091448139457690_hetero_secure_boost_0 1 on guest 9999 party status has changed to failed, may be stopped by task_controller.stop_task, pass stop job again
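For anyone triaging a similar failure, the detector message can be picked apart mechanically. The following is a minimal sketch, assuming the line layout is exactly as quoted in this issue; the regular expression and the field names (`task_id`, `task_version`, `role`, `party_id`, `pid`) are my own labels for illustration, not identifiers taken from the FATE source.

```python
import re
from typing import NamedTuple, Optional

# Pattern modelled only on the detector line quoted in this issue.
# The group names are hypothetical labels, not FATE's own identifiers.
MISSING_PROCESS = re.compile(
    r"task (?P<task_id>\S+) (?P<task_version>\d+) on (?P<role>\w+) "
    r"(?P<party_id>\d+) with running process (?P<pid>\d+) does not exist"
)


class MissingProcess(NamedTuple):
    task_id: str
    task_version: int
    role: str
    party_id: int
    pid: int


def parse_missing_process(line: str) -> Optional[MissingProcess]:
    """Return the parsed fields, or None if the line is not a match."""
    m = MISSING_PROCESS.search(line)
    if m is None:
        return None
    return MissingProcess(
        task_id=m["task_id"],
        task_version=int(m["task_version"]),
        role=m["role"],
        party_id=int(m["party_id"]),
        pid=int(m["pid"]),
    )


# The first log line from this issue, with the line-wrap spaces removed.
sample = (
    "[INFO] [2022-08-09 15:21:50,924] [202208091448139457690] "
    "[20:140055652448000] - [detector.detect_running_task] [line:57]: "
    "task 202208091448139457690_hetero_secure_boost_0 1 on guest 9999 "
    "with running process 25545 does not exist"
)

if __name__ == "__main__":
    print(parse_missing_process(sample))
```

The extracted PID and the timestamp on the line can then be used to look for the same process in other logs on the guest machine.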

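Because of this, the job always fails, but I do not know what causes the model training subprocess to disappear. Before this, I trained the model with guest and host data of 250,000 samples × 10 features and 250,000 samples × 3,000 features respectively, and also with 150,000 samples × 1,000 features on both sides, and those runs completed without any problem.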
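To make the six runs easier to compare, here is a small sketch that tabulates the raw table sizes (rows × features) per party. It assumes "W" means 万 (ten thousand) and counts only input cells; it says nothing about encryption, histograms, or other intermediate data, and it is not a diagnosis. The run labels are made up for this example.

```python
WAN = 10_000  # "W" in the issue is read as 10,000 (assumption stated above)

# (label, outcome, guest_rows_in_W, guest_features, host_rows_in_W, host_features)
# Labels are hypothetical; the numbers are the ones quoted in the issue.
RUNS = [
    ("config 1", "failed", 15, 2000, 15, 2000),
    ("config 2", "failed", 15, 3000, 15, 3000),
    ("config 3", "failed", 25, 1000, 25, 4000),
    ("config 4", "failed", 25, 10, 25, 5000),
    ("earlier run A", "succeeded", 25, 10, 25, 3000),
    ("earlier run B", "succeeded", 15, 1000, 15, 1000),
]


def cells(rows_in_w: int, features: int) -> int:
    """Number of raw input cells: rows times features."""
    return rows_in_w * WAN * features


def summarize(runs):
    """Return (label, outcome, guest_cells, host_cells) for each run."""
    return [
        (label, outcome, cells(g_rows, g_feat), cells(h_rows, h_feat))
        for label, outcome, g_rows, g_feat, h_rows, h_feat in runs
    ]


if __name__ == "__main__":
    for label, outcome, guest_cells, host_cells in summarize(RUNS):
        print(f"{label:14s} {outcome:10s} "
              f"guest={guest_cells:>13,d} host={host_cells:>13,d}")
```

By this count alone, the successful 250,000 × 3,000 host table is larger than the host table in the failing 150,000 × 2,000 run, so raw input size by itself does not separate the failing runs from the successful ones.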

In summary, whenever I increase the number of features when running hetero_secure_boost, the job fails because the subprocess no longer exists. What could be the cause?

Desktop (please complete the following information):

  • OS: ubuntu18.04

Sword-CS, Aug 10 '22 08:08