Failure of job submit

Open Jason-wwww opened this issue 3 years ago • 12 comments

**What deployment mode are you using?**

  1. docker-compose;

**What KubeFATE and FATE version are you using?**

1.9.0 (both KubeFATE and FATE)

**What OS are you using for docker-compose or Kubernetes? Please also state the OS version.**

  • OS: ubuntu

To Reproduce

Refer to https://github.com/FederatedAI/KubeFATE/blob/master/docker-deploy/README.md, then run:

flow job submit -d fateflow/examples/lr/test_hetero_lr_job_dsl.json -c fateflow/examples/lr/test_hetero_lr_job_conf.json

What happened? Job submission fails with the following error:

{ "jobId": "202210210745344412160", "retcode": 103, "retmsg": "Traceback (most recent call last):\n File \"/data/projects/fate/fateflow/python/fate_flow/scheduler/dag_scheduler.py\", line 142, in submit\n raise Exception(\"create job failed\", response)\nException: ('create job failed', {'guest': {9999: {'data': {'job_id': '202210210745344412160'}, 'retcode': 103, 'retmsg': 'max cores per job is 4 base on (fate_flow/settings#MAX_CORES_PERCENT_PER_JOB * conf/service_conf.yaml#nodes * conf/service_conf.yaml#cores_per_node), expect 8 cores, please use task_cores job parameters to set request task cores or you can customize it with eggroll_run job parameters, default value is fate_flow/settings.py#DEFAULT_TASK_CORES_PER_NODE, refer fate_flow/examples/simple/simple_job_conf.json'}}, 'host': {10000: {'data': {'job_id': '202210210745344412160'}, 'retcode': 103, 'retmsg': 'max cores per job is 4 base on (fate_flow/settings#MAX_CORES_PERCENT_PER_JOB * conf/service_conf.yaml#nodes * conf/service_conf.yaml#cores_per_node), expect 8 cores, please use task_cores job parameters to set request task cores or you can customize it with eggroll_run job parameters, default value is fate_flow/settings.py#DEFAULT_TASK_CORES_PER_NODE, refer fate_flow/examples/simple/simple_job_conf.json'}}, 'arbiter': {10000: {'data': {'components': {'data_transform_0': {'need_run': False}, 'evaluation_0': {'need_run': True}, 'hetero_feature_binning_0': {'need_run': False}, 'hetero_feature_selection_0': {'need_run': False}, 'hetero_lr_0': {'need_run': True}, 'intersection_0': {'need_run': False}, 'reader_0': {'need_run': False}}}, 'retcode': 0, 'retmsg': 'success'}}})\n" }

Jason-wwww avatar Oct 21 '22 07:10 Jason-wwww

The job needs more resources than FATE currently has: the error reports a limit of 4 cores per job, while the job expects 8. You need to allocate more resources to FATE. You can change compute_core=4 in parties.conf to a larger value to give the FATE cluster more resources.
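The limit in the error message is a simple product, so it can be checked before submitting. The Python sketch below reproduces that arithmetic. The formula comes from the error text; the function names and the example values (a ratio of 1.0, one node, four cores per node) are assumptions chosen only because they yield the reported maximum of 4.

```python
def max_cores_per_job(max_cores_percent_per_job: float, nodes: int, cores_per_node: int) -> int:
    """Upper bound on cores per job, following the formula quoted in the error message."""
    return int(max_cores_percent_per_job * nodes * cores_per_node)


def fits(requested_cores: int, limit: int) -> bool:
    """True if the job's core request stays within the limit."""
    return requested_cores <= limit


# Assumed values that would produce the reported limit of 4.
limit = max_cores_per_job(1.0, nodes=1, cores_per_node=4)

# The job in this issue expects 8 cores.
print(limit, fits(8, limit))
```

With these assumed values the 8-core request does not fit. Raising compute_core, or lowering the request through the task_cores job parameter named in the error message, would both bring the two numbers in line.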

owlet42 avatar Oct 21 '22 08:10 owlet42

After modifying compute_core on the host, do I need to run docker-compose restart?

Jason-wwww avatar Oct 21 '22 09:10 Jason-wwww

You need to clean up everything and re-deploy.

JingChen23 avatar Oct 24 '22 01:10 JingChen23

Thanks, it works after re-deploying and re-submitting the job. While checking the status with flow task query -r guest -j 202111230933232084530 | grep -w f_status, I got a failure: [image] How can I get logs for this failure, and how can I solve it?
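Instead of grepping for f_status, the JSON printed by flow task query could be filtered directly. This is a minimal Python sketch. It assumes the response carries a list of task records under "data", each with the f_status field seen in the grep above and an f_component_name field. That second field name and the sample records are assumptions for illustration, not output from this job.

```python
import json


def failed_tasks(response_text: str) -> list:
    """Return the component names of all tasks whose status is 'failed'."""
    response = json.loads(response_text)
    tasks = response.get("data") or []
    return [
        task.get("f_component_name", "<unknown>")
        for task in tasks
        if str(task.get("f_status", "")).lower() == "failed"
    ]


# Hypothetical sample shaped like the assumed response.
sample = json.dumps({
    "data": [
        {"f_component_name": "reader_0", "f_status": "success"},
        {"f_component_name": "hetero_lr_0", "f_status": "failed"},
    ],
    "retcode": 0,
})
print(failed_tasks(sample))
```

Knowing which component failed narrows down which log files are worth reading first.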

Jason-wwww avatar Oct 24 '22 06:10 Jason-wwww

This means that one of the tasks of your job failed.

You need to run "docker exec -it <your fateflow container id> bash".

Then check the log directory in the fateflow directory.

There should be a directory named after the job ID. Pay attention to the error logs.
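Once inside the container, the error lines can be collected from the job's log directory in one pass. The Python sketch below shows one way to do that. It assumes the log files end in .log and that error lines contain the word ERROR; the helper name is hypothetical.

```python
from pathlib import Path


def error_lines(log_dir: str, keyword: str = "ERROR") -> list:
    """Collect (file, line number, text) for every log line containing the keyword."""
    hits = []
    # Walk the job's log directory recursively; the *.log suffix is an assumption.
    for path in sorted(Path(log_dir).rglob("*.log")):
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if keyword in line:
                    hits.append((str(path), number, line.rstrip()))
    return hits


# Example, run inside the fateflow container with the job's log directory:
#   for file_name, number, text in error_lines("<log directory>/<job id>"):
#       print(f"{file_name}:{number}: {text}")
```

Sorting the paths keeps the output stable between runs, which makes it easier to compare results after a re-submit.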

JingChen23 avatar Oct 24 '22 07:10 JingChen23

Hi, I can't find the log directory under the fateflow folder. There is only an examples folder under /data/projects/fate/fateflow.

Jason-wwww avatar Oct 24 '22 07:10 Jason-wwww

This is the output of job submit: { "data": { "board_url": "http://fateboard:8080/index.html#/dashboard?job_id=202210240645313003060&role=guest&party_id=9999", "code": 0, "dsl_path": "/data/projects/fate/fateflow/jobs/202210240645313003060/job_dsl.json", "job_id": "202210240645313003060", "logs_directory": "/data/projects/fate/fateflow/logs/202210240645313003060", "message": "success", "model_info": { "model_id": "arbiter-10000#guest-9999#host-10000#model", "model_version": "202210240645313003060" }, "pipeline_dsl_path": "/data/projects/fate/fateflow/jobs/202210240645313003060/pipeline_dsl.json", "runtime_conf_on_party_path": "/data/projects/fate/fateflow/jobs/202210240645313003060/guest/9999/job_runtime_on_party_conf.json", "runtime_conf_path": "/data/projects/fate/fateflow/jobs/202210240645313003060/job_runtime_conf.json", "train_runtime_conf_path": "/data/projects/fate/fateflow/jobs/202210240645313003060/train_runtime_conf.json" }, "jobId": "202210240645313003060", "retcode": 0, "retmsg": "success" }

I can see "logs_directory": "/data/projects/fate/fateflow/logs/202210240645313003060" in the output, but I can't find that directory.

Jason-wwww avatar Oct 24 '22 07:10 Jason-wwww

docker exec -it "your fateflow container id" bash

The log is inside the container.
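For a non-interactive check, the same docker exec call can list the job's log directory without opening a shell. A minimal Python sketch follows. The container name "my-fateflow" is a placeholder, the helper name is hypothetical, and the path follows the logs_directory value printed by job submit.

```python
import shlex


def list_logs_command(container: str, job_id: str,
                      logs_root: str = "/data/projects/fate/fateflow/logs") -> list:
    """Build the argv for `docker exec <container> ls -l <logs_root>/<job_id>`."""
    return ["docker", "exec", container, "ls", "-l", f"{logs_root}/{job_id}"]


# "my-fateflow" is a placeholder; use the ID or name shown by `docker ps`.
command = list_logs_command("my-fateflow", "202210240645313003060")
print(shlex.join(command))

# To actually run it: import subprocess; subprocess.run(command, check=True)
```

If the listing fails with "No such file or directory", the directory is missing inside the container as well, which is worth reporting along with the container name.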

JingChen23 avatar Oct 24 '22 08:10 JingChen23

Yes, I found the log files in the container, but I can't find the directory I mentioned above.

Jason-wwww avatar Oct 26 '22 05:10 Jason-wwww

Is the directory "/data/projects/fate/fateflow/logs/202210240645313003060" not in the fateflow container?

JingChen23 avatar Oct 28 '22 03:10 JingChen23

I am getting the same error as described at the start of this issue, except that I deployed KubeFATE through Kubernetes. The KubeFATE version is the same. Please help me out; I have been stuck for a long time.

Mansi2487 avatar Apr 18 '23 02:04 Mansi2487

@Mansi2487 Give spark-worker or nodemanager more resources:

  # resources:
    # requests:
      # cpu: "2"
      # memory: "4Gi"
    # limits:
      # cpu: "4"
      # memory: "8Gi"

owlet42 avatar Apr 18 '23 05:04 owlet42