Job initialization port conflict retry

Open JosephKang opened this issue 3 years ago • 3 comments

Organization Name: Advantech

Short summary about the issue/question: A port conflict error occurs during the job initialization phase, and the job retries indefinitely.

Brief what process you are following:

  1. Create a task with many port definitions. Resources are sufficient.
  2. The job enters a status loop: waiting -> running -> error -> waiting -> running -> error -> ...
  3. Only init.log and barrier.log are observed while the job is in the running or error status.

How to reproduce it: The trigger condition is unclear. It happens randomly.

OpenPAI Environment:

  • OpenPAI version:
    v1.0.1
  • Cloud provider or hardware configuration: Physical Machines
  • OS (e.g. from /etc/os-release): ubuntu 16.04
  • Kernel (e.g. uname -a): 4.4.0-186-generic
  • Hardware (e.g. core number, memory size, storage size, GPU type etc.): 48Core/256GB MEM/1.8TB DISK/GeForce RTX 2080 Ti
  • Others: a. CPU-only job with the k8s default scheduler (not the hived scheduler)

Anything else we need to know: Log files: port_conflict_1_init.log port_conflict_2_init.log

JosephKang avatar Jul 16 '21 01:07 JosephKang

@JosephKang We already fixed this issue through PR: https://github.com/microsoft/pai/issues/4384. Please upgrade PAI to the latest version.

Binyang2014 avatar Jul 16 '21 02:07 Binyang2014

We deployed v1.0.1 at some customer sites that are unable to upgrade immediately.

May I know how to get a hotfix patch for v1.0.1?

JosephKang avatar Jul 20 '21 01:07 JosephKang

This fix involves many components, such as openpai-runtime and rest-server, so it's not easy to provide a hotfix. Since the port numbers are randomly chosen, restarting the job may solve this issue.

Binyang2014 avatar Jul 21 '21 01:07 Binyang2014
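
For readers who want to see why randomly chosen ports can collide and why a retry (or a job restart) may land on free ports, here is a minimal Python sketch. It is not OpenPAI's code and not the fix referenced above. The helper names `port_is_free` and `pick_free_ports`, the port range 20000-40000, and the attempt limit are all hypothetical, chosen only for illustration.

```python
import random
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use": another process holds it.
            return False
        return True


def pick_free_ports(count, low=20000, high=40000, max_attempts=1000, rng=None):
    """Pick `count` distinct random ports in [low, high] that are free right now.

    Hypothetical helper: the range and attempt limit are assumptions,
    not values taken from OpenPAI.
    """
    rng = rng or random.Random()
    chosen = []
    attempts = 0
    while len(chosen) < count and attempts < max_attempts:
        attempts += 1
        port = rng.randint(low, high)
        # Skip duplicates and ports that are already bound by someone else.
        if port not in chosen and port_is_free(port):
            chosen.append(port)
    if len(chosen) < count:
        raise RuntimeError(f"could not find {count} free ports in {max_attempts} attempts")
    return chosen


if __name__ == "__main__":
    print(pick_free_ports(5))
```

Note that a check like this is inherently racy: a port that is free at check time can be taken before the job actually binds it, which is one reason a job that requests many ports is more likely to hit a conflict, and why a bounded retry is preferable to an unbounded one.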