pai
Job initialization port conflict retry
Organization Name: Advantech
Short summary about the issue/question: A port conflict error occurs during the job initialization phase, and the job retries indefinitely.
Brief what process you are following:
- Create a task with many port definitions. The cluster has enough resources.
- The job enters a status loop: waiting -> running -> error -> waiting -> running -> error -> ...
- Only init.log and barrier.log are observed when jobs stay at running or error status
How to reproduce it: The trigger condition is unknown. It happens randomly.
OpenPAI Environment:
- OpenPAI version: v1.0.1
- Cloud provider or hardware configuration: Physical Machines
- OS (e.g. from /etc/os-release): ubuntu 16.04
- Kernel (e.g. uname -a): 4.4.0-186-generic
- Hardware (e.g. core number, memory size, storage size, GPU type etc.): 48Core/256GB MEM/1.8TB DISK/GeForce RTX 2080 Ti
- Others: a. CPU-only job with the k8s default scheduler (not the hived scheduler)
Anything else we need to know: Log files: port_conflict_1_init.log port_conflict_2_init.log
@JosephKang We already fixed this issue through PR: https://github.com/microsoft/pai/issues/4384. Please upgrade PAI to the latest version.
We deployed v1.0.1 at some customer sites that are unable to upgrade immediately.
Is there a way to get a hotfix patch for v1.0.1?
This fix involves many components, such as openpai-runtime and rest-server, so it is not easy to provide a hotfix. Since the port numbers are chosen randomly, restarting the job may resolve the issue.
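
For anyone who wants to see why a restart can help, here is a minimal Python sketch of random port selection with a bounded number of retries. It is not the OpenPAI implementation: `port_is_free`, `pick_free_ports`, the 20000-40000 range, and the `max_attempts` limit are all hypothetical names and values chosen for illustration. It only shows that each new random draw has a fresh chance of avoiding ports already in use on the node, and that capping the attempts avoids an endless loop.

```python
import random
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another process already holds the port.
            return False
        return True


def pick_free_ports(count, low=20000, high=40000, max_attempts=10, rng=None):
    """Hypothetical helper: draw `count` distinct random ports in [low, high).

    Redraws the whole set if any port is occupied, and gives up after
    `max_attempts` draws instead of retrying forever.
    """
    rng = rng or random.Random()
    for _ in range(max_attempts):
        candidates = rng.sample(range(low, high), count)
        if all(port_is_free(p) for p in candidates):
            return candidates
    raise RuntimeError(
        f"no conflict-free set of {count} ports after {max_attempts} attempts"
    )


if __name__ == "__main__":
    # Assumed example: a task that declares five ports.
    print(pick_free_ports(5))
```

Note that the check and the later real bind are separate steps, so another process can still take a port in between. A sketch like this narrows the window for a conflict but does not close it.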