[v1.6.0] Nodes added after installation do not use the local Docker cache by default.
Organization Name:
Short summary about the issue/question: Nodes added after installation do not use the local Docker cache by default.
Brief description of the process you are following:
- install v1.6.0 on master and work-01 (`/etc/docker/daemon.json` on master must be `{}`);
- add node work-02;
- open `/etc/docker/daemon.json` on each node.
master:
{
  "insecure-registries": ["http://master_ip:30500"],
  "registry-mirrors": ["http://master_ip:30500"]
}
work-01:
{
  "default-runtime": "nvidia",
  "registry-mirrors": ["http://master_ip:30500"],
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "insecure-registries": ["http://master_ip:30500"]
}
work-02:
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
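Comparing the three files above, work-02's daemon.json lacks the `registry-mirrors` and `insecure-registries` entries that point to the master's cache registry on port 30500. A minimal sketch of what the missing merge step would look like (the helper name `add_cache_registry` is hypothetical, and `master_ip` is a placeholder as in the configs above):

```python
import json

# Hypothetical helper: merge the docker-cache registry entries into an
# existing daemon.json dict without clobbering unrelated keys such as
# "default-runtime" or "runtimes".
def add_cache_registry(daemon_cfg, master_ip, port=30500):
    registry = f"http://{master_ip}:{port}"
    merged = dict(daemon_cfg)
    for key in ("registry-mirrors", "insecure-registries"):
        entries = list(merged.get(key, []))
        if registry not in entries:  # idempotent: don't duplicate entries
            entries.append(registry)
        merged[key] = entries
    return merged

# work-02's current config, missing the cache registry entries
work02_cfg = {
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {"path": "/usr/bin/nvidia-container-runtime",
                   "runtimeArgs": []}
    },
}
print(json.dumps(add_cache_registry(work02_cfg, "master_ip"), indent=2))
```

In the real deployment this merge is what the docker-cache-config-distribute.yml playbook performs on each node; the sketch only illustrates the delta.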
How to reproduce it:
OpenPAI Environment:
- OpenPAI version: v1.6.0
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release): ubuntu 16.04
- Kernel (e.g. uname -a):
- Hardware (e.g. core number, memory size, storage size, GPU type etc.):
- Others:
Anything else we need to know:
When work-02's daemon.json is modified to match work-01's, all nodes keep outputting logs similar to the following:
● docker.service - Docker Application Container Engine
Loaded: loaded (/etc/systemd/system/docker.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/docker.service.d
└─docker-dns.conf, docker-options.conf
Active: active (running) since Tue 2021-04-27 11:54:04 CST; 1 day 7h ago
Docs: http://docs.docker.com
Main PID: 42962 (dockerd)
Tasks: 0
Memory: 49.2M
CPU: 4.687s
CGroup: /system.slice/docker.service
‣ 42962 /usr/bin/dockerd --data-root=/mnt/docker --log-opt max-size=2g --log-opt max-file=2 --log-driver=json-file --iptables=false --data-root=/mnt/docker --log-opt max-size=2g --log-opt max-file=2 --log-driver=json-file --dns 10.192.0.3 --dns 210.34.48.59 --dns 218.85.157.99 --dns-search default.svc.cluster.local --dns-search svc.cluster.local --dns-opt ndots:2 --dns-opt timeout:2 --dns-opt attempts:2
Apr 28 19:16:48 csip-090 dockerd[42962]: time="2021-04-28T19:16:48.748255760+08:00" level=warning msg="failed to retrieve /usr/bin/nvidia-container-runtime version: unknown output format: runc version 1.0.0-rc93\ncommit: 12644e614e25b05da6fd08a38ffa0cfe1903fdec\nspec: 1.0.2-dev\ngo: go1.13.15\nlibseccomp: 2.5.1\n"
(the same warning repeats about every 30 seconds)
All jobs remain in the Waiting state and never transition to Running.
@SwordFaith @debuggy Help to take a look?
Temporary solution:
Run the following commands before adding the node:
cd ~/pai-deploy/kubespray/
ansible-playbook -i ${HOME}/pai-deploy/cluster-cfg/hosts.yml <pai-code-dir>/contrib/kubespray/docker-cache-config-distribute.yml --limit=new-worker-node
Where is the bug:
If enable_docker_cache=True is set in the config file, add_docker_cache_config.py should be called by change_node.py, but it is not.
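The intended control flow can be sketched as follows. This is only an illustration of the condition and ordering, not the actual change_node.py code; the function name `playbooks_for_new_node` is hypothetical:

```python
# Hypothetical sketch: decide which playbooks change_node.py should run
# when a new worker node is added.
def playbooks_for_new_node(cluster_cfg):
    plays = []
    # When the docker cache is enabled, the cache config must be
    # distributed to the new node BEFORE kubespray's scale.yml runs,
    # otherwise the node comes up without the registry-mirror entries.
    if cluster_cfg.get("enable_docker_cache"):
        plays.append("docker-cache-config-distribute.yml")
    plays.append("scale.yml")
    return plays
```

Today the behavior corresponds to the `enable_docker_cache=False` branch regardless of the config, which is the bug being reported.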
So the behavior you expected is that after change_node runs, the docker cache config should be distributed to the new node automatically?
Thank you for your reply. Yes, when enable_docker_cache=True, this is the desired behavior.
@Binyang2014 It seems that when the enable_docker_cache config was added, change_node.py was not updated to use it. It would be more straightforward for users if the docker cache config were synced to a new node whenever change_node is triggered. Please organize this bug fix into the dev plan.
Add DRI for this week: @hzy46
Thanks, @Binyang2014. There is a related issue, maybe we can consider them together.
https://github.com/microsoft/pai/blob/e5aef0d344f40d1e057f12578c566eb6995896fe/deployment/paiLibrary/paiOrchestration/change_node.py
We need to update this file so that the following command is called before scale.yml when adding nodes:
ansible-playbook -i ${HOME}/pai-deploy/cluster-cfg/hosts.yml <pai-code-dir>/contrib/kubespray/docker-cache-config-distribute.yml --limit=new-worker-node -e "@${CLUSTER_CONFIG}"
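One way change_node.py could build that command is sketched below. The function name `docker_cache_cmd` is hypothetical, and `pai_code_dir` / `cluster_config` stand in for the `<pai-code-dir>` and `${CLUSTER_CONFIG}` placeholders above; only the argument construction is shown, not the actual subprocess invocation:

```python
import os

# Hypothetical sketch: build the ansible-playbook argv that should run
# before scale.yml when adding nodes.
def docker_cache_cmd(pai_code_dir, cluster_config, limit="new-worker-node"):
    hosts = os.path.join(os.environ.get("HOME", "~"),
                         "pai-deploy/cluster-cfg/hosts.yml")
    playbook = os.path.join(
        pai_code_dir, "contrib/kubespray/docker-cache-config-distribute.yml")
    return ["ansible-playbook", "-i", hosts, playbook,
            f"--limit={limit}", "-e", f"@{cluster_config}"]
```

The `-e "@file"` form passes the cluster config as an extra-vars file, matching the command quoted above.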
@SwordFaith This appears to be the root cause of this issue. If it is convenient for you, please update it. Thank you.