Installation Issue List
- Add / Remove nodes
  - 5267: Add manual for adding node to existing PAI cluster
    - track potential issues for clusters with multiple master/etcd nodes
  - 5239: Two tasks failed when removing nodes
    - in `remove-node.yml`, set `gather_facts: yes`
  - 5167: Add / Remove Nodes with `layout.yaml`
    - use `./paictl.py node add -n a b` and `./paictl.py node remove -n a b`
- Installation Enhancement
  - 5231: Issues of installation from scratch
    - add AAD, pylon and marketplace related services in quick_start script
    - `hived-computing-device-envs` in the `service-configurations` config file has changed from arrays to string
  - 4802: Install openpai on existing k8s cluster
  - 5152: Enhancement in quick-start installation and upgrade 3rd components
  - 4680: Kubespray quick start enhancement
  - 5100: Installation script refinement
    - dev-box can be inside master node
    - P3 all in one deployment: single node cluster support
    - allow master node to be worker at the same time
    - uninstallation doc:
    - ns 'pai-storage' already exists: if quick-start-service.sh fails, the ns may have already been created.

Issue when exchanging worker and master (delete and re-deploy): etcd config conflicts
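
For the `hived-computing-device-envs` change noted under 5231, a configuration written in the old array form has to be converted to a single string. The snippet below is only a sketch of that conversion: the comma separator and the `ENV_A` / `ENV_B` names are assumptions for illustration, not values confirmed by the issues listed here, so check the current config template before relying on them.

```python
from typing import List, Union


def envs_to_string(value: Union[str, List[str]], sep: str = ",") -> str:
    """Accept the old array form or the new string form; return one string.

    `sep` is an assumed separator, not taken from the OpenPAI docs.
    """
    if isinstance(value, str):
        # Already in the new form; only trim surrounding whitespace.
        return value.strip()
    # Old form: join the array entries into a single string.
    return sep.join(item.strip() for item in value)


# Hypothetical environment variable names, for illustration only.
old_style = ["ENV_A", "ENV_B"]
new_style = "ENV_A,ENV_B"

print(envs_to_string(old_style))
print(envs_to_string(new_style))
```

Accepting both forms keeps the helper safe to run on a file that has already been migrated.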

Plan
- docker dev box
  - add into dockerfile

        # basic tools (refresh package lists before installing)
        apt-get update
        apt-get install -y software-properties-common
        apt-get install -y sudo
        # python3.6
        sudo add-apt-repository -y ppa:deadsnakes/ppa
        sudo apt update
        sudo apt install -y python3.6
        sudo rm /usr/bin/python3
        sudo ln -s /usr/bin/python3.6 /usr/bin/python3
        sudo pip3 install setuptools
        # ansible, etc. The bottom half of pai/contrib/kubespray/script/environment.sh

  - `docker run ... -v ~/pai:/root/pai -v ~/pai-deploy:/root/pai-deploy -v ~/.ssh:/root/.ssh`
- integrate the steps of add / remove nodes into `./paictl scale`
  - before `./paictl scale`: modify `layout.yaml` and `services-configuration.yaml` manually
  - preparation
    - check if `layout.yaml` conflicts with `services-configuration.yaml`
    - compare the input `layout.yaml` with `/cluster_configuration/layout.yml` to fetch the nodelist to add / remove
    - `./paictl config push -p /updated-config`
    - modify kubespray
      - `remove-node.yml`: `gather_facts: yes`
      - `roles/remove-node/post-remove/tasks/main.yml`: remove `run_once: true`
  - add nodes

        # check docker daemon config
        requirement.sh --limit ...
        # raise a notice to change docker daemon config and reload docker daemon manually if necessary

        # add node to k8s cluster
        cd /root/pai-deploy/kubespray
        ansible-playbook -i inventory/pai/hosts.yml scale.yml --become --become-user=root -e "@inventory/gcrv100/openpai.yml" --limit=nodelist

        # update config and restart service
        cd /root/pai/
        ./paictl service stop -n cluster-configuration rest-server hivedscheduler job-exporter
        ./paictl service start -n cluster-configuration rest-server hivedscheduler job-exporter

  - remove nodes

        # remove node from k8s cluster
        cd /root/pai-deploy/kubespray
        ansible-playbook -i inventory/pai/hosts.yml remove-node.yml --become --become-user=root -e "@inventory/gcrv100/openpai.yml" --limit=nodelist

        # update config and restart service
        cd /root/pai/
        ./paictl service stop -n cluster-configuration rest-server hivedscheduler job-exporter
        ./paictl service start -n cluster-configuration rest-server hivedscheduler job-exporter
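
The preparation step that compares the input layout with the deployed one, and the `--limit=nodelist` argument used by both playbooks, could be sketched roughly as follows. This is not the `paictl` implementation: `diff_nodes` and `playbook_command` are hypothetical helpers, and the sketch assumes the layout has already been parsed into a mapping with a `machine-list` of entries that each carry a `hostname`. The inventory paths are copied from the commands in the plan.

```python
from typing import Dict, List, Tuple

Machine = Dict[str, str]
Layout = Dict[str, List[Machine]]


def hostnames(layout: Layout) -> List[str]:
    """Hostnames in file order (assumes a 'machine-list' of {'hostname': ...})."""
    return [m["hostname"] for m in layout.get("machine-list", [])]


def diff_nodes(current: Layout, desired: Layout) -> Tuple[List[str], List[str]]:
    """Return (to_add, to_remove) by comparing deployed and input layouts."""
    cur, new = hostnames(current), hostnames(desired)
    cur_set, new_set = set(cur), set(new)
    to_add = [h for h in new if h not in cur_set]
    to_remove = [h for h in cur if h not in new_set]
    return to_add, to_remove


def playbook_command(playbook: str, nodes: List[str]) -> List[str]:
    """Build the ansible-playbook argv from the plan for the given node list."""
    if not nodes:
        raise ValueError("node list is empty; nothing to scale")
    return [
        "ansible-playbook",
        "-i", "inventory/pai/hosts.yml",
        playbook,
        "--become", "--become-user=root",
        "-e", "@inventory/gcrv100/openpai.yml",
        # Ansible accepts a comma-separated host pattern for --limit.
        "--limit=" + ",".join(nodes),
    ]


# Hypothetical layouts, for illustration only.
current = {"machine-list": [{"hostname": "master-0"},
                            {"hostname": "worker-a"},
                            {"hostname": "worker-b"}]}
desired = {"machine-list": [{"hostname": "master-0"},
                            {"hostname": "worker-b"},
                            {"hostname": "worker-c"}]}

to_add, to_remove = diff_nodes(current, desired)
print(to_add)      # ['worker-c']
print(to_remove)   # ['worker-a']
print(" ".join(playbook_command("scale.yml", to_add)))
print(" ".join(playbook_command("remove-node.yml", to_remove)))
```

Returning the command as an argument list rather than a shell string lets a caller pass it to `subprocess.run` without quoting issues. Refusing an empty node list avoids running a playbook with an empty `--limit` pattern.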
related https://github.com/microsoft/pai/issues/2558
related https://github.com/microsoft/pai/issues/4521