
Installation Issue List

Open Starmys opened this issue 5 years ago • 4 comments

  1. Add / Remove nodes

  2. Installation Enhancement

  3. 5100: Installation script refinement

    • dev-box can be inside master node
    • P3 all in one deployment: single node cluster support
      • allow master node to be worker at the same time;
    • uninstallation doc :
    • namespace 'pai-storage' already exists: if quick-start-service.sh fails partway, the namespace may already have been created, so a re-run reports it as existing.

Starmys avatar Feb 25 '21 08:02 Starmys

Issue when swapping the worker and master roles (delete and re-deploy): the etcd configs conflict.

Starmys avatar Feb 25 '21 10:02 Starmys

Plan

  1. docker dev box

    1. add into dockerfile
      # basic tools (refresh the package index first; -y keeps the image build non-interactive)
      apt-get update
      apt-get install -y software-properties-common sudo
      # python3.6
      sudo add-apt-repository -y ppa:deadsnakes/ppa
      sudo apt-get update
      sudo apt-get install -y python3.6
      sudo rm /usr/bin/python3
      sudo ln -s /usr/bin/python3.6 /usr/bin/python3
      sudo pip3 install setuptools
      # ansible, etc.
      # (copy the bottom half of pai/contrib/kubespray/script/environment.sh)
      
    2. docker run ... -v ~/pai:/root/pai -v ~/pai-deploy:/root/pai-deploy -v ~/.ssh:/root/.ssh
  2. integrate the steps of add / remove nodes into ./paictl scale

    1. before ./paictl scale: modify layout.yaml and services-configuration.yaml manually
    2. preparation
      1. check if layout.yaml conflicts with services-configuration.yaml
      2. compare the input layout.yaml with /cluster_configuration/layout.yaml to fetch the list of nodes to add / remove
      3. ./paictl config push -p /updated-config
      4. modify kubespray
        1. remove-node.yml: set gather_facts: yes
        2. roles/remove-node/post-remove/tasks/main.yml: remove run_once: true
    3. add nodes
      # check docker daemon config
      requirement.sh --limit ...
      # raise a notice to change docker daemon config and reload docker daemon manually if necessary
      # add node to k8s cluster
      cd /root/pai-deploy/kubespray
      ansible-playbook -i inventory/pai/hosts.yml scale.yml --become --become-user=root -e "@inventory/gcrv100/openpai.yml" --limit=nodelist
      # update config and restart service
      cd /root/pai/
      ./paictl service stop -n cluster-configuration rest-server hivedscheduler job-exporter
      ./paictl service start -n cluster-configuration rest-server hivedscheduler job-exporter
      
    4. remove nodes
      # remove node from k8s cluster
      cd /root/pai-deploy/kubespray
      ansible-playbook -i inventory/pai/hosts.yml remove-node.yml --become --become-user=root -e "@inventory/gcrv100/openpai.yml" --limit=nodelist
      # update config and restart service
      cd /root/pai/
      ./paictl service stop -n cluster-configuration rest-server hivedscheduler job-exporter
      ./paictl service start -n cluster-configuration rest-server hivedscheduler job-exporter
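
The preparation and add / remove steps above could be sketched in Python roughly as follows. This is only an illustration, not how paictl implements it: the helper names (`hostnames`, `diff_nodes`, `playbook_command`, `plan_scale`) are hypothetical, the layout is assumed to be already parsed into a dict with a `machine-list` of entries carrying a `hostname` key, and the inventory paths are simply copied from the commands above.

```python
import shlex


def hostnames(layout):
    """Return the set of hostnames in a parsed layout.

    Assumption: the layout has a 'machine-list' whose entries have a 'hostname' key.
    """
    return {machine["hostname"] for machine in layout.get("machine-list", [])}


def diff_nodes(current, desired):
    """Compare the layout stored in the cluster with the edited one.

    Returns (nodes_to_add, nodes_to_remove), each as a sorted list.
    """
    old, new = hostnames(current), hostnames(desired)
    return sorted(new - old), sorted(old - new)


def playbook_command(playbook, nodes,
                     inventory="inventory/pai/hosts.yml",
                     extra_vars="inventory/gcrv100/openpai.yml"):
    """Build the ansible-playbook argument list for the given nodes."""
    if not nodes:
        # Without --limit the playbook would target every host in the inventory.
        raise ValueError("empty node list: refusing to build a command without --limit")
    return [
        "ansible-playbook", "-i", inventory, playbook,
        "--become", "--become-user=root",
        "-e", "@" + extra_vars,
        "--limit=" + ",".join(nodes),
    ]


def plan_scale(current, desired):
    """Return the playbook commands needed to move from current to desired."""
    to_add, to_remove = diff_nodes(current, desired)
    commands = []
    if to_add:
        commands.append(playbook_command("scale.yml", to_add))
    if to_remove:
        commands.append(playbook_command("remove-node.yml", to_remove))
    return commands


if __name__ == "__main__":
    # Hypothetical layouts, already loaded from YAML into dicts.
    current = {"machine-list": [{"hostname": "master"}, {"hostname": "worker-a"}]}
    desired = {"machine-list": [{"hostname": "master"}, {"hostname": "worker-b"}]}
    for argv in plan_scale(current, desired):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

The commands are only built, not executed, so the plan can be reviewed before anything runs in /root/pai-deploy/kubespray. Refusing an empty node list is a deliberate choice here, since a playbook run without `--limit` would apply to the whole inventory.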
      

Starmys avatar Mar 01 '21 09:03 Starmys

related https://github.com/microsoft/pai/issues/2558

fanyangCS avatar Mar 11 '21 11:03 fanyangCS

related https://github.com/microsoft/pai/issues/4521

fanyangCS avatar Mar 11 '21 11:03 fanyangCS