
Improve version updates for Docker image building

Open vfdev-5 opened this issue 4 years ago • 10 comments

  • https://github.com/pytorch/ignite/pull/1878#issuecomment-808041919

cc @trsvchn @ydcjeff

vfdev-5 avatar Mar 26 '21 08:03 vfdev-5

@vfdev-5 any idea how we can import YAML inside YAML? I found out that the CircleCI config is just vanilla YAML, so we can't do that natively.

I have found two options so far:

  • use pyyaml to generate config.yml, just like the pytorch and domain repos do
  • use the CircleCI CLI to generate a valid config.yml with the circleci config pack command
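
For option 1, here's a rough, untested sketch of what a generator script could look like (assuming PyYAML is installed; the fragment layout mirrors the directory structure shown below and the names are purely illustrative):

```python
# generate_config.py -- assemble .circleci/config.yml from YAML fragments
from pathlib import Path

import yaml  # PyYAML


def load_fragments(directory: Path) -> dict:
    """Merge every *.yml file in `directory` into a single mapping."""
    merged = {}
    for fragment in sorted(directory.glob("*.yml")):
        merged.update(yaml.safe_load(fragment.read_text()))
    return merged


def generate(src: Path) -> str:
    """Combine src/config.yml with the commands/executors/jobs fragments."""
    config = yaml.safe_load((src / "config.yml").read_text())
    for section in ("commands", "executors", "jobs"):
        section_dir = src / section
        if section_dir.is_dir():
            config[section] = load_fragments(section_dir)
    return yaml.safe_dump(config, sort_keys=False)


if __name__ == "__main__":
    src = Path(".circleci/src")
    Path(".circleci/config.yml").write_text(generate(src))
```

This is roughly what circleci config pack does for you, minus the CLI dependency.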

I haven't tried option 1, but I have tried option 2. Directory structure:

.circleci/src
├── commands
│   ├── install_dependencies.yml
│   ├── install_latest_nvidia.yml
│   ├── pull_pytorch_stable_devel_image.yml
│   ├── pull_pytorch_stable_image.yml
│   ├── run_pytorch_container.yml
│   └── run_pytorch_devel_container.yml
├── config.yml
├── executors
│   ├── one_gpu.yml
│   ├── one_gpu_windows.yml
│   └── two_gpus.yml
└── jobs
    ├── build_publish_docker_images.yml
    ├── one_gpu_tests.yml
    ├── one_gpu_windows_tests.yml
    ├── two_gpus_check_dist_cifar10_example.yml
    ├── two_gpus_hvd_tests.yml
    └── two_gpus_tests.yml

3 directories, 16 files

The folder names map to the top-level keys we defined in .circleci/config.yml (jobs contains the jobs we will run, and the same goes for commands and executors). What I don't like is that this creates many files, each containing only a small amount of configuration. But what do you think? @vfdev-5 @trsvchn

Here's what's inside the above .circleci/src/config.yml:

version: 2.1

parameters:
  pytorch_stable_image:
    type: string
    # https://hub.docker.com/r/pytorch/pytorch/tags
    default: "pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime"
  pytorch_stable_image_devel:
    type: string
    # https://hub.docker.com/r/pytorch/pytorch/tags
    default: "pytorch/pytorch:1.8.1-cuda11.1-cudnn8-devel"
  workingdir:
    type: string
    default: "/tmp/ignite"
  should_build_docker_images:
    type: boolean
    default: false
  should_publish_docker_images:
    type: boolean
    default: false
  build_docker_image_pytorch_version:
    type: string
    default: "1.8.1-cuda11.1-cudnn8"
  build_docker_image_hvd_version:
    type: string
    default: "v0.21.3"
  build_docker_image_msdp_version:
    type: string
    default: "v0.3.10"

workflows:
  version: 2
  gpu_tests:
    unless: << pipeline.parameters.should_build_docker_images >>
    jobs:
      - one_gpu_tests
      - one_gpu_windows_tests
      - two_gpus_tests
      - two_gpus_check_dist_cifar10_example
      - two_gpus_hvd_tests
  docker_images:
    when: << pipeline.parameters.should_build_docker_images >>
    jobs:
      - build_publish_docker_images

Here's the output of option 2 (circleci config pack):

commands:
  install_dependencies:
    steps:
      - run:
          command: |
            docker exec -it pthd pip install -r requirements-dev.txt
            export install_apex_cmd='pip install -v --disable-pip-version-check --no-cache-dir git+https://github.com/NVIDIA/apex'
            export install_git_apex_cmd="apt-get update && apt-get install -y --no-install-recommends git && ${install_apex_cmd}"
            docker exec -it pthd /bin/bash -c "$install_git_apex_cmd"
            export install_ignite_cmd='python setup.py install'
            docker exec -it pthd /bin/bash -c "$install_ignite_cmd"
          name: Install dependencies
  install_latest_nvidia:
    steps:
      - run:
          command: |
            sudo apt-get purge nvidia* && sudo apt-get autoremove
            sudo apt-get update && sudo apt-get install -y --no-install-recommends nvidia-455 cuda-drivers-455
            # Install nvidia-container-runtime
            sudo apt-get install -y nvidia-container-runtime
            # Reload driver : https://stackoverflow.com/a/45319156/6309199
            # lsof | grep nvidia -> kill Xvfb
            sudo lsof | grep "/usr/bin/Xvfb" | head -1 | awk '{print $2}' | xargs -I {} sudo kill -9 {}
            # lsmod | grep nvidia
            sudo rmmod nvidia_uvm && sudo rmmod nvidia_drm && sudo rmmod nvidia_modeset && sudo rmmod nvidia
            # reload driver
            nvidia-smi
          name: Install latest NVidia-driver and CUDA
  pull_pytorch_stable_devel_image:
    steps:
      - run:
          command: |
            docker pull << pipeline.parameters.pytorch_stable_image_devel >>
          name: Pull PyTorch Stable Develop Image
  pull_pytorch_stable_image:
    steps:
      - run:
          command: |
            docker pull << pipeline.parameters.pytorch_stable_image >>
          name: Pull PyTorch Stable Image
  run_pytorch_container:
    steps:
      - run:
          command: |
            docker run --gpus=all --rm -itd --shm-size 16G -v ${wd}:/ignite -w /ignite --name pthd << pipeline.parameters.pytorch_stable_image >>
            docker exec -it pthd nvidia-smi
            docker exec -it pthd ls
          environment:
            wd: << pipeline.parameters.workingdir >>
          name: Start Pytorch container
  run_pytorch_devel_container:
    steps:
      - run:
          command: |
            docker run --gpus=all --rm -itd --shm-size 16G -v ${wd}:/ignite -w /ignite --name pthd << pipeline.parameters.pytorch_stable_image_devel >>
            docker exec -it pthd nvidia-smi
            docker exec -it pthd ls
          environment:
            wd: << pipeline.parameters.workingdir >>
          name: Start Pytorch dev container
executors:
  one_gpu:
    machine:
      docker_layer_caching: true
      image: ubuntu-1604-cuda-11.1:202012-01
    resource_class: gpu.small
  one_gpu_windows:
    machine:
      image: windows-server-2019-nvidia:stable
      resource_class: windows.gpu.nvidia.medium
      shell: bash.exe
  two_gpus:
    machine:
      docker_layer_caching: true
      image: ubuntu-1604-cuda-11.1:202012-01
    resource_class: gpu.medium
jobs:
  build_publish_docker_images:
    docker:
      - image: cimg/python:3.8.8
    resource_class: 2xlarge
    steps:
      - checkout
      - setup_remote_docker:
          docker_layer_caching: true
          version: 19.03.14
      - run:
          command: |
            pip --version
            pip install docker
          name: Install deps
      - run:
          command: |
            cd docker
            export PTH_VERSION=<< pipeline.parameters.build_docker_image_pytorch_version >>
            export HVD_VERSION=<< pipeline.parameters.build_docker_image_hvd_version >>
            bash build.sh hvd hvd-base
            bash build.sh hvd hvd-vision
            bash build.sh hvd hvd-nlp
            bash build.sh hvd hvd-apex
            bash build.sh hvd hvd-apex-vision
            bash build.sh hvd hvd-apex-nlp
          name: Build all Horovod flavoured PyTorch-Ignite images
      - run:
          command: |
            cd docker
            export PTH_VERSION=<< pipeline.parameters.build_docker_image_pytorch_version >>
            bash build.sh main base
            bash build.sh main vision
            bash build.sh main nlp
            bash build.sh main apex
            bash build.sh main apex-vision
            bash build.sh main apex-nlp
          name: Build all PyTorch-Ignite images
      - run:
          command: |
            cd docker
            export PTH_VERSION=<< pipeline.parameters.build_docker_image_pytorch_version >>
            export MSDP_VERSION=<< pipeline.parameters.build_docker_image_msdp_version >>
            bash build.sh msdp msdp-apex
            bash build.sh msdp msdp-apex-vision
            bash build.sh msdp msdp-apex-nlp
          name: Build all MS DeepSpeed flavoured PyTorch-Ignite images
      - run:
          command: docker images | grep pytorchignite
          name: List built images
      - when:
          condition: << pipeline.parameters.should_publish_docker_images >>
          steps:
            - run:
                command: |
                  cd docker
                  sh ./push_all.sh
                name: Push all PyTorch-Ignite Docker images
    working_directory: << pipeline.parameters.workingdir >>
  one_gpu_tests:
    executor: one_gpu
    steps:
      - checkout
      - run:
          command: |
            bash .circleci/trigger_if_modified.sh "^(ignite|tests|examples|\.circleci).*"
          name: Trigger job if modified
      - pull_pytorch_stable_image
      - run_pytorch_container
      - install_dependencies
      - run:
          command: |4

                    # pytest on cuda
                    export test_cmd='bash tests/run_gpu_tests.sh'
                    docker exec -it pthd /bin/bash -c "${test_cmd}"

                    # MNIST tests

                    # 0) download MNIST
                    # https://github.com/pytorch/ignite/issues/1737
                    export raw_mnist_dir='./MNIST/raw'
                    export download_mnist_cmd="git clone https://github.com/pytorch-ignite/download-mnist-github-action.git $raw_mnist_dir"
                    docker exec -it pthd /bin/bash -c "$download_mnist_cmd"
                    export mnist0_cmd="CUDA_VISIBLE_DEVICES=0 python $raw_mnist_dir/run.py ."
                    docker exec -it pthd /bin/bash -c "$mnist0_cmd"

                    # 1) mnist.py
                    export minst1_cmd='CUDA_VISIBLE_DEVICES=0 python examples/mnist/mnist.py --epochs=1'
                    docker exec -it pthd /bin/bash -c "$minst1_cmd"

                    # 2) mnist_with_visdom.py
                    export visdom_script_cmd='python -c "from visdom.server import download_scripts; download_scripts()"'
                    export visdom_cmd='python -m visdom.server'
                    docker exec -d pthd /bin/bash -c "$visdom_script_cmd && $visdom_cmd"
                    export sleep_cmd='sleep 10'
                    export mnist2_cmd='python examples/mnist/mnist_with_visdom.py --epochs=1'
                    docker exec -it pthd /bin/bash -c "$sleep_cmd && $mnist2_cmd"

                    # 3.1) mnist_with_tensorboard.py with tbX
                    export mnist3_cmd='CUDA_VISIBLE_DEVICES=0 python examples/mnist/mnist_with_tensorboard.py --epochs=1'
                    docker exec -it pthd /bin/bash -c "$mnist3_cmd"

                    # uninstall tensorboardX
                    export pip_cmd='pip uninstall -y tensorboardX'
                    docker exec -it pthd /bin/bash -c "$pip_cmd"

                    # 3.2) mnist_with_tensorboard.py with native torch tb
                    docker exec -it pthd /bin/bash -c "$mnist3_cmd"

                    # 4) mnist_save_resume_engine.py
                    # save
                    export mnist4_cmd='CUDA_VISIBLE_DEVICES=0 python examples/mnist/mnist_save_resume_engine.py --epochs=2 --crash_iteration 1100'
                    docker exec -it pthd /bin/bash -c "$mnist4_cmd"
                    # resume
                    export mnist4_cmd='CUDA_VISIBLE_DEVICES=0 python examples/mnist/mnist_save_resume_engine.py --epochs=2 --resume_from=/tmp/mnist_save_resume/checkpoint_1.pt'
                    docker exec -it pthd /bin/bash -c "$mnist4_cmd"
          name: Run GPU Unit Tests and Examples
      - run:
          command: |
            bash <(curl -s https://codecov.io/bash) -Z -F gpu
          name: Codecov upload
    working_directory: << pipeline.parameters.workingdir >>
  one_gpu_windows_tests:
    executor: one_gpu_windows
    steps:
      - checkout
      - run:
          command: |
            bash .circleci/trigger_if_modified.sh "^(ignite|tests|examples|\.circleci).*"
          name: Trigger job if modified
      - run:
          command: |
            conda --version
            conda install -y pytorch torchvision cudatoolkit=11.1 -c pytorch -c conda-forge
            pip install -r requirements-dev.txt
            pip install .
          name: Install dependencies
      - run:
          command: |
            # pytest on cuda
            SKIP_DISTRIB_TESTS=1 bash tests/run_gpu_tests.sh
          name: Run GPU Unit Tests
    working_directory: << pipeline.parameters.workingdir >>
  two_gpus_check_dist_cifar10_example:
    executor: two_gpus
    steps:
      - checkout
      - run:
          command: |
            bash .circleci/trigger_if_modified.sh "^(ignite|tests|examples|\.circleci).*"
          name: Trigger job if modified
      - pull_pytorch_stable_image
      - run_pytorch_container
      - install_dependencies
      - run:
          command: |
            docker exec -it pthd pip install fire
          name: Install additional example dependencies
      - run:
          command: |
            export example_path="examples/contrib/cifar10"
            # initial run
            export stop_cmd="--stop_iteration=500"
            export test_cmd="CI=1 python ${example_path}/main.py run --checkpoint_every=200"
            docker exec -it pthd /bin/bash -c "${test_cmd} ${stop_cmd}"
            # resume
            export resume_opt="--resume-from=/tmp/output-cifar10/resnet18_backend-None-1_stop-on-500/training_checkpoint_400.pt"
            docker exec -it pthd /bin/bash -c "${test_cmd} --num_epochs=7 ${resume_opt}"
          name: Run without backend
      - run:
          command: |
            export example_path="examples/contrib/cifar10"
            # initial run
            export stop_cmd="--stop_iteration=500"
            export test_cmd="CI=1 python -u -m torch.distributed.launch --nproc_per_node=2 --use_env ${example_path}/main.py run --backend=nccl --checkpoint_every=200"
            docker exec -it pthd /bin/bash -c "${test_cmd} ${stop_cmd}"
            # resume
            export resume_opt="--resume-from=/tmp/output-cifar10/resnet18_backend-nccl-2_stop-on-500/training_checkpoint_400.pt"
            docker exec -it pthd /bin/bash -c "${test_cmd} --num_epochs=7 ${resume_opt}"
          name: Run with NCCL backend using torch dist launch
      - run:
          command: |
            export example_path="examples/contrib/cifar10"
            # initial run
            export stop_cmd="--stop_iteration=500"
            export test_cmd="CI=1 python -u ${example_path}/main.py run --backend=nccl --nproc_per_node=2 --checkpoint_every=200"
            docker exec -it pthd /bin/bash -c "${test_cmd} ${stop_cmd}"
            # resume
            export resume_opt="--resume-from=/tmp/output-cifar10/resnet18_backend-nccl-2_stop-on-500/training_checkpoint_400.pt"
            docker exec -it pthd /bin/bash -c "${test_cmd} --num_epochs=7 ${resume_opt}"
          name: Run with NCCL backend using spawn
    working_directory: << pipeline.parameters.workingdir >>
  two_gpus_hvd_tests:
    executor: two_gpus
    steps:
      - checkout
      - run:
          command: |
            bash .circleci/trigger_if_modified.sh "^(ignite|tests|examples|\.circleci).*"
          name: Trigger job if modified
      - pull_pytorch_stable_devel_image
      - run_pytorch_devel_container
      - install_dependencies
      - run:
          command: |4

                    # Following https://github.com/horovod/horovod/blob/master/Dockerfile.test.gpu
                    # and https://github.com/horovod/horovod/issues/1944#issuecomment-628192778
                    docker exec -it pthd /bin/bash -c "apt-get update && apt-get install -y git"
                    docker exec -it pthd /bin/bash -c "git clone --recursive https://github.com/horovod/horovod.git /horovod && cd /horovod && python setup.py sdist"
                    docker exec -it pthd /bin/bash -c "conda install -y cmake nccl=2.8 -c conda-forge"
                    docker exec -it pthd /bin/bash -c 'cd /horovod && HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_LINK=SHARED HOROVOD_WITHOUT_MPI=1 HOROVOD_WITH_PYTORCH=1 pip install -v $(ls /horovod/dist/horovod-*.tar.gz) && ldconfig'
                    docker exec -it pthd horovodrun --check-build
          name: Install Horovod with NCCL GPU ops
      - run:
          command: |
            export test_cmd='bash tests/run_gpu_tests.sh'
            docker exec -it pthd /bin/bash -c "${test_cmd}"
            # no CUDA devices Horovod tests
            export test_cmd='CUDA_VISIBLE_DEVICES="" pytest --cov ignite --cov-append --cov-report term-missing --cov-report xml -vvv tests/ -m distributed'
            docker exec -it pthd /bin/bash -c "${test_cmd}"
          name: Run 1 Node 2 GPUs Unit Tests
      - run:
          command: |
            bash <(curl -s https://codecov.io/bash) -Z -F gpu-2-hvd
          name: Codecov upload
      - run:
          command: |
            docker exec -it pthd pip install fire
            export example_path="examples/contrib/cifar10"
            # initial run
            export stop_cmd="--stop_iteration=500"
            export test_cmd="cd ${example_path} && CI=1 horovodrun -np 2 python -u main.py run --backend=horovod --checkpoint_every=200"
            docker exec -it pthd /bin/bash -c "${test_cmd} ${stop_cmd}"
            # resume
            export resume_opt="--resume-from=/tmp/output-cifar10/resnet18_backend-horovod-2_stop-on-500/training_checkpoint_400.pt"
            docker exec -it pthd /bin/bash -c "${test_cmd} --num_epochs=7 ${resume_opt}"
          name: Check CIFAR10 using horovodrun
      - run:
          command: |
            export example_path="examples/contrib/cifar10"
            # initial run
            export stop_cmd="--stop_iteration=500"
            export test_cmd="cd ${example_path} && CI=1 python -u main.py run --backend=horovod --nproc_per_node=2 --checkpoint_every=200"
            docker exec -it pthd /bin/bash -c "${test_cmd} ${stop_cmd}"
            # resume
            export resume_opt="--resume-from=/tmp/output-cifar10/resnet18_backend-horovod-2_stop-on-500/training_checkpoint_400.pt"
            docker exec -it pthd /bin/bash -c "${test_cmd} --num_epochs=7 ${resume_opt}"
          name: Check CIFAR10 using spawn
    working_directory: << pipeline.parameters.workingdir >>
  two_gpus_tests:
    executor: two_gpus
    steps:
      - checkout
      - run:
          command: |
            bash .circleci/trigger_if_modified.sh "^(ignite|tests|examples|\.circleci).*"
          name: Trigger job if modified
      - pull_pytorch_stable_image
      - run_pytorch_container
      - install_dependencies
      - run:
          command: |
            export test_cmd='bash tests/run_gpu_tests.sh 2'
            docker exec -it pthd /bin/bash -c "${test_cmd}"
          name: Run 1 Node 2 GPUs Unit Tests
      - run:
          command: |
            bash <(curl -s https://codecov.io/bash) -Z -F gpu-2
          name: Codecov upload
    working_directory: << pipeline.parameters.workingdir >>
parameters:
  build_docker_image_hvd_version:
    default: v0.21.3
    type: string
  build_docker_image_msdp_version:
    default: v0.3.10
    type: string
  build_docker_image_pytorch_version:
    default: 1.8.1-cuda11.1-cudnn8
    type: string
  pytorch_stable_image:
    default: pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime
    type: string
  pytorch_stable_image_devel:
    default: pytorch/pytorch:1.8.1-cuda11.1-cudnn8-devel
    type: string
  should_build_docker_images:
    default: false
    type: boolean
  should_publish_docker_images:
    default: false
    type: boolean
  workingdir:
    default: /tmp/ignite
    type: string
version: 2.1
workflows:
  docker_images:
    jobs:
      - build_publish_docker_images
    when: << pipeline.parameters.should_build_docker_images >>
  gpu_tests:
    jobs:
      - one_gpu_tests
      - one_gpu_windows_tests
      - two_gpus_tests
      - two_gpus_check_dist_cifar10_example
      - two_gpus_hvd_tests
    unless: << pipeline.parameters.should_build_docker_images >>
  version: 2

ydcjeff avatar Mar 26 '21 10:03 ydcjeff

any idea how we can import YAML inside YAML?

I think only GitLab has that feature

GitLab has a similar feature: the include keyword pulls in external workflow templates, and in addition the extends keyword can be used to share small bits of YAML within the same file.

trsvchn avatar Mar 26 '21 10:03 trsvchn

@ydcjeff thanks for providing these options! Yes, there are pros and cons to all of those approaches. Maybe a third approach is to read the docker values with something like:

python -c "import yaml; f=open('.circleci/config.yml'); d=yaml.safe_load(f); print(d['parameters']['build_docker_image_pytorch_version']['default'])"

vfdev-5 avatar Mar 26 '21 12:03 vfdev-5

@trsvchn or @ydcjeff would you like to take on this issue? I'd like to build new Docker images this week.

EDIT: Probably we can do that manually for now, until the issue is solved.

vfdev-5 avatar Mar 30 '21 09:03 vfdev-5

@vfdev-5 I have another idea, but I have zero experience with CircleCI: can we do something like this?

Simply use a Makefile with the versions defined:

# Makefile

BUILD_DOCKER_IMAGE_PYTORCH_VERSION = 1.8.1-cuda11.1-cudnn8                                  
BUILD_DOCKER_IMAGE_HVD_VERSION = v0.21.3
BUILD_DOCKER_IMAGE_MSDP_VERSION = v0.3.10


get_build_docker_image_pytorch_version:
        @echo $(BUILD_DOCKER_IMAGE_PYTORCH_VERSION)

get_build_docker_image_hvd_version:
        @echo $(BUILD_DOCKER_IMAGE_HVD_VERSION)

get_build_docker_image_msdp_version:
        @echo $(BUILD_DOCKER_IMAGE_MSDP_VERSION)

Then use it inside .circleci/config.yml (if possible):

# to get the pytorch version
build_docker_image_pytorch_version=$(make get_build_docker_image_pytorch_version)
...

And the same for GHA:

           export PTH_VERSION=`make get_build_docker_image_pytorch_version` 

trsvchn avatar Mar 30 '21 11:03 trsvchn

Yes, we can do something like that, but I'm not a fan of adding other scripting languages on top of bash and Python... We can think of https://github.com/pydoit/doit or plain Python for that if needed.

vfdev-5 avatar Mar 30 '21 11:03 vfdev-5

Yeah, agreed, a Makefile is not the most obvious tool. There is "the strangely familiar workflow utility" from Ken Reitz: https://github.com/kenreitz42/bake

trsvchn avatar Mar 30 '21 11:03 trsvchn

No, let's keep things without new deps

vfdev-5 avatar Mar 30 '21 11:03 vfdev-5

@ydcjeff thanks for providing these options! Yes, there are pros and cons to all of those approaches. Maybe a third approach is to read the docker values with something like:

python -c "import yaml; f=open('.circleci/config.yml'); d=yaml.safe_load(f); print(d['parameters']['build_docker_image_pytorch_version']['default'])"

Another idea is to add these lines to a new docker.cfg ini file; then there's no need to use pyyaml, and the values stay plain strings:

[DEFAULT]
build_docker_image_pytorch_version = 1.8.1-cuda11.1-cudnn8                                  
build_docker_image_hvd_version = v0.21.3
build_docker_image_msdp_version = v0.3.10

Then:

python -c "import configparser; cp = configparser.ConfigParser(); cp.read('docker.cfg'); print(cp['DEFAULT']['build_docker_image_pytorch_version'])"
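
A slightly fuller stdlib-only sketch of that lookup (the read_versions helper name is mine, not anything from the repo):

```python
# Read the pinned build versions from docker.cfg using only the stdlib.
import configparser


def read_versions(path="docker.cfg"):
    cp = configparser.ConfigParser()
    parsed = cp.read(path)  # returns the list of files actually parsed
    if not parsed:
        raise FileNotFoundError(path)
    # ConfigParser lower-cases keys by default; every value is a string
    return dict(cp["DEFAULT"])
```

One gotcha: ConfigParser.read() returns the list of successfully parsed file names, not the parser object, so the parser has to be kept in its own variable before indexing into ['DEFAULT'].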

trsvchn avatar Mar 30 '21 12:03 trsvchn

Sounds good @trsvchn

vfdev-5 avatar Mar 30 '21 12:03 vfdev-5