ignite
Improve version updates for docker building
- https://github.com/pytorch/ignite/pull/1878#issuecomment-808041919
cc @trsvchn @ydcjeff
@vfdev-5 any idea how we can import YAML inside YAML? I found out that CircleCI YAML is just vanilla YAML, so we can't do it out of the box.
Two options I have found so far:
- use pyyaml and generate `config.yml`, just like pytorch and the domain repos do
- use the circleci CLI to generate a valid `config.yml` with the `circleci config pack` command
I haven't tried option 1, but I have tried option 2. Directory structure:
.circleci/src
├── commands
│ ├── install_dependencies.yml
│ ├── install_latest_nvidia.yml
│ ├── pull_pytorch_stable_devel_image.yml
│ ├── pull_pytorch_stable_image.yml
│ ├── run_pytorch_container.yml
│ └── run_pytorch_devel_container.yml
├── config.yml
├── executors
│ ├── one_gpu.yml
│ ├── one_gpu_windows.yml
│ └── two_gpus.yml
└── jobs
├── build_publish_docker_images.yml
├── one_gpu_tests.yml
├── one_gpu_windows_tests.yml
├── two_gpus_check_dist_cifar10_example.yml
├── two_gpus_hvd_tests.yml
└── two_gpus_tests.yml
3 directories, 16 files
The folder names correspond to the top-level keys we define in `.circleci/config.yml` (the `jobs` folder contains the jobs we will run, and likewise for `commands` and `executors`). What I don't like is that this creates many files, each containing only a small amount of configuration. But what do you think? @vfdev-5 @trsvchn
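For reference, option 1 could look roughly like the script below: merge the split files back into a single `config.yml` with pyyaml, using the same folder-to-key mapping. This is only a sketch I haven't tried; the script name `merge_config.py` and the merge logic are hypothetical, not something we have in the repo.

```python
# merge_config.py -- hypothetical sketch of option 1 (not in the repo):
# merge .circleci/src/*.yml back into a single .circleci/config.yml with pyyaml.
from pathlib import Path

import yaml

SRC = Path(".circleci/src")

# Start from the top-level file (version, parameters, workflows).
config = yaml.safe_load((SRC / "config.yml").read_text())

# Each sub-folder becomes a top-level key, and each file inside it becomes
# one entry named after the file stem (commands/, executors/, jobs/).
for folder in ("commands", "executors", "jobs"):
    section = config.setdefault(folder, {})
    for path in sorted((SRC / folder).glob("*.yml")):
        section[path.stem] = yaml.safe_load(path.read_text())

Path(".circleci/config.yml").write_text(yaml.safe_dump(config, sort_keys=False))
```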
What's inside the above `config.yml`:
version: 2.1
parameters:
pytorch_stable_image:
type: string
# https://hub.docker.com/r/pytorch/pytorch/tags
default: "pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime"
pytorch_stable_image_devel:
type: string
# https://hub.docker.com/r/pytorch/pytorch/tags
default: "pytorch/pytorch:1.8.1-cuda11.1-cudnn8-devel"
workingdir:
type: string
default: "/tmp/ignite"
should_build_docker_images:
type: boolean
default: false
should_publish_docker_images:
type: boolean
default: false
build_docker_image_pytorch_version:
type: string
default: "1.8.1-cuda11.1-cudnn8"
build_docker_image_hvd_version:
type: string
default: "v0.21.3"
build_docker_image_msdp_version:
type: string
default: "v0.3.10"
workflows:
version: 2
gpu_tests:
unless: << pipeline.parameters.should_build_docker_images >>
jobs:
- one_gpu_tests
- one_gpu_windows_tests
- two_gpus_tests
- two_gpus_check_dist_cifar10_example
- two_gpus_hvd_tests
docker_images:
when: << pipeline.parameters.should_build_docker_images >>
jobs:
- build_publish_docker_images
Here's the output of option 2 (`circleci config pack`):
commands:
install_dependencies:
steps:
- run:
command: |
docker exec -it pthd pip install -r requirements-dev.txt
export install_apex_cmd='pip install -v --disable-pip-version-check --no-cache-dir git+https://github.com/NVIDIA/apex'
export install_git_apex_cmd="apt-get update && apt-get install -y --no-install-recommends git && ${install_apex_cmd}"
docker exec -it pthd /bin/bash -c "$install_git_apex_cmd"
export install_ignite_cmd='python setup.py install'
docker exec -it pthd /bin/bash -c "$install_ignite_cmd"
name: Install dependencies
install_latest_nvidia:
steps:
- run:
command: |
sudo apt-get purge nvidia* && sudo apt-get autoremove
sudo apt-get update && sudo apt-get install -y --no-install-recommends nvidia-455 cuda-drivers-455
# Install nvidia-container-runtime
sudo apt-get install -y nvidia-container-runtime
# Reload driver : https://stackoverflow.com/a/45319156/6309199
# lsof | grep nvidia -> kill Xvfb
sudo lsof | grep "/usr/bin/Xvfb" | head -1 | awk '{print $2}' | xargs -I {} sudo kill -9 {}
# lsmod | grep nvidia
sudo rmmod nvidia_uvm && sudo rmmod nvidia_drm && sudo rmmod nvidia_modeset && sudo rmmod nvidia
# reload driver
nvidia-smi
name: Install latest NVidia-driver and CUDA
pull_pytorch_stable_devel_image:
steps:
- run:
command: |
docker pull << pipeline.parameters.pytorch_stable_image_devel >>
name: Pull PyTorch Stable Develop Image
pull_pytorch_stable_image:
steps:
- run:
command: |
docker pull << pipeline.parameters.pytorch_stable_image >>
name: Pull PyTorch Stable Image
run_pytorch_container:
steps:
- run:
command: |
docker run --gpus=all --rm -itd --shm-size 16G -v ${wd}:/ignite -w /ignite --name pthd << pipeline.parameters.pytorch_stable_image >>
docker exec -it pthd nvidia-smi
docker exec -it pthd ls
environment:
wd: << pipeline.parameters.workingdir >>
name: Start Pytorch container
run_pytorch_devel_container:
steps:
- run:
command: |
docker run --gpus=all --rm -itd --shm-size 16G -v ${wd}:/ignite -w /ignite --name pthd << pipeline.parameters.pytorch_stable_image_devel >>
docker exec -it pthd nvidia-smi
docker exec -it pthd ls
environment:
wd: << pipeline.parameters.workingdir >>
name: Start Pytorch dev container
executors:
one_gpu:
machine:
docker_layer_caching: true
image: ubuntu-1604-cuda-11.1:202012-01
resource_class: gpu.small
one_gpu_windows:
machine:
image: windows-server-2019-nvidia:stable
resource_class: windows.gpu.nvidia.medium
shell: bash.exe
two_gpus:
machine:
docker_layer_caching: true
image: ubuntu-1604-cuda-11.1:202012-01
resource_class: gpu.medium
jobs:
build_publish_docker_images:
docker:
- image: cimg/python:3.8.8
resource_class: 2xlarge
steps:
- checkout
- setup_remote_docker:
docker_layer_caching: true
version: 19.03.14
- run:
command: |
pip --version
pip install docker
name: Install deps
- run:
command: |
cd docker
export PTH_VERSION=<< pipeline.parameters.build_docker_image_pytorch_version >>
export HVD_VERSION=<< pipeline.parameters.build_docker_image_hvd_version >>
bash build.sh hvd hvd-base
bash build.sh hvd hvd-vision
bash build.sh hvd hvd-nlp
bash build.sh hvd hvd-apex
bash build.sh hvd hvd-apex-vision
bash build.sh hvd hvd-apex-nlp
name: Build all Horovod flavoured PyTorch-Ignite images
- run:
command: |
cd docker
export PTH_VERSION=<< pipeline.parameters.build_docker_image_pytorch_version >>
bash build.sh main base
bash build.sh main vision
bash build.sh main nlp
bash build.sh main apex
bash build.sh main apex-vision
bash build.sh main apex-nlp
name: Build all PyTorch-Ignite images
- run:
command: |
cd docker
export PTH_VERSION=<< pipeline.parameters.build_docker_image_pytorch_version >>
export MSDP_VERSION=<< pipeline.parameters.build_docker_image_msdp_version >>
bash build.sh msdp msdp-apex
bash build.sh msdp msdp-apex-vision
bash build.sh msdp msdp-apex-nlp
name: Build all MS DeepSpeed flavoured PyTorch-Ignite images
- run:
command: docker images | grep pytorchignite
name: List built images
- when:
condition: << pipeline.parameters.should_publish_docker_images >>
steps:
- run:
command: |
cd docker
sh ./push_all.sh
name: Push all PyTorch-Ignite Docker images
working_directory: << pipeline.parameters.workingdir >>
one_gpu_tests:
executor: one_gpu
steps:
- checkout
- run:
command: |
bash .circleci/trigger_if_modified.sh "^(ignite|tests|examples|\.circleci).*"
name: Trigger job if modified
- pull_pytorch_stable_image
- run_pytorch_container
- install_dependencies
- run:
command: |4
# pytest on cuda
export test_cmd='bash tests/run_gpu_tests.sh'
docker exec -it pthd /bin/bash -c "${test_cmd}"
# MNIST tests
# 0) download MNIST
# https://github.com/pytorch/ignite/issues/1737
export raw_mnist_dir='./MNIST/raw'
export download_mnist_cmd="git clone https://github.com/pytorch-ignite/download-mnist-github-action.git $raw_mnist_dir"
docker exec -it pthd /bin/bash -c "$download_mnist_cmd"
export mnist0_cmd="CUDA_VISIBLE_DEVICES=0 python $raw_mnist_dir/run.py ."
docker exec -it pthd /bin/bash -c "$mnist0_cmd"
# 1) mnist.py
export minst1_cmd='CUDA_VISIBLE_DEVICES=0 python examples/mnist/mnist.py --epochs=1'
docker exec -it pthd /bin/bash -c "$minst1_cmd"
# 2) mnist_with_visdom.py
export visdom_script_cmd='python -c "from visdom.server import download_scripts; download_scripts()"'
export visdom_cmd='python -m visdom.server'
docker exec -d pthd /bin/bash -c "$visdom_script_cmd && $visdom_cmd"
export sleep_cmd='sleep 10'
export mnist2_cmd='python examples/mnist/mnist_with_visdom.py --epochs=1'
docker exec -it pthd /bin/bash -c "$sleep_cmd && $mnist2_cmd"
# 3.1) mnist_with_tensorboard.py with tbX
export mnist3_cmd='CUDA_VISIBLE_DEVICES=0 python examples/mnist/mnist_with_tensorboard.py --epochs=1'
docker exec -it pthd /bin/bash -c "$mnist3_cmd"
# uninstall tensorboardX
export pip_cmd='pip uninstall -y tensorboardX'
docker exec -it pthd /bin/bash -c "$pip_cmd"
# 3.2) mnist_with_tensorboard.py with native torch tb
docker exec -it pthd /bin/bash -c "$mnist3_cmd"
# 4) mnist_save_resume_engine.py
# save
export mnist4_cmd='CUDA_VISIBLE_DEVICES=0 python examples/mnist/mnist_save_resume_engine.py --epochs=2 --crash_iteration 1100'
docker exec -it pthd /bin/bash -c "$mnist4_cmd"
# resume
export mnist4_cmd='CUDA_VISIBLE_DEVICES=0 python examples/mnist/mnist_save_resume_engine.py --epochs=2 --resume_from=/tmp/mnist_save_resume/checkpoint_1.pt'
docker exec -it pthd /bin/bash -c "$mnist4_cmd"
name: Run GPU Unit Tests and Examples
- run:
command: |
bash <(curl -s https://codecov.io/bash) -Z -F gpu
name: Codecov upload
working_directory: << pipeline.parameters.workingdir >>
one_gpu_windows_tests:
executor: one_gpu_windows
steps:
- checkout
- run:
command: |
bash .circleci/trigger_if_modified.sh "^(ignite|tests|examples|\.circleci).*"
name: Trigger job if modified
- run:
command: |
conda --version
conda install -y pytorch torchvision cudatoolkit=11.1 -c pytorch -c conda-forge
pip install -r requirements-dev.txt
pip install .
name: Install dependencies
- run:
command: |
# pytest on cuda
SKIP_DISTRIB_TESTS=1 bash tests/run_gpu_tests.sh
name: Run GPU Unit Tests
working_directory: << pipeline.parameters.workingdir >>
two_gpus_check_dist_cifar10_example:
executor: two_gpus
steps:
- checkout
- run:
command: |
bash .circleci/trigger_if_modified.sh "^(ignite|tests|examples|\.circleci).*"
name: Trigger job if modified
- pull_pytorch_stable_image
- run_pytorch_container
- install_dependencies
- run:
command: |
docker exec -it pthd pip install fire
name: Install additional example dependencies
- run:
command: |
export example_path="examples/contrib/cifar10"
# initial run
export stop_cmd="--stop_iteration=500"
export test_cmd="CI=1 python ${example_path}/main.py run --checkpoint_every=200"
docker exec -it pthd /bin/bash -c "${test_cmd} ${stop_cmd}"
# resume
export resume_opt="--resume-from=/tmp/output-cifar10/resnet18_backend-None-1_stop-on-500/training_checkpoint_400.pt"
docker exec -it pthd /bin/bash -c "${test_cmd} --num_epochs=7 ${resume_opt}"
name: Run without backend
- run:
command: |
export example_path="examples/contrib/cifar10"
# initial run
export stop_cmd="--stop_iteration=500"
export test_cmd="CI=1 python -u -m torch.distributed.launch --nproc_per_node=2 --use_env ${example_path}/main.py run --backend=nccl --checkpoint_every=200"
docker exec -it pthd /bin/bash -c "${test_cmd} ${stop_cmd}"
# resume
export resume_opt="--resume-from=/tmp/output-cifar10/resnet18_backend-nccl-2_stop-on-500/training_checkpoint_400.pt"
docker exec -it pthd /bin/bash -c "${test_cmd} --num_epochs=7 ${resume_opt}"
name: Run with NCCL backend using torch dist launch
- run:
command: |
export example_path="examples/contrib/cifar10"
# initial run
export stop_cmd="--stop_iteration=500"
export test_cmd="CI=1 python -u ${example_path}/main.py run --backend=nccl --nproc_per_node=2 --checkpoint_every=200"
docker exec -it pthd /bin/bash -c "${test_cmd} ${stop_cmd}"
# resume
export resume_opt="--resume-from=/tmp/output-cifar10/resnet18_backend-nccl-2_stop-on-500/training_checkpoint_400.pt"
docker exec -it pthd /bin/bash -c "${test_cmd} --num_epochs=7 ${resume_opt}"
name: Run with NCCL backend using spawn
working_directory: << pipeline.parameters.workingdir >>
two_gpus_hvd_tests:
executor: two_gpus
steps:
- checkout
- run:
command: |
bash .circleci/trigger_if_modified.sh "^(ignite|tests|examples|\.circleci).*"
name: Trigger job if modified
- pull_pytorch_stable_devel_image
- run_pytorch_devel_container
- install_dependencies
- run:
command: |4
# Following https://github.com/horovod/horovod/blob/master/Dockerfile.test.gpu
# and https://github.com/horovod/horovod/issues/1944#issuecomment-628192778
docker exec -it pthd /bin/bash -c "apt-get update && apt-get install -y git"
docker exec -it pthd /bin/bash -c "git clone --recursive https://github.com/horovod/horovod.git /horovod && cd /horovod && python setup.py sdist"
docker exec -it pthd /bin/bash -c "conda install -y cmake nccl=2.8 -c conda-forge"
docker exec -it pthd /bin/bash -c 'cd /horovod && HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_LINK=SHARED HOROVOD_WITHOUT_MPI=1 HOROVOD_WITH_PYTORCH=1 pip install -v $(ls /horovod/dist/horovod-*.tar.gz) && ldconfig'
docker exec -it pthd horovodrun --check-build
name: Install Horovod with NCCL GPU ops
- run:
command: |
export test_cmd='bash tests/run_gpu_tests.sh'
docker exec -it pthd /bin/bash -c "${test_cmd}"
# no CUDA devices Horovod tests
export test_cmd='CUDA_VISIBLE_DEVICES="" pytest --cov ignite --cov-append --cov-report term-missing --cov-report xml -vvv tests/ -m distributed'
docker exec -it pthd /bin/bash -c "${test_cmd}"
name: Run 1 Node 2 GPUs Unit Tests
- run:
command: |
bash <(curl -s https://codecov.io/bash) -Z -F gpu-2-hvd
name: Codecov upload
- run:
command: |
docker exec -it pthd pip install fire
export example_path="examples/contrib/cifar10"
# initial run
export stop_cmd="--stop_iteration=500"
export test_cmd="cd ${example_path} && CI=1 horovodrun -np 2 python -u main.py run --backend=horovod --checkpoint_every=200"
docker exec -it pthd /bin/bash -c "${test_cmd} ${stop_cmd}"
# resume
export resume_opt="--resume-from=/tmp/output-cifar10/resnet18_backend-horovod-2_stop-on-500/training_checkpoint_400.pt"
docker exec -it pthd /bin/bash -c "${test_cmd} --num_epochs=7 ${resume_opt}"
name: Check CIFAR10 using horovodrun
- run:
command: |
export example_path="examples/contrib/cifar10"
# initial run
export stop_cmd="--stop_iteration=500"
export test_cmd="cd ${example_path} && CI=1 python -u main.py run --backend=horovod --nproc_per_node=2 --checkpoint_every=200"
docker exec -it pthd /bin/bash -c "${test_cmd} ${stop_cmd}"
# resume
export resume_opt="--resume-from=/tmp/output-cifar10/resnet18_backend-horovod-2_stop-on-500/training_checkpoint_400.pt"
docker exec -it pthd /bin/bash -c "${test_cmd} --num_epochs=7 ${resume_opt}"
name: Check CIFAR10 using spawn
working_directory: << pipeline.parameters.workingdir >>
two_gpus_tests:
executor: two_gpus
steps:
- checkout
- run:
command: |
bash .circleci/trigger_if_modified.sh "^(ignite|tests|examples|\.circleci).*"
name: Trigger job if modified
- pull_pytorch_stable_image
- run_pytorch_container
- install_dependencies
- run:
command: |
export test_cmd='bash tests/run_gpu_tests.sh 2'
docker exec -it pthd /bin/bash -c "${test_cmd}"
name: Run 1 Node 2 GPUs Unit Tests
- run:
command: |
bash <(curl -s https://codecov.io/bash) -Z -F gpu-2
name: Codecov upload
working_directory: << pipeline.parameters.workingdir >>
parameters:
build_docker_image_hvd_version:
default: v0.21.3
type: string
build_docker_image_msdp_version:
default: v0.3.10
type: string
build_docker_image_pytorch_version:
default: 1.8.1-cuda11.1-cudnn8
type: string
pytorch_stable_image:
default: pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime
type: string
pytorch_stable_image_devel:
default: pytorch/pytorch:1.8.1-cuda11.1-cudnn8-devel
type: string
should_build_docker_images:
default: false
type: boolean
should_publish_docker_images:
default: false
type: boolean
workingdir:
default: /tmp/ignite
type: string
version: 2.1
workflows:
docker_images:
jobs:
- build_publish_docker_images
when: << pipeline.parameters.should_build_docker_images >>
gpu_tests:
jobs:
- one_gpu_tests
- one_gpu_windows_tests
- two_gpus_tests
- two_gpus_check_dist_cifar10_example
- two_gpus_hvd_tests
unless: << pipeline.parameters.should_build_docker_images >>
version: 2
any idea how we can import YAML inside YAML?
I think only GitLab has that feature
GitLab has a similar feature: the `include` keyword includes workflow templates, and in addition the `extends` keyword can be used to share small bits of YAML within the same YAML file.
@ydcjeff thanks for providing these options! Yes, there are pros/cons in all of those approaches. Maybe a third approach is to read the docker values with something like:
python -c "import yaml; f=open('.circleci/config.yml'); d=yaml.safe_load(f); print(d['parameters']['build_docker_image_pytorch_version']['default'])"
@trsvchn or @ydcjeff would you like to solve this issue? I'd like to build new docker images this week.
EDIT: Probably we can do that manually for now, until the issue is solved.
@vfdev-5 I have another idea, but I have zero experience with CircleCI. Can we do something like this?
Simply use a Makefile with the versions defined:
# Makefile
BUILD_DOCKER_IMAGE_PYTORCH_VERSION = 1.8.1-cuda11.1-cudnn8
BUILD_DOCKER_IMAGE_HVD_VERSION = v0.21.3
BUILD_DOCKER_IMAGE_MSDP_VERSION = v0.3.10
get_build_docker_image_pytorch_version:
@echo $(BUILD_DOCKER_IMAGE_PYTORCH_VERSION)
get_build_docker_image_hvd_version:
@echo $(BUILD_DOCKER_IMAGE_HVD_VERSION)
get_build_docker_image_msdp_version:
@echo $(BUILD_DOCKER_IMAGE_MSDP_VERSION)
Then use it inside the CircleCI config (if possible):
# to get the pytorch version
build_docker_image_pytorch_version = make get_build_docker_image_pytorch_version
...
And the same for GHA:
export PTH_VERSION=`make get_build_docker_image_pytorch_version`
Yes, we can do something like that, but I'm not a fan of adding yet another scripting language on top of bash and Python... We could consider https://github.com/pydoit/doit or plain Python for that if needed.
Yeah, I agree that Makefile is not a very obvious tool. There is "the strangely familiar workflow utility" from Ken Reitz:
https://github.com/kenreitz42/bake
No, let's keep things without new deps
@ydcjeff thanks for providing these options! Yes, there are pros/cons in all of those approaches. Maybe a third approach is to read the docker values with something like:
python -c "import yaml; f=open('.circleci/config.yml'); d=yaml.safe_load(f); print(d['parameters']['build_docker_image_pytorch_version']['default'])"
Another idea is to add these lines to a new `docker.cfg` INI file; then there is no need to use pyyaml, and we have plain strings here:
[DEFAULT]
build_docker_image_pytorch_version = 1.8.1-cuda11.1-cudnn8
build_docker_image_hvd_version = v0.21.3
build_docker_image_msdp_version = v0.3.10
Then:
python -c "import configparser; print(configparser.ConfigParser().read('docker.cfg')['DEFAULT']['build_docker_image_pytorch_version'])"
Sounds good @trsvchn