CI for GPU

Open alazzaro opened this issue 6 years ago • 5 comments

This is a placeholder for discussion: we really need to implement a CI for GPU.

alazzaro avatar Feb 20 '19 16:02 alazzaro

I very much agree. Are there any current obstacles that need to be lifted in order to get a CI for GPU and what needs to be done?

shoshijak avatar Mar 04 '19 08:03 shoshijak

my plan is/was:

  • Jenkins CI at CSCS to run on the Daint GPU partition: Jenkins is ready and only needs configuration. Since it is tied to projects (s238, g90), only Jürg, Alfio and I have access to those instances atm (+Patrick for s238).
  • Jenkins CI on the HutterGroup infra: the host is there; we only have to install a Docker image of Jenkins and configure it. Testing would run on tcgpu1 or tcgpu2 using the Nvidia Docker (packages for openSUSE are already prepared). I started on this last week, but due to approaching deadlines for other projects I will not be able to continue until the end of next week.

dev-zero avatar Mar 04 '19 08:03 dev-zero

Great. If there's something I can help with, do let me know. Good luck with the approaching deadlines :)

shoshijak avatar Mar 04 '19 08:03 shoshijak

Progress report: I have a PoC running with the following Jenkins (@CSCS) pipeline configuration:

node {
   stage('checkout') {
        checkout([$class: 'GitSCM',
            userRemoteConfigs: [[url: 'https://github.com/cp2k/dbcsr.git']],
            branches: [[name: '*/develop']],
            browser: [$class: 'GithubWeb', repoUrl: 'https://github.com/cp2k/dbcsr'],
            doGenerateSubmoduleConfigurations: false,
            extensions: [[$class: 'SubmoduleOption',
                disableSubmodules: false,
                parentCredentials: false,
                recursiveSubmodules: true,
                reference: '',
                trackingSubmodules: false]],
            submoduleCfg: []
        ])
   }

   stage('build&test') {
        sh 'sbatch --account="${JOB_NAME%%/*}" --job-name="${JOB_BASE_NAME}" --wait /users/timuel/job.sh'
   }
}

and the following job script (/users/timuel/job.sh):

#!/bin/bash -l

#SBATCH --export=ALL
#SBATCH --exclusive
#SBATCH --constraint="gpu"
#SBATCH --partition="cscsci"
#SBATCH --time="1:00:00"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=6
#SBATCH --cpus-per-task=2
#SBATCH --ntasks-per-core=1 # 1=no HT, 2=HT

set -o errexit
set -o nounset
set -o pipefail

module swap PrgEnv-cray PrgEnv-gnu
module load daint-gpu cudatoolkit CMake/3.12.0
module unload cray-libsci_acc

set -o xtrace

umask 0002  # make sure group members can access the data

mkdir --mode=0775 -p "${SCRATCH}/${BUILD_TAG}"
cd "${SCRATCH}/${BUILD_TAG}"  # build out-of-source in the directory just created

cmake \
    -DWITH_GPU=P100 \
    -DMPIEXEC_EXECUTABLE="$(command -v srun)" \
    "${WORKSPACE}" |& tee cmake.out

make VERBOSE=1 -j |& tee make.out

export CRAY_CUDA_MPS=1 # enable the CUDA proxy for MPI+CUDA
export OMP_PROC_BIND=TRUE # set thread affinity
# OMP_NUM_THREADS is set by cmake

# document the current environment
env |& tee env.out

env CTEST_OUTPUT_ON_FAILURE=1 make test |& tee make-test.out

What's left:

  • [x] GitHub integration
  • [x] Decide on how and where to store the job.sh
  • [x] Figure out how to get the actual test output back to Jenkins (probably a readFile + echo)
  • [x] Handle Slurm errors properly (it seems that when the job gets killed due to the time limit, sbatch still returns an exit code of 0, and probably does the same for other kinds of scheduler errors)
  • [x] Split build and test jobs for better tracking via Jenkins (possibly even test parallelization)
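One way the build/test split from the last item could look is two Slurm jobs chained with a dependency. This is only a sketch: build.sh and test.sh are hypothetical names for the two halves of job.sh, and submit_build_and_test is an illustrative helper, not something that exists in the repository.

```shell
#!/bin/bash
# Sketch: chain a build job and a test job with a Slurm dependency.
# The script names passed in (e.g. build.sh, test.sh) are hypothetical.

submit_build_and_test() {
    local account="$1" build_script="$2" test_script="$3"
    local build_id

    # --parsable prints "jobid" (or "jobid;cluster"); keep only the job id
    build_id="$(sbatch --parsable --account="${account}" "${build_script}")"
    build_id="${build_id%%;*}"

    # afterok: start the test job only if the build job exited with code 0.
    # --kill-on-invalid-dep=yes cancels the test job if the build failed,
    # instead of leaving it pending forever.
    # --wait blocks until the test job has terminated.
    sbatch --wait --account="${account}" \
        --kill-on-invalid-dep=yes \
        --dependency="afterok:${build_id}" "${test_script}"
}
```

From a Jenkins sh step this would be called as, for example, `submit_build_and_test s238 build.sh test.sh`. Submitting each job from its own Jenkins stage instead would give finer-grained tracking, at the cost of passing the build job id between stages.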

Note on the handling of Slurm errors: sbatch --wait returns a non-zero exit code if the script exits with a non-zero code. It should likewise return non-zero if there was a problem on the scheduler side itself, so in both cases we should see a failure in that step. In my tests, however, a job was once terminated for reaching the time limit without the step failing.
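A possible safeguard against relying on the sbatch exit code alone is to also query the final job state from the accounting database. The following is a minimal sketch under assumptions: job_state_ok and submit_and_check are hypothetical helper names, and it presumes sacct is available on the submitting host and that the accounting record is already visible when sbatch returns (which may not always be the case).

```shell
#!/bin/bash
# Sketch: treat anything other than COMPLETED as a failed job.

# Hypothetical helper: succeed only for the COMPLETED state.
job_state_ok() {
    local state="${1%% *}"   # "CANCELLED by 1234" -> "CANCELLED"
    [[ "${state}" == "COMPLETED" ]]
}

# Hypothetical wrapper around sbatch; all arguments are passed through.
submit_and_check() {
    local jobid state rc=0

    # --parsable prints the job id, --wait blocks until the job terminates
    jobid="$(sbatch --parsable --wait "$@")" || rc=$?
    jobid="${jobid%%;*}"     # --parsable may append ";cluster"
    [[ -n "${jobid}" ]] || return 1   # submission itself failed

    # -X: allocation only (no job steps), -n: no header, -P: parsable output
    state="$(sacct -X -n -P -o State -j "${jobid}" | head -n 1)"
    echo "job ${jobid}: sbatch exit code ${rc}, final state ${state}"

    (( rc == 0 )) && job_state_ok "${state}"
}
```

With this, a job that ends in TIMEOUT, FAILED or CANCELLED makes the Jenkins step fail even if sbatch itself reported 0.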

dev-zero avatar Apr 11 '19 13:04 dev-zero

keeping this open for the CI on our infra

dev-zero avatar Jun 04 '19 08:06 dev-zero