CI for GPU
This is a placeholder for discussion: we really need to implement a CI for GPU.
I very much agree. Are there any current obstacles that need to be removed in order to get a CI for GPU, and what needs to be done?
My plan is/was:
- Jenkins CI at CSCS to run on the Daint GPU partition: Jenkins is ready and only needs configuration. Since it is tied to projects (s238, g90), only Jürg, Alfio and I have access to those instances at the moment (plus Patrick for s238).
- Jenkins CI on the HutterGroup infrastructure: the host is there; I only have to install a Docker image of Jenkins and configure it. Testing would run on tcgpu1 or tcgpu2 using the NVIDIA Docker runtime (packages for openSUSE are already prepared). I started this last week, but due to approaching deadlines for other projects I will not be able to continue until the end of next week.
Great. If there's something I can help with, do let me know. Good luck with the approaching deadlines :)
Progress report: I have a PoC running with the following Jenkins (@CSCS) pipeline configuration:
node {
    stage('checkout') {
        checkout([$class: 'GitSCM',
            userRemoteConfigs: [[url: 'https://github.com/cp2k/dbcsr.git']],
            branches: [[name: '*/develop']],
            browser: [$class: 'GithubWeb', repoUrl: 'https://github.com/cp2k/dbcsr'],
            doGenerateSubmoduleConfigurations: false,
            extensions: [[$class: 'SubmoduleOption',
                disableSubmodules: false,
                parentCredentials: false,
                recursiveSubmodules: true,
                reference: '',
                trackingSubmodules: false]],
            submoduleCfg: []
        ])
    }
    stage('build&test') {
        sh 'sbatch --account="${JOB_NAME%%/*}" --job-name="${JOB_BASE_NAME}" --wait /users/timuel/job.sh'
    }
}
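One of the open points listed further below is getting the actual test output back into Jenkins. A minimal sketch of how the build&test stage could surface the job's log files in the Jenkins console via `readFile`/`echo`, assuming `${SCRATCH}` is also set on and reachable from the Jenkins agent (the file names match what the `job.sh` below writes):

stage('build&test') {
    // submit the batch job and block until it finishes;
    // job.sh writes cmake.out, make.out and make-test.out to ${SCRATCH}/${BUILD_TAG}
    sh 'sbatch --account="${JOB_NAME%%/*}" --job-name="${JOB_BASE_NAME}" --wait /users/timuel/job.sh'

    // surface the job output in the Jenkins console log
    // (assumes the agent sees the same filesystem as the compute node)
    echo readFile("${env.SCRATCH}/${env.BUILD_TAG}/cmake.out")
    echo readFile("${env.SCRATCH}/${env.BUILD_TAG}/make.out")
    echo readFile("${env.SCRATCH}/${env.BUILD_TAG}/make-test.out")
}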
The `job.sh` referenced in the pipeline:
#!/bin/bash -l
#SBATCH --export=ALL
#SBATCH --exclusive
#SBATCH --constraint="gpu"
#SBATCH --partition="cscsci"
#SBATCH --time="1:00:00"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=6
#SBATCH --cpus-per-task=2
#SBATCH --ntasks-per-core=1 # 1=no HT, 2=HT
set -o errexit
set -o nounset
set -o pipefail
module swap PrgEnv-cray PrgEnv-gnu
module load daint-gpu cudatoolkit CMake/3.12.0
module unload cray-libsci_acc
set -o xtrace
umask 0002 # make sure group members can access the data
mkdir --mode=0775 -p "${SCRATCH}/${BUILD_TAG}"
cd "${SCRATCH}/${BUILD_TAG}"
cmake \
    -DUSE_CUDA=ON \
    -DUSE_CUBLAS=ON \
    -DWITH_GPU=P100 \
    -DMPIEXEC_EXECUTABLE="$(command -v srun)" \
    -DTEST_MPI_RANKS=${SLURM_NTASKS} \
    "${WORKSPACE}" |& tee cmake.out
make VERBOSE=1 -j |& tee make.out
export CRAY_CUDA_MPS=1 # enable the CUDA proxy for MPI+CUDA
export OMP_PROC_BIND=TRUE # set thread affinity
# OMP_NUM_THREADS is set by cmake
# document the current environment
env |& tee env.out
env CTEST_OUTPUT_ON_FAILURE=1 make test |& tee make-test.out
What's left:
- [x] GitHub integration
- [x] Decide on how and where to store the `job.sh`
- [x] Figure out how to get the actual test output back to Jenkins (probably a `readFile` + `echo`; cf. the sketch after the pipeline configuration above)
- [x] Handle slurm errors properly (it seems that when the job gets killed due to the timelimit, `sbatch` still returns an error code of 0, probably also for other sorts of errors in the scheduler)
- [x] Split build and test jobs for better tracking via Jenkins (possibly even test parallelization); a sketch follows below
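A rough sketch of such a split, assuming the build and test steps are moved into two separate, hypothetical batch scripts (job-build.sh and job-test.sh) that reuse the same ${SCRATCH}/${BUILD_TAG} directory; the checkout stage stays as in the pipeline above:

node {
    // ... checkout stage as above ...
    stage('build') {
        // hypothetical job-build.sh: module setup, cmake and make only
        sh 'sbatch --account="${JOB_NAME%%/*}" --job-name="${JOB_BASE_NAME}-build" --wait /users/timuel/job-build.sh'
    }
    stage('test') {
        // hypothetical job-test.sh: reuses the build tree and only runs "make test"
        sh 'sbatch --account="${JOB_NAME%%/*}" --job-name="${JOB_BASE_NAME}-test" --wait /users/timuel/job-test.sh'
        // surface the CTest output as in the sketch further up
        echo readFile("${env.SCRATCH}/${env.BUILD_TAG}/make-test.out")
    }
}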
Note regarding the handling of slurm errors: `sbatch --wait` returns a non-zero exit code if the script returned non-zero, and it should likewise return non-zero if there was a problem on the scheduler side itself. So in both cases we should see a failure in that step. However, in my tests I once had a termination due to the time limit being reached which did not result in a step failure.
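One possible way to make this more robust (a sketch, not what is currently deployed): submit with `sbatch --parsable` to capture the job ID, and after the wait query the final job state with `sacct`, failing the step for anything other than COMPLETED (which also catches TIMEOUT, NODE_FAIL, etc.):

stage('build&test') {
    // --parsable prints only the job ID; --wait still blocks until the job finishes.
    // The '|| true' defers failure handling to the explicit state check below.
    def jobid = sh(returnStdout: true,
                   script: 'sbatch --parsable --wait --account="${JOB_NAME%%/*}" --job-name="${JOB_BASE_NAME}" /users/timuel/job.sh || true').trim()
    // the first sacct line is the state of the job itself (not of its steps)
    def state = sh(returnStdout: true,
                   script: "sacct -j ${jobid} --format=State --parsable2 --noheader | head -n1").trim()
    if (state != 'COMPLETED') {
        error "slurm job ${jobid} finished in state '${state}'"
    }
}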
keeping this open for the CI on our infra