ClimaCoupler.jl
debug 2 GPU error
Purpose
Debug the error detailed in https://github.com/CliMA/ClimaCoupler.jl/issues/687
Examples
- 2 GPU benchmark run on clima fails: https://buildkite.com/clima/climacoupler-cpu-gpu-benchmarks/builds/168#_
- 2 GPU versions of regular CI on new-central pass: https://buildkite.com/clima/climacoupler-ci/builds/4042
To-do
- [ ] reproduce bug
- run 2-GPU jobs (on clima ?)
Content
To reproduce (full driver)
Run the following pipeline step on Buildkite:
- label: "2 GPU AMIP with diagnostic EDMF"
  key: "gpu_2_amip_diagedmf"
  command: "srun julia --threads=3 --color=yes --project=experiments/ClimaEarth/ experiments/ClimaEarth/run_amip.jl --config_file config/benchmark_configs/amip_diagedmf.yml --job_id gpu_2_amip_diagedmf"
  artifact_paths: "experiments/ClimaEarth/output/amip/gpu_2_amip_diagedmf_artifacts/*"
  agents:
    slurm_gpus_per_task: 1
    slurm_cpus_per_task: 4
    slurm_ntasks: 2
    slurm_mem: 16GB
Alternatively, to run interactively, start a REPL session with two tasks, each with one GPU, and run the driver experiments/ClimaEarth/run_amip.jl with the config file config/benchmark_configs/amip_diagedmf.yml.
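A rough sketch of one way to set up the interactive run, for illustration only. It assumes the `slurm_*` agent keys in the Buildkite step correspond to the standard Slurm options `--ntasks`, `--gpus-per-task`, `--cpus-per-task`, and `--mem`, and that an allocation can be requested with `salloc`; neither is confirmed for the clima cluster. The script is a dry run: it only prints the two commands so they can be checked before being run by hand.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the commands instead of executing them.
# ASSUMPTION: the Buildkite `slurm_*` agent keys map one-to-one onto these
# Slurm options; the cluster may need extra flags (partition, account, time).
set -euo pipefail

NTASKS=2          # slurm_ntasks: 2
GPUS_PER_TASK=1   # slurm_gpus_per_task: 1
CPUS_PER_TASK=4   # slurm_cpus_per_task: 4
MEM=16G           # slurm_mem: 16GB

CONFIG="config/benchmark_configs/amip_diagedmf.yml"
JOB_ID="gpu_2_amip_diagedmf"

# Step 1: request an interactive allocation with two tasks, one GPU each.
ALLOC_CMD="salloc --ntasks=${NTASKS} --gpus-per-task=${GPUS_PER_TASK} --cpus-per-task=${CPUS_PER_TASK} --mem=${MEM}"

# Step 2: inside the allocation, launch the driver exactly as the Buildkite step does.
RUN_CMD="srun julia --threads=3 --color=yes --project=experiments/ClimaEarth/ experiments/ClimaEarth/run_amip.jl --config_file ${CONFIG} --job_id ${JOB_ID}"

echo "${ALLOC_CMD}"
echo "${RUN_CMD}"
```

Run the first printed command from the repository root, then the second one inside the resulting allocation. The second command is copied verbatim from the Buildkite step, so the interactive run should exercise the same code path as the failing benchmark.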
To reproduce (MRE)
in progress...