ClimaCoupler.jl

Debug 2-GPU error

Open juliasloan25 opened this issue 8 months ago • 0 comments

Purpose

Debug the error detailed in https://github.com/CliMA/ClimaCoupler.jl/issues/687

Examples

  • The 2-GPU benchmark run on clima fails: https://buildkite.com/clima/climacoupler-cpu-gpu-benchmarks/builds/168#_
  • The 2-GPU versions of the regular CI jobs on new-central pass: https://buildkite.com/clima/climacoupler-ci/builds/4042

To-do

  • [ ] Reproduce the bug
    • Run 2-GPU jobs (on clima?)

Content

To reproduce (full driver)

Run the following step on Buildkite:

      - label: "2 GPU AMIP with diagnostic EDMF"
        key: "gpu_2_amip_diagedmf"
        command: "srun julia --threads=3 --color=yes --project=experiments/ClimaEarth/ experiments/ClimaEarth/run_amip.jl --config_file config/benchmark_configs/amip_diagedmf.yml --job_id gpu_2_amip_diagedmf"
        artifact_paths: "experiments/ClimaEarth/output/amip/gpu_2_amip_diagedmf_artifacts/*"
        agents:
          slurm_gpus_per_task: 1
          slurm_cpus_per_task: 4
          slurm_ntasks: 2
          slurm_mem: 16GB

Or, to run interactively, start a REPL session with two tasks, each with one GPU, and run the driver experiments/ClimaEarth/run_amip.jl with the config file config/benchmark_configs/amip_diagedmf.yml
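For a non-Buildkite launch, here is a rough sketch of a shell wrapper that assembles the same command with the step's resource requests passed directly to srun. The mapping from the agent keys (slurm_ntasks, slurm_gpus_per_task, slurm_cpus_per_task, slurm_mem) to the srun flags below is my assumption, not something taken from the pipeline, and the function name repro and the DRY_RUN variable are made up for illustration. Any environment the pipeline sets for the step is not shown in this issue and is not reproduced here. By default the wrapper only prints the command so it can be checked before launching.

```shell
# Hypothetical helper: build the 2-GPU AMIP launch command.
# Assumption: these srun flags correspond to the Buildkite agent keys
#   slurm_ntasks: 2, slurm_gpus_per_task: 1, slurm_cpus_per_task: 4, slurm_mem: 16GB
repro() {
  # Replace the function's positional parameters with the full command line.
  set -- srun --ntasks=2 --gpus-per-task=1 --cpus-per-task=4 --mem=16G \
    julia --threads=3 --color=yes --project=experiments/ClimaEarth/ \
    experiments/ClimaEarth/run_amip.jl \
    --config_file config/benchmark_configs/amip_diagedmf.yml \
    --job_id gpu_2_amip_diagedmf

  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run (default): print the command instead of executing it.
    printf '%s\n' "$*"
  else
    # Launch for real; run this from the ClimaCoupler.jl repository root.
    "$@"
  fi
}

# Print the command; use `DRY_RUN=0 repro` to actually launch it.
repro
```

Running with DRY_RUN=0 needs to happen from the repository root, since the project and driver paths are relative.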

To reproduce (MRE)

in progress...

juliasloan25, Jun 18 '24 22:06