CalibrateEDMF.jl

Buildkite tests fail randomly due to errors when using Distributed

Open ilopezgp opened this issue 2 years ago • 0 comments

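Many of the Buildkite tests now fail intermittently with an error that originates in the use of Distributed. In the trace below, the failure surfaces in a `pmap` call in `integration_tests/julia_parallel_test.jl`.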

An example of this error is:

Stacktrace:
  [1] (::Base.var"#898#900")(x::Task)
    @ Base ./asyncmap.jl:177
  [2] foreach(f::Base.var"#898#900", itr::Vector{Any})
    @ Base ./abstractarray.jl:2712
  [3] maptwice(wrapped_f::Function, chnl::Channel{Any}, worker_tasks::Vector{Any}, c::Vector{String})
    @ Base ./asyncmap.jl:177
  [4] wrap_n_exec_twice
    @ ./asyncmap.jl:153 [inlined]
  [5] #async_usemap#883
    @ ./asyncmap.jl:103 [inlined]
  [6] #asyncmap#882
    @ ./asyncmap.jl:81 [inlined]
  [7] pmap(f::Function, p::WorkerPool, c::Vector{String}; distributed::Bool, batch_size::Int64, on_error::Nothing, retry_delays::Vector{Any}, retry_check::Nothing)
    @ Distributed /central/software/julia/1.7.3/share/julia/stdlib/v1.7/Distributed/src/pmap.jl:126
  [8] pmap(f::Function, p::WorkerPool, c::Vector{String})
    @ Distributed /central/software/julia/1.7.3/share/julia/stdlib/v1.7/Distributed/src/pmap.jl:101
  [9] pmap(f::Function, c::Vector{String}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Distributed /central/software/julia/1.7.3/share/julia/stdlib/v1.7/Distributed/src/pmap.jl:156
 [10] pmap(f::Function, c::Vector{String})
    @ Distributed /central/software/julia/1.7.3/share/julia/stdlib/v1.7/Distributed/src/pmap.jl:156
 [11] macro expansion
    @ /central/scratch/esm/slurm-buildkite/calibrateedmf-ci/2157/calibrateedmf-ci/integration_tests/julia_parallel_test.jl:65 [inlined]
 [12] macro expansion
    @ ./timing.jl:220 [inlined]
 [13] top-level scope
    @ /central/scratch/esm/slurm-buildkite/calibrateedmf-ci/2157/calibrateedmf-ci/integration_tests/julia_parallel_test.jl:61
in expression starting at /central/scratch/esm/slurm-buildkite/calibrateedmf-ci/2157/calibrateedmf-ci/integration_tests/julia_parallel_test.jl:60
┌ Warning: Forcibly interrupting busy workers
│   exception = rmprocs: pids [3, 4, 5, 6, 7, 8, 9, 10, 11] not terminated after 5.0 seconds.
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1249
┌ Warning: rmprocs: process 1 not removed
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1045

An example PR where this error results in tests randomly failing is https://github.com/CliMA/CalibrateEDMF.jl/pull/440.

How do we fix this issue?
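
One way to start narrowing this down might be to keep a single failing case from aborting the whole `pmap`, so the underlying worker exception can be logged. The trace shows `pmap` being called with its defaults (`on_error::Nothing`, empty `retry_delays`). Below is a minimal sketch of the `on_error` keyword, not a confirmed fix. `flaky` and `cases` are hypothetical stand-ins for whatever `julia_parallel_test.jl` maps over, and the failure is simulated.

```julia
using Distributed

# Hypothetical stand-in for the function mapped over in julia_parallel_test.jl.
# One input fails on purpose to simulate an error raised on a worker.
flaky(name::String) = name == "case_b" ? error("simulated failure in $name") : uppercase(name)

# Hypothetical inputs; the real test maps over a Vector{String}.
cases = ["case_a", "case_b", "case_c"]

# `on_error` receives the exception thrown for an element. Returning it
# (via `identity`) stores it in the result vector instead of aborting pmap.
results = pmap(flaky, cases; on_error = identity)

# Log every case that failed, together with the captured exception.
for (name, res) in zip(cases, results)
    if res isa Exception
        @warn "case failed" name exception = res
    end
end
```

If the captured exceptions turn out to be transient, `pmap` also accepts `retry_delays` (for example `ExponentialBackOff(n = 3)`). Whether retrying is appropriate here is an assumption that would need checking against the actual error.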

ilopezgp, Aug 22 '22 23:08