CalibrateEDMF.jl
Buildkite tests fail randomly due to errors when using Distributed
Many of the Buildkite tests now fail intermittently with an error that originates in the use of Distributed.
An example of this error is:
Stacktrace:
--
| [1] (::Base.var"#898#900")(x::Task)
| @ Base ./asyncmap.jl:177
| [2] foreach(f::Base.var"#898#900", itr::Vector{Any})
| @ Base ./abstractarray.jl:2712
| [3] maptwice(wrapped_f::Function, chnl::Channel{Any}, worker_tasks::Vector{Any}, c::Vector{String})
| @ Base ./asyncmap.jl:177
| [4] wrap_n_exec_twice
| @ ./asyncmap.jl:153 [inlined]
| [5] #async_usemap#883
| @ ./asyncmap.jl:103 [inlined]
| [6] #asyncmap#882
| @ ./asyncmap.jl:81 [inlined]
| [7] pmap(f::Function, p::WorkerPool, c::Vector{String}; distributed::Bool, batch_size::Int64, on_error::Nothing, retry_delays::Vector{Any}, retry_check::Nothing)
| @ Distributed /central/software/julia/1.7.3/share/julia/stdlib/v1.7/Distributed/src/pmap.jl:126
| [8] pmap(f::Function, p::WorkerPool, c::Vector{String})
| @ Distributed /central/software/julia/1.7.3/share/julia/stdlib/v1.7/Distributed/src/pmap.jl:101
| [9] pmap(f::Function, c::Vector{String}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
| @ Distributed /central/software/julia/1.7.3/share/julia/stdlib/v1.7/Distributed/src/pmap.jl:156
| [10] pmap(f::Function, c::Vector{String})
| @ Distributed /central/software/julia/1.7.3/share/julia/stdlib/v1.7/Distributed/src/pmap.jl:156
| [11] macro expansion
| @ /central/scratch/esm/slurm-buildkite/calibrateedmf-ci/2157/calibrateedmf-ci/integration_tests/julia_parallel_test.jl:65 [inlined]
| [12] macro expansion
| @ ./timing.jl:220 [inlined]
| [13] top-level scope
| @ /central/scratch/esm/slurm-buildkite/calibrateedmf-ci/2157/calibrateedmf-ci/integration_tests/julia_parallel_test.jl:61
| in expression starting at /central/scratch/esm/slurm-buildkite/calibrateedmf-ci/2157/calibrateedmf-ci/integration_tests/julia_parallel_test.jl:60
| ┌ Warning: Forcibly interrupting busy workers
| │ exception = rmprocs: pids [3, 4, 5, 6, 7, 8, 9, 10, 11] not terminated after 5.0 seconds.
| └ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1249
| ┌ Warning: rmprocs: process 1 not removed
| └ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1045
An example PR in which this error causes tests to fail randomly is https://github.com/CliMA/CalibrateEDMF.jl/pull/440.
How do we fix this issue?