Interactions between cmdstan instances through /proc
I have a weird issue on an HPC cluster with SLURM where chains of one model crash when a different model that is running simultaneously finishes sampling. The weird part is that, as a workaround, I had to containerize the different models and exclude /proc from the bind paths of the container. The details of the bug and the workaround are described in this post: https://discourse.mc-stan.org/t/race-conditions-between-independent-cmdstan-model-runs/30918
I am not sure if this is something specific to my environment, but given the surprising interaction through /proc, I thought it might be worth drawing attention to. Maybe someone has an idea of what could be causing such a bug.
I've experienced chains terminating unexpectedly (with no error messages or further information) when spawning multiple R processes that each use cmdstanr to sample from a different model (cmdstan 2.35.0, cmdstanr 0.8.1).
Sampling from each model in serial always succeeds, while running the same processes in parallel always results in at least one failure. As far as I can tell, it has nothing to do with the working directory or tempdir(). I haven't tried running the processes in separate containers with independent /proc directories. I don't have much experience with containers, but if I can find the time I'll give it a go and see whether it resolves the issue for me.
Sampling from multiple models in parallel succeeds when I create a new PID namespace for each process:
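# Run each model in its own PID namespace with a private /proc mount.
for model in ${MODELS}; do
    unshare --fork --pid --mount-proc --user --map-root-user ./run.R "${model}" &
done
wait  # block until all background runs have finished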
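Since the chains die without any error message, it may help to record the exit status of every background run. Below is a minimal sketch, not something from the original posts: `run_models` is a hypothetical helper name, and it assumes model names contain no whitespace (as the loop above already does). It starts one background job per model, waits for each one individually, and prints the status.

```shell
# Hypothetical helper: run "<runner> <model>" in the background for every
# model, then wait on each job separately and report its exit status.
# The function body is a subshell, so its variables stay local.
run_models() (
    runner="$1"
    shift
    jobs_started=""
    for model in "$@"; do
        "$runner" "$model" &
        # Remember "pid:model" pairs (assumes no whitespace in model names).
        jobs_started="$jobs_started $!:$model"
    done
    failed=0
    for entry in $jobs_started; do
        pid=${entry%%:*}
        name=${entry#*:}
        if wait "$pid"; then
            echo "$name: ok"
        else
            status=$?
            echo "$name: failed with status $status"
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every run succeeded.
    [ "$failed" -eq 0 ]
)
```

It could be called as `run_models ./run.R ${MODELS}`. By shell convention, a status above 128 means the process was killed by signal number (status - 128), which might help narrow down what is terminating the chains.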