test-tube
Experiment version race condition error when using slurm
Sometimes test-tube tries to create an experiment version that already exists. We need to add a small delay to avoid the race condition.
A small delay would not be a proper fix for a race condition.
I ran into this same problem. The workaround I found is to set the Experiment.version attribute to the value of the --hpc_exp_number argument that gets passed to the script when it's called from SlurmCluster.optimize_parallel_cluster_gpu(). Since the next_trial_version is read from a single process before the sbatch scripts are enqueued to run in parallel, it won't hit the race condition.
So, for example, in the pytorch_hpc_example, I'd add between lines 41-42:
parser.add_argument('--hpc_exp_number', type=int)
And then, between lines 18-19:
version=hparams.hpc_exp_number
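Put together, the two changes could look roughly like the sketch below. It uses plain argparse as a stand-in for the example's parser so it runs without test-tube installed. The helper names (build_parser, experiment_kwargs) and the name/save_dir values are placeholders I made up for illustration, and I'm assuming Experiment accepts name, save_dir and version keyword arguments, so check that against the version you have installed.

```python
import argparse


def build_parser():
    # Stand-in for the example script's parser (hypothetical helper).
    parser = argparse.ArgumentParser()
    # Experiment number passed in by the cluster launcher; None when run locally.
    parser.add_argument('--hpc_exp_number', type=int, default=None)
    return parser


def experiment_kwargs(hparams, name='hpc_example', save_dir='./logs'):
    # Build the keyword arguments for Experiment (name and save_dir are placeholders).
    kwargs = {'name': name, 'save_dir': save_dir}
    if hparams.hpc_exp_number is not None:
        # Reuse the number chosen by the single launching process instead of
        # letting each parallel job pick its own next version.
        kwargs['version'] = hparams.hpc_exp_number
    return kwargs


if __name__ == '__main__':
    hparams = build_parser().parse_args()
    print(experiment_kwargs(hparams))
    # In the real script (assumed usage):
    #   from test_tube import Experiment
    #   exp = Experiment(**experiment_kwargs(hparams))
```

When the script is run without --hpc_exp_number, no version is passed, so the experiment should fall back to whatever version it would normally pick.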
There's probably a better way that handles this automatically, but in the meantime this is the solution I found. I'll open a PR if I find a better way to do it. What do you think @williamFalcon?
Anyway, I hope this helps!