
Experiment version race condition error when using slurm

Open williamFalcon opened this issue 6 years ago • 2 comments

Occasionally, test-tube tries to create an experiment version that already exists. We need to add a small delay to avoid the race condition.

williamFalcon, Nov 30 '18 23:11

A small delay would not be a proper fix for a race condition.
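
To illustrate what a delay-free approach could look like, here is a minimal sketch that claims a version by creating its directory and retrying on collision. It is not test-tube's implementation: the function name `claim_next_version` and the `version_N` directory layout are assumptions made for this example, and it relies on the file system making directory creation atomic.

```python
import os
import tempfile


def claim_next_version(save_dir: str, start: int = 0) -> int:
    """Reserve the lowest free version number by creating its directory.

    Hypothetical helper, not part of test-tube. os.mkdir either creates the
    directory or raises FileExistsError, so two processes cannot both
    succeed for the same version number.
    """
    os.makedirs(save_dir, exist_ok=True)
    version = start
    while True:
        path = os.path.join(save_dir, f"version_{version}")
        try:
            os.mkdir(path)  # fails if another process already claimed it
            return version
        except FileExistsError:
            version += 1  # taken, try the next number


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        first = claim_next_version(root)
        second = claim_next_version(root)
        print(first, second)
```

The point of the design is that checking and claiming happen in a single step, so there is no window between "read the next free version" and "create it" in which another job can pick the same number.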

artyompal, Dec 24 '18 18:12

I ran into this same problem. The workaround I found is to set the Experiment.version attribute to the value of the --hpc_exp_number argument that gets passed to the script when it's called from SlurmCluster.optimize_parallel_cluster_gpu(). Since the next_trial_version is read from a single process before the sbatch scripts are enqueued to run in parallel, it won't hit the race condition.

So, for example, in the pytorch_hpc_example, I'd add between lines 41-42:

parser.add_argument('--hpc_exp_number', type=int)

And then, between lines 18-19, in the arguments passed when creating the Experiment:

version=hparams.hpc_exp_number
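
Put together, the two changes might look roughly like this. It's only a sketch: I'm using a plain argparse.ArgumentParser as a stand-in for the parser in the example script, and parse_hparams and experiment_kwargs are hypothetical helper names, not anything from test-tube.

```python
import argparse


def parse_hparams(argv=None):
    """Parse only the argument relevant to the workaround.

    argparse.ArgumentParser stands in for the example script's own parser
    (an assumption made to keep this sketch self-contained).
    """
    parser = argparse.ArgumentParser()
    # Passed to the script when it is launched through SlurmCluster.
    parser.add_argument('--hpc_exp_number', type=int, default=None)
    # Ignore any other arguments the launcher adds.
    hparams, _unknown = parser.parse_known_args(argv)
    return hparams


def experiment_kwargs(hparams):
    """Build the keyword argument that pins the experiment version."""
    return {"version": hparams.hpc_exp_number}


if __name__ == "__main__":
    hparams = parse_hparams()
    # In the training script, these kwargs would be passed on when the
    # Experiment is created, alongside its other arguments.
    print(experiment_kwargs(hparams))
```

When the script is run without --hpc_exp_number, the value is None, so the version is simply not pinned by this code.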

There's probably a better way that handles this automatically, but in the meantime this is the solution I found. I'll open a PR if I find a better way to do it. What do you think @williamFalcon?

Anyway, I hope this helps!

oscmansan, Aug 17 '19 16:08