Introduce mutex-based experiment configuration to Grid Workers GPU

Open tkornuta-ibm opened this issue 6 years ago • 0 comments

Grid Trainers/Testers on GPU have a hardcoded sleep time (currently 3s). The sleep is needed because cuda-gpupick picks a free GPU only by checking the contexts running on a given device, so a GPU still looks free until the worker assigned to it has finished configuring.

The problem is that loading the configuration and configuring a given experiment might take longer than 3 seconds. We ran into exactly this when training multiple MAC/SMAC models on CLEVR/CoGenT.

For now we have increased the sleep time to 60 seconds (closes #29).

Desired solution

  1. introduce a "configuration_in_progress" mutex shared by basic and grid workers
  2. when a basic worker starts, it raises the "configuration_in_progress" mutex
  3. after spawning the process, the grid worker blocks on the "configuration_in_progress" mutex
  4. once the setup_configuration() method has finished, the basic worker lowers the "configuration_in_progress" mutex, which frees the grid worker to proceed (and potentially spawn the next worker)
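
The handshake in steps 1-4 could be sketched roughly as follows. This is only an illustration under several assumptions, not the mi-prometheus implementation: threads stand in for the worker processes, `basic_worker` and `run_grid` are hypothetical names, and `time.sleep` stands in for `setup_configuration()`. The sketch also deviates from step 2 in one respect: the grid worker raises the mutex on the worker's behalf before spawning it, which avoids a window in which the worker is already running but has not raised the mutex yet.

```python
import threading
import time


def basic_worker(name, setup_seconds, configuration_in_progress, configured):
    """Hypothetical basic worker: configure first, then train/test."""
    # Stand-in for setup_configuration() (loading configs, claiming the GPU).
    time.sleep(setup_seconds)
    configured.append(name)
    # Lower the mutex: the grid worker may now spawn the next worker.
    configuration_in_progress.release()
    # The actual training or testing would continue from here.


def run_grid(experiments, timeout=60.0):
    """Spawn one worker per (name, setup_seconds) pair, one configuration at a time."""
    configuration_in_progress = threading.Lock()
    configured = []  # names in the order in which workers finished configuring
    workers = []
    for name, setup_seconds in experiments:
        # Blocks until the previous worker has lowered the mutex, then raises
        # it again for the worker about to be spawned.
        if not configuration_in_progress.acquire(timeout=timeout):
            raise TimeoutError(f"previous worker still configuring after {timeout}s")
        worker = threading.Thread(
            target=basic_worker,
            args=(name, setup_seconds, configuration_in_progress, configured),
        )
        worker.start()
        workers.append(worker)
    for worker in workers:
        worker.join()
    return configured


if __name__ == "__main__":
    print(run_grid([("mac", 0.2), ("smac", 0.05), ("baseline", 0.0)]))
```

With the mutex in place the workers finish configuring in spawn order (`mac`, `smac`, `baseline`) even though the first one takes longest, and the grid worker waits only as long as each configuration actually needs instead of a fixed sleep. The timeout is a safety net for a worker that crashes before lowering the mutex. For workers launched as separate processes, the mutex would have to be an inter-process primitive (for example a `multiprocessing.Lock` or a lock file), which this sketch does not cover.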

tkornuta-ibm, Oct 29 '18 23:10