mi-prometheus Introduce mutex-based experiment configuration to Grid Workers GPU

Open • tkornuta-ibm opened this issue 6 years ago • 0 comments

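Grid Trainers/Testers on GPU have a hardcoded sleep time (currently 3 s) between spawning consecutive workers. The sleep is needed because cuda-gpupick decides whether a GPU is free only by checking the contexts running on that device, so a worker that has started but has not yet created its CUDA context leaves its GPU looking free.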

The problem is that loading the configuration and setting up a given experiment can take longer than 3 seconds, in which case the next worker may be assigned the same GPU. We ran into exactly this when training multiple MAC/SMAC models on CLEVR/CoGenT.
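To make the race concrete, here is a small deterministic simulation. It is a sketch under stated assumptions, not mi-prometheus code: `simulate_fixed_sleep` is a hypothetical helper, cuda-gpupick is modelled as "take the lowest-index GPU with no context", and each worker is assumed to create its CUDA context only once its configuration is finished.

```python
from typing import List


def simulate_fixed_sleep(num_workers: int, num_gpus: int,
                         spawn_delay: float, config_time: float) -> List[int]:
    """Return the GPU index each worker would get under a fixed spawn delay.

    Hypothetical model: worker i is spawned at time i * spawn_delay and
    immediately picks the lowest-index GPU without a CUDA context. A worker
    only creates its context after configuring, at spawn_time + config_time.
    """
    assignments: List[int] = []
    spawn_times: List[float] = []
    for i in range(num_workers):
        now = i * spawn_delay
        # GPUs on which an earlier worker has already created its context.
        busy = {gpu for gpu, started in zip(assignments, spawn_times)
                if started + config_time <= now}
        free = [gpu for gpu in range(num_gpus) if gpu not in busy]
        if not free:
            raise RuntimeError("no free GPU")
        assignments.append(free[0])
        spawn_times.append(now)
    return assignments


print(simulate_fixed_sleep(3, 3, spawn_delay=3.0, config_time=2.0))   # [0, 1, 2]
print(simulate_fixed_sleep(3, 3, spawn_delay=3.0, config_time=10.0))  # [0, 0, 0]
```

With a 3 s delay and a 10 s configuration, all three workers land on GPU 0. A longer delay only helps as long as configuration stays shorter than the delay.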

For now, we have increased the sleep time to 60 seconds (Closes #29).

Desired solution

- introduce a "configuration_in_progress" mutex shared by the basic and grid workers
- when a basic worker starts, it raises the "configuration_in_progress" mutex
- after spawning the process, the grid worker blocks on the "configuration_in_progress" mutex
- once the setup_configuration() method has finished, the basic worker lowers the "configuration_in_progress" mutex, which frees the grid worker to proceed (and, potentially, to spawn the next worker)
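The steps above can be sketched with Python threads. This is a hypothetical illustration, not mi-prometheus code: `FakeGpuPool`, `basic_worker` and `grid_worker` are made-up names, threads stand in for the spawned processes, `time.sleep` stands in for setup_configuration(), and a `threading.Event` plays the role of the "configuration_in_progress" mutex (cleared means raised, set means lowered). One deliberate deviation: the grid worker raises the flag itself before spawning, rather than the basic worker raising it on start-up, so the grid worker can never check the flag before the child has had a chance to raise it.

```python
import threading
import time
from typing import Dict, List


class FakeGpuPool:
    """Hypothetical stand-in for cuda-gpupick: a GPU counts as free only if
    no worker has created a CUDA context on it yet."""

    def __init__(self, num_gpus: int) -> None:
        self._num_gpus = num_gpus
        self._contexts: Dict[int, int] = {}  # GPU index -> worker id
        self._lock = threading.Lock()

    def pick_free_gpu(self) -> int:
        with self._lock:
            for gpu in range(self._num_gpus):
                if gpu not in self._contexts:
                    return gpu
        raise RuntimeError("no free GPU")

    def create_context(self, gpu: int, worker_id: int) -> None:
        with self._lock:
            self._contexts[gpu] = worker_id


def basic_worker(worker_id: int, pool: FakeGpuPool, config_time: float,
                 configuration_done: threading.Event,
                 assignments: List[int]) -> None:
    gpu = pool.pick_free_gpu()           # plays the role of cuda-gpupick
    assignments[worker_id] = gpu
    time.sleep(config_time)              # stands in for setup_configuration()
    pool.create_context(gpu, worker_id)  # assumed: CUDA context exists once setup is done
    configuration_done.set()             # "lower" configuration_in_progress
    # ... training/testing would continue here ...


def grid_worker(num_workers: int, num_gpus: int, config_time: float,
                timeout: float = 600.0) -> List[int]:
    pool = FakeGpuPool(num_gpus)
    assignments = [-1] * num_workers
    threads = []
    for worker_id in range(num_workers):
        # A fresh, cleared Event means "configuration_in_progress" is raised.
        configuration_done = threading.Event()
        thread = threading.Thread(
            target=basic_worker,
            args=(worker_id, pool, config_time, configuration_done, assignments))
        thread.start()
        threads.append(thread)
        # Block until this worker has configured itself, instead of a fixed sleep.
        if not configuration_done.wait(timeout):
            raise RuntimeError(f"worker {worker_id} did not finish configuring")
    for thread in threads:
        thread.join()
    return assignments


if __name__ == "__main__":
    print(grid_worker(3, 3, config_time=0.1))  # [0, 1, 2]
```

Because the real workers are separate processes, an actual implementation would need a cross-process primitive (for example a `multiprocessing.Event`, or a lock file if the workers are launched from the command line) rather than a `threading.Event`. The timeout keeps the grid worker from hanging forever if a basic worker dies before lowering the flag.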

Oct 29 '18 23:10 tkornuta-ibm