mi-prometheus
Introduce mutex-based experiment configuration to Grid Workers GPU
Grid Trainers/Testers on GPU have a hardcoded sleep time (currently 3 s). This is motivated by the fact that cuda-gpupick picks a free GPU only by checking the contexts running on a given device.
The problem is that loading the configuration and setting up a given experiment might take longer than 3 seconds. We ran into this when training multiple MAC/SMAC models on CLEVR/CoGenT.
For now we have increased the sleep time to 60 seconds (Closes #29).
Desired solution
- introduce a "configuration_in_progress" mutex to both basic and grid workers
- when a basic worker starts, it raises the "configuration_in_progress" mutex
- after spawning the process, the grid worker waits on the "configuration_in_progress" mutex
- after the setup_configuration() method finishes, the basic worker lowers the "configuration_in_progress" mutex, which frees the grid worker to proceed (and potentially spawn the next worker)
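
The handshake described above could look roughly like the sketch below. It is only an illustration under several assumptions: threads stand in for the worker processes so that the example is self-contained, `threading.Event` plays the role of the "configuration_in_progress" mutex (unset means configuration is in progress, set means it is finished), and `basic_worker`, `grid_worker`, and the `time.sleep` call standing in for setup_configuration() are hypothetical names, not the existing mi-prometheus code. With real separate processes, an inter-process primitive such as `multiprocessing.Event` or a lock file would be needed instead.

```python
import threading
import time


def basic_worker(name, config_done, log, setup_seconds):
    """Hypothetical basic worker: configure, then signal the grid worker."""
    try:
        # Stand-in for setup_configuration(); may take arbitrarily long.
        time.sleep(setup_seconds)
        log.append(f"{name}: configured")
    finally:
        # Lower "configuration_in_progress" even if the setup fails,
        # so the grid worker is never blocked forever.
        config_done.set()
    # Training/testing would continue here.


def grid_worker(setup_times, timeout=5.0):
    """Hypothetical grid worker: spawn workers one at a time."""
    log = []
    threads = []
    for i, seconds in enumerate(setup_times):
        name = f"worker-{i}"
        # A fresh, unset event means "configuration_in_progress" is raised.
        config_done = threading.Event()
        log.append(f"{name}: spawned")
        thread = threading.Thread(
            target=basic_worker, args=(name, config_done, log, seconds)
        )
        thread.start()
        threads.append(thread)
        # Wait for the worker to finish configuring instead of sleeping
        # for a fixed time; the timeout is only a safety net.
        if not config_done.wait(timeout):
            log.append(f"{name}: timed out")
    for thread in threads:
        thread.join()
    return log


if __name__ == "__main__":
    for line in grid_worker([0.2, 0.1]):
        print(line)
```

Compared with a fixed sleep, the grid worker waits exactly as long as each configuration takes, and the timeout keeps a crashed worker from stalling the whole grid.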