Mathieu Germain
Most of this really needs to be thought through in parallel with the JobManager #91. - [ ] Only give the base path to the command manager and let it handle...
This will allow us to add features such as #10, #93 and more. The status of a batch can be Running, Done, or Stopped. Here is an idea of what the option tree...
Create a JobManager in the style of the CommandManager that would probably include the functionality of job_generator/job_generator_factory and some parts of scripts/smart_dispatch.py. This might be a bigger task than it seems,...
As alluded to in #86, add the possibility to manage queues. - queue - info (QNAME | All) - add QNAME CORESPERNODE GPUSPERNODE RAMPERNODE MAXWALLTIME DEFAULTMODULES NODESINQ MINPPN - delete...
If the user code does not have built-in resume, the current behaviour will be a problem. The default behaviour of the worker should be to run one command and then...
We should check whether the controller actually launched before starting the workers. Imagine the case where a controller is already running on the same port you are trying to use,...
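A minimal sketch of how such a check could look, assuming the controller is started as a child process and listens on a TCP port. The helper names `port_is_free` and `still_running` are hypothetical and are not part of Platoon; the bind test is only a rough pre-check, since another process could still take the port between the check and the launch.

```python
import socket
import subprocess


def port_is_free(port, host="127.0.0.1"):
    """Return True if we can bind (host, port), i.e. nothing else holds it.

    Hypothetical helper: a pre-check only, not a guarantee.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def still_running(proc, grace=1.0):
    """Return True if `proc` has not exited within `grace` seconds.

    Hypothetical helper: a controller that fails to bind its port is
    assumed to exit shortly after start, so an early exit is treated
    as a failed launch.
    """
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        return True
    return False
```

A launcher could call `port_is_free` before spawning the controller, then start the workers only if `still_running` returns True for the controller process.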
In the case where the Controller manages the mini-batches but the Worker decides when to sync with the global parameters, you can encounter the problem where the Worker is waiting...
If you simply use `gpu` multiple times when using the launcher, as in `platoon-launcher exp gpu gpu gpu`, all the workers will try to write to the same file.
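One possible way to avoid the collision is to include the worker's position in the device list in the file name. The sketch below is only an illustration: `worker_log_names` and the `worker_<index>_<device>.out` pattern are hypothetical and do not reflect the launcher's actual naming scheme.

```python
def worker_log_names(devices, prefix="worker"):
    """Build one distinct output file name per worker.

    Hypothetical helper: the index makes each name unique even when
    the same device string (e.g. "gpu") is repeated.
    """
    return [
        "{}_{}_{}.out".format(prefix, index, device)
        for index, device in enumerate(devices)
    ]


names = worker_log_names(["gpu", "gpu", "gpu"])
```

With three identical `gpu` arguments, the three workers get three different file names instead of sharing one.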
In the controller, the process sending mini-batches terminates as soon as it is done sending batches, destroying the buffer in the process and dropping the last few mini-batches. I tried...
It's probably not a real issue, but Platoon will not work on Windows because we are using `posix_ipc`, which is not compatible, and I think the way we use `cffi`...