Problem with mini-batches in Controller and a fixed number of mini-batches in Worker before sync.

Open mgermain opened this issue 8 years ago • 4 comments

When the Controller manages the mini-batches but the Worker decides when to sync with the global parameters, the Worker can end up waiting for more mini-batches before doing a sync when none are available.

A possible fix for this would be to let the Controller decide when a Worker should sync.
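To make the proposed fix concrete, here is a minimal sketch of a Controller that decides when a Worker should sync. `ToyController` and `next_batch` are hypothetical names used only for illustration and are not part of Platoon's API. The assumption is that the Controller answers each request with a mini-batch plus a sync flag, and that it asks for one last sync once its pool is empty so no Worker is left waiting.

```python
class ToyController:
    """Hypothetical controller that tells each worker when to sync."""

    def __init__(self, batches, sync_every):
        self._batches = list(batches)   # finite pool of mini-batches
        self._sync_every = sync_every   # sync period, in mini-batches
        self._since_sync = {}           # worker id -> batches since last sync

    def next_batch(self, worker_id):
        """Return (batch, should_sync); batch is None once the pool is empty."""
        if not self._batches:
            # Pool exhausted: request one final sync if work is still pending.
            pending = self._since_sync.get(worker_id, 0)
            self._since_sync[worker_id] = 0
            return None, pending > 0
        batch = self._batches.pop(0)
        count = self._since_sync.get(worker_id, 0) + 1
        should_sync = count == self._sync_every
        self._since_sync[worker_id] = 0 if should_sync else count
        return batch, should_sync


controller = ToyController(range(5), sync_every=2)
replies = [controller.next_batch("worker-0") for _ in range(7)]
```

With this shape the Worker never counts mini-batches itself; it only reacts to the flag and stops when it receives `None`.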

mgermain avatar Jan 13 '16 16:01 mgermain

This seems like a convoluted and constructed scenario.

Having the minibatch dispatch and the controller in the same process should probably not even be supported: it is super slow anyway, because the two tasks keep blocking each other due to the ZeroMQ design.

abergeron avatar Feb 04 '16 20:02 abergeron

I'm talking about a simple basic use case, in separate processes and all.

Let's say you have 22 mini-batches total and 2 workers that sync every 10 mini-batches. They each ask for the first 10 mini-batches and sync, then ask for 1 or 2 more and hang waiting for the next 8 mini-batches before syncing. After the socket timeout, the worker will just crash.
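The arithmetic can be checked with a small simulation. This is only a sketch: `simulate` is a hypothetical helper, not Platoon code, and it assumes the mini-batches are handed out round-robin, so each worker ends up with exactly 1 leftover mini-batch (with a different dispatch order one worker could get both).

```python
def simulate(total_batches, n_workers, sync_every):
    """Hand out a finite pool round-robin (an assumption for illustration).

    Returns (syncs, pending): completed syncs per worker, and mini-batches
    each worker processed after its last sync when the pool ran dry.
    """
    remaining = total_batches
    syncs = [0] * n_workers
    pending = [0] * n_workers
    while remaining > 0:
        for w in range(n_workers):
            if remaining == 0:
                break
            remaining -= 1
            pending[w] += 1
            if pending[w] == sync_every:
                syncs[w] += 1
                pending[w] = 0
    return syncs, pending


syncs, pending = simulate(total_batches=22, n_workers=2, sync_every=10)
# Any worker with 0 < pending < sync_every is stuck waiting for more data.
stuck = [w for w, p in enumerate(pending) if 0 < p < 10]
```

With 20 mini-batches instead of 22, `pending` is zero for both workers, which is the case where the supply fully satisfies each worker.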

There are ways around this but I think this should be easier to use or better-documented somehow.

mgermain avatar Feb 04 '16 20:02 mgermain

The minibatch server should not have a limited supply. Or, if it is limited, it should be enough to fully satisfy each worker.

I don't think we should support any other use case.

abergeron avatar Feb 04 '16 20:02 abergeron

So maybe only allow the minibatch server to send 10 minibatches at a time, so the last 2 minibatches won't be used?

We should at least document this limitation. I don't think it is a priority to have a better fix if the worker crashes due to a timeout.
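As a rough sketch of that idea, assuming the server hands out mini-batches only in whole blocks of one sync period, the usable count is the total rounded down to a multiple of the period. `usable_minibatches` is a hypothetical helper, not something Platoon provides.

```python
def usable_minibatches(total_batches, sync_every):
    """Return (usable, dropped) when serving only whole sync-sized blocks."""
    blocks, dropped = divmod(total_batches, sync_every)
    return blocks * sync_every, dropped


usable, dropped = usable_minibatches(total_batches=22, sync_every=10)
```

For the 22 mini-batch example this keeps 20 and drops the last 2, so every worker that starts a block can finish it and sync.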

nouiz avatar Feb 05 '16 15:02 nouiz