AutoML 2.0: Distributed learning / potential "hack" using the checkpoint CSV on a network drive?
I could not find any docs or plans about distributed learning. If none exist yet, are there any ideas for how it could be implemented, or links to common approaches?
AutoML 2.0 seems to have a nice checkpoint CSV file. Would it be possible to make a "hack" using the checkpoint file? Is there a tuner that would allow me to use it for distributed tuning? Or would setting a different MLContext seed on each machine perhaps do it?
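For the seed idea, a minimal sketch could derive a distinct, stable seed from the machine name so each client's RNG differs; this is just an illustration, and it assumes the tuner actually draws on the MLContext seed:

```csharp
using System;
using System.Linq;
using Microsoft.ML;

// Sketch: a per-machine seed so clients don't all propose the same trials.
// Assumption: the tuner in use honours the MLContext seed.
int clientSeed = Environment.MachineName
    .Aggregate(17, (hash, ch) => hash * 31 + ch) & 0x7FFFFFFF;
var mlContext = new MLContext(seed: clientSeed);
```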
I am thinking something like this might work with minimal source editing (a rough client-loop sketch follows the list):
- Point the checkpoint file to a network drive
- Make each experiment run only 1 trial, then restart (so the current status gets re-read), or edit the source to re-read the checkpoint file
- Run multiple clients simultaneously
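A rough client loop under that design might look like the sketch below. The network-share path is an assumption, and `RunSingleTrialAsync` is a hypothetical placeholder for however a single trial would actually be launched (e.g. a short-lived experiment run against the local checkpoint copy); none of this is existing AutoML behaviour.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

class DistributedClient
{
    // Assumed shared location for the checkpoint CSV.
    const string SharedCheckpoint = @"\\fileserver\automl\checkpoint.csv";

    static readonly string LocalCheckpoint =
        Path.Combine(Path.GetTempPath(), $"checkpoint-{Environment.MachineName}.csv");

    static async Task Main()
    {
        while (true) // run until stopped externally
        {
            // 1. Pull the latest shared checkpoint so this trial sees all finished trials.
            if (File.Exists(SharedCheckpoint))
                File.Copy(SharedCheckpoint, LocalCheckpoint, overwrite: true);

            // 2. Run exactly one trial against the local checkpoint copy
            //    (placeholder for an experiment configured for a single trial).
            await RunSingleTrialAsync(LocalCheckpoint);

            // 3. Publish the updated checkpoint back to the share.
            //    This needs locking or per-client file names; see below.
            File.Copy(LocalCheckpoint, SharedCheckpoint, overwrite: true);
        }
    }

    static Task RunSingleTrialAsync(string checkpointPath)
    {
        // Hypothetical: wire up MLContext + AutoML here, limited to one trial.
        throw new NotImplementedException();
    }
}
```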
Potential problems I could expect:
- Some type of randomness is probably needed from the tuner, otherwise all clients might run a trial on the same parameters
- Locks are probably needed when saving to the CSV (a retry-based locking sketch follows this list)
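One generic way to serialize writers on a shared CSV is to open the file exclusively and retry on contention. This is plain .NET file I/O, not anything AutoML does today:

```csharp
using System;
using System.IO;
using System.Threading;

static class CheckpointWriter
{
    // Append a trial row to the shared checkpoint, retrying while another
    // client holds the file open (FileShare.None gives an exclusive lock).
    public static void AppendLine(string path, string csvLine, int maxAttempts = 50)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            try
            {
                using var stream = new FileStream(
                    path, FileMode.Append, FileAccess.Write, FileShare.None);
                using var writer = new StreamWriter(stream);
                writer.WriteLine(csvLine);
                return;
            }
            catch (IOException)
            {
                Thread.Sleep(100); // another client is writing; back off and retry
            }
        }
        throw new IOException($"Could not acquire lock on {path}");
    }
}
```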
@LittleLittleCloud thoughts on this?
That's an interesting idea. The checkpoint file would be used to sync status among the training clients. The remaining work items for the clients and the orchestrator are:
- clients: accept the latest checkpoint file and the training/validation dataset from the orchestrator, and upload training artifacts
- orchestrator: update the checkpoint file based on the clients' checkpoint files, collect training artifacts from the clients, and start the next round once all clients have finished the current training round.
For randomness, we probably want to use a different MLContext seed for each client. As for adding a lock to the .csv, I personally feel it might not be necessary, since each client can save its .csv under a different name (e.g. `checkpoint-{client-name}.csv`). When the orchestrator collects the clients' checkpoints, it can simply read them all and merge them into a new checkpoint to start the next round (a naive merge sketch is below).
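To illustrate the per-client-file alternative, a naive merge on the orchestrator side could look like this. It assumes every client writes the same header row and one trial per line; the file naming follows the `checkpoint-{client-name}.csv` convention above.

```csharp
using System.IO;
using System.Linq;

static class CheckpointMerger
{
    // Merge all per-client checkpoints in a folder into a single file,
    // keeping one header row and de-duplicating identical trial rows.
    public static void Merge(string folder, string mergedPath)
    {
        var clientFiles = Directory.GetFiles(folder, "checkpoint-*.csv");
        var allLines = clientFiles.SelectMany(File.ReadAllLines).ToList();
        if (allLines.Count == 0) return;

        var header = allLines[0];                       // assumed identical across clients
        var rows = allLines.Where(l => l != header).Distinct();

        File.WriteAllLines(mergedPath, new[] { header }.Concat(rows));
    }
}
```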