AutoML 2.0: Distributed learning / potential "hack" using the checkpoint CSV on a network drive?
I could not find any docs or plans about distributed learning. If none exist yet, are there any ideas for how it could be implemented, or links to common approaches?
AutoML 2.0 seems to have a nice checkpoint CSV file. Would it be possible to make a "hack" using the checkpoint file? Is there a tuner that would allow me to use it for distributed tuning? Or would setting a different MLContext seed on each machine perhaps do it?
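For the seed idea, a minimal sketch could derive a distinct, stable seed from the machine name so each client's RNG differs; this is just an illustration, and it assumes the tuner actually draws on the MLContext seed:

```csharp
using System;
using System.Linq;
using Microsoft.ML;

// Sketch: a per-machine seed so clients don't all propose the same trials.
// Assumption: the tuner in use honours the MLContext seed.
int clientSeed = Environment.MachineName
    .Aggregate(17, (hash, ch) => hash * 31 + ch) & 0x7FFFFFFF;
var mlContext = new MLContext(seed: clientSeed);
```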
I am thinking something like this might work with minimal source editing (a rough client-loop sketch follows the list):
- Point the checkpoint file to a network drive
- Make each experiment run only 1 trial, then restart (so the current status gets re-read), or edit the source to re-read the checkpoint file
- Run multiple clients simultaneously
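A rough client loop under that design might look like the sketch below. The network-share path is an assumption, and `RunSingleTrialAsync` is a hypothetical placeholder for however a single trial would actually be launched (e.g. a short-lived experiment run against the local checkpoint copy); none of this is existing AutoML behaviour.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

class DistributedClient
{
    // Assumed shared location for the checkpoint CSV.
    const string SharedCheckpoint = @"\\fileserver\automl\checkpoint.csv";

    static readonly string LocalCheckpoint =
        Path.Combine(Path.GetTempPath(), $"checkpoint-{Environment.MachineName}.csv");

    static async Task Main()
    {
        while (true) // run until stopped externally
        {
            // 1. Pull the latest shared checkpoint so this trial sees all finished trials.
            if (File.Exists(SharedCheckpoint))
                File.Copy(SharedCheckpoint, LocalCheckpoint, overwrite: true);

            // 2. Run exactly one trial against the local checkpoint copy
            //    (placeholder for an experiment configured for a single trial).
            await RunSingleTrialAsync(LocalCheckpoint);

            // 3. Publish the updated checkpoint back to the share.
            //    This needs locking or per-client file names; see below.
            File.Copy(LocalCheckpoint, SharedCheckpoint, overwrite: true);
        }
    }

    static Task RunSingleTrialAsync(string checkpointPath)
    {
        // Hypothetical: wire up MLContext + AutoML here, limited to one trial.
        throw new NotImplementedException();
    }
}
```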
Potential problems I could expect:
- Some type of randomness is probably needed from the tuner, otherwise all clients might run a trial on the same parameters
- Locks are probably needed when saving to the CSV (a retry-based locking sketch follows this list)
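One generic way to serialize writers on a shared CSV is to open the file exclusively and retry on contention. This is plain .NET file I/O, not anything AutoML does today:

```csharp
using System;
using System.IO;
using System.Threading;

static class CheckpointWriter
{
    // Append a trial row to the shared checkpoint, retrying while another
    // client holds the file open (FileShare.None gives an exclusive lock).
    public static void AppendLine(string path, string csvLine, int maxAttempts = 50)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            try
            {
                using var stream = new FileStream(
                    path, FileMode.Append, FileAccess.Write, FileShare.None);
                using var writer = new StreamWriter(stream);
                writer.WriteLine(csvLine);
                return;
            }
            catch (IOException)
            {
                Thread.Sleep(100); // another client is writing; back off and retry
            }
        }
        throw new IOException($"Could not acquire lock on {path}");
    }
}
```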
@LittleLittleCloud thoughts on this?
That's an interesting idea. The checkpoint file would be used to sync status among the training clients. The remaining work items for the clients and the orchestrator are:
- clients: accept the latest checkpoint file and the training/validation dataset from the orchestrator, and upload training artifacts
- orchestrator: update the checkpoint file based on the clients' checkpoint files, collect training artifacts from the clients, and start the next round once all clients have finished the current training round.
For randomness, we probably want to use a different MLContext seed for each client. As for adding a lock to the .csv, I personally feel it might not be necessary, since each client can save its .csv under a different name (e.g. `checkpoint-{client-name}.csv`). When the orchestrator collects the clients' checkpoints, it can simply read them all and merge them into a new checkpoint to start the next round (a naive merge sketch is below).
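To illustrate the per-client-file alternative, a naive merge on the orchestrator side could look like this. It assumes every client writes the same header row and one trial per line; the file naming follows the `checkpoint-{client-name}.csv` convention above.

```csharp
using System.IO;
using System.Linq;

static class CheckpointMerger
{
    // Merge all per-client checkpoints in a folder into a single file,
    // keeping one header row and de-duplicating identical trial rows.
    public static void Merge(string folder, string mergedPath)
    {
        var clientFiles = Directory.GetFiles(folder, "checkpoint-*.csv");
        var allLines = clientFiles.SelectMany(File.ReadAllLines).ToList();
        if (allLines.Count == 0) return;

        var header = allLines[0];                       // assumed identical across clients
        var rows = allLines.Where(l => l != header).Distinct();

        File.WriteAllLines(mergedPath, new[] { header }.Concat(rows));
    }
}
```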