NeuralPDE.jl

[WIP] Hyperparameter sweeps and experiment manager

Open zoemcc opened this issue 2 years ago • 4 comments

Very WIP, I'm pulling in and merging prototypes of the various components into the main repo.

This depends on the adaptive reweighting work, so that should be finalized and merged first.

zoemcc, Feb 16 '22 22:02

This works! I want to do some cleanup of hardcoded log locations and make sure the interface is right before we merge and release it, so I'll request a review next week sometime (and actually add real tests to it instead of just a test case). But I have full hyperparameter sweep & experiment manager running on a simple test pde_system case, locally. I haven't tested with remote workers yet but it shouldn't be any different.

This assumes that the remote environments are set up with the correct packages and things by the user, not programmatically by the experiment manager. I'm not sure how to easily do that programmatically, to be honest. I'll probably try to figure that out on Supercloud so that it's easy for other users to set up. Something like scping a bunch of directories related to julia environments/packages?

The current setup runs a main loop that asynchronously allocates hyperparameter settings to the workers as experiments, then checks each worker to see whether its experiment has finished. If it has, the loop grabs the worker's tensorboard logs as a vector of file locations and file contents, saves them locally, and removes the experiment from the in-progress data structure.
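
A minimal sketch of the loop described above, not the code in this PR. It is written in Python, with the standard library's `concurrent.futures` standing in for Julia's `Distributed` workers, purely to illustrate the control flow. `run_experiment`, `sweep`, the `lr`/`width` grid, and the log file names are all hypothetical, and a real run would return tensorboard event files rather than a one-line text log.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from pathlib import Path
import tempfile
import time


def run_experiment(params):
    """Hypothetical stand-in for one training run.

    Returns the run's logs as a list of (relative file path, file contents) pairs.
    """
    name = f"lr={params['lr']}_width={params['width']}"
    contents = f"final_loss={params['lr'] * params['width']:.4f}\n".encode()
    return [(f"{name}/events.log", contents)]


def sweep(grid, log_root, n_workers=4, poll_interval=0.01):
    """Run every combination in `grid`, save each run's logs under `log_root`,
    and return the number of finished experiments."""
    pending = [dict(zip(grid, values)) for values in product(*grid.values())]
    in_progress = {}  # worker slot -> Future of the experiment running there
    finished = 0
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        while pending or in_progress:
            # Hand a new experiment to every idle worker slot.
            for slot in range(n_workers):
                if slot not in in_progress and pending:
                    in_progress[slot] = pool.submit(run_experiment, pending.pop())
            # Check each busy slot; collect logs from the ones that are done.
            for slot, future in list(in_progress.items()):
                if future.done():
                    for rel_path, contents in future.result():
                        target = Path(log_root) / rel_path
                        target.parent.mkdir(parents=True, exist_ok=True)
                        target.write_bytes(contents)
                    del in_progress[slot]  # the slot is free again
                    finished += 1
            time.sleep(poll_interval)
    return finished


if __name__ == "__main__":
    grid = {"lr": [1e-1, 1e-2, 1e-3, 1e-4], "width": [8, 16, 32, 64]}
    with tempfile.TemporaryDirectory() as log_root:
        print(sweep(grid, log_root), "experiments finished")
```

Polling with a short sleep keeps the manager simple: it never blocks on a single worker, so a slow experiment does not hold up log collection from the others.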

Here is a pretty log plot that I get out as an aggregate from the test case (64 runs with different hyperparameters).

[image: hyperparameter_run_first]

zoemcc, Feb 19 '22 00:02

This looks nice. Let's have @AnasAbdelR do a review after the other PR merges and this diff is cleaned up a bit by rebase to the new master.

ChrisRackauckas, Feb 28 '22 00:02

Is this going to merge? This is going to keep getting conflicts. Break it down.

ChrisRackauckas, Jun 23 '22 11:06

Yeah I'll clean it up and make it ready to merge after this next sprint.

zoemcc, Jun 23 '22 18:06