
Enable Offline ExistingExperiment

Open gauchm opened this issue 4 years ago • 6 comments

I am running my scripts on a SLURM-scheduled cluster where the compute nodes don't have internet access.

The training script works just fine: I can use an OfflineExperiment. But the subsequent test script (which also doesn't have internet access) is a problem: I'd like to continue the experiment from training, which I would normally do with an ExistingExperiment. But even if I upload the OfflineExperiment after training, I can't create an ExistingExperiment without internet connection.

tl;dr: I need an "OfflineExistingExperiment".

gauchm avatar Apr 17 '20 07:04 gauchm

@gauchm Thanks for the report. I think you can continue training with another OfflineExperiment, forcing the experiment key to be the previous one via the COMET_EXPERIMENT_KEY config variable and, if uploading, using comet upload --force-reupload ... I'm not sure about this, and you should test to see if there are any side effects. If that doesn't work (or has any bad side effects), let us know and we can find a solution.

dsblank avatar Apr 17 '20 11:04 dsblank
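One way to try the suggestion above from inside a Python script, rather than with a shell export, is to set the variable just before the experiment is created. The sketch below is only an illustration: the helper name forced_experiment_key is hypothetical, and it assumes comet_ml reads COMET_EXPERIMENT_KEY from the process environment when the experiment object is constructed. It does not establish that the second run continues the first.

```python
import os
from contextlib import contextmanager


@contextmanager
def forced_experiment_key(key):
    """Temporarily set COMET_EXPERIMENT_KEY, restoring the old value on exit.

    Hypothetical helper, not part of comet_ml.
    """
    previous = os.environ.get("COMET_EXPERIMENT_KEY")
    os.environ["COMET_EXPERIMENT_KEY"] = key
    try:
        yield key
    finally:
        # Put the environment back the way it was before the block.
        if previous is None:
            os.environ.pop("COMET_EXPERIMENT_KEY", None)
        else:
            os.environ["COMET_EXPERIMENT_KEY"] = previous


# Intended use (needs comet_ml installed; first_key is the key of the earlier run):
# with forced_experiment_key(first_key):
#     ex = comet_ml.OfflineExperiment(offline_directory="/tmp")
#     ex.log_other("qwer", 789)
#     ex.end()
```

Restoring the previous value keeps the forced key from leaking into unrelated experiments started later in the same process.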

I tried your suggestion like this:

import comet_ml

ex = comet_ml.OfflineExperiment(offline_directory='/tmp')
ex.log_other('asdf', 123)
print(ex.get_key())  # prints key like 2f492...
ex.end()

Then I ran export COMET_EXPERIMENT_KEY=2f492... and, after that:

import comet_ml

ex = comet_ml.OfflineExperiment(offline_directory='/tmp')
ex.log_other('qwer', 789)
ex.end()

What seems to happen is that the second experiment overwrites the first rather than continuing it. If I examine the created zip file via comet offline 2f492...zip, there's an entry for qwer but none for asdf.

If I upload the first experiment before running the second and then do a comet upload --force-reupload, the result is the same: the second experiment overwrites the first.

gauchm avatar Apr 17 '20 11:04 gauchm
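The check described above (is asdf still in the archive after the second run?) can also be scripted with the standard library, independently of the comet CLI. This is a minimal sketch that makes no assumption about the internal layout of the offline archive: it only lists the zip members and searches one member's text for a string. Both function names are hypothetical.

```python
import io
import zipfile


def list_archive_members(archive):
    """Return (name, uncompressed size in bytes) for every entry in a zip archive."""
    with zipfile.ZipFile(archive) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]


def member_contains(archive, member, needle):
    """Return True if `needle` occurs in the decoded text of one archive member."""
    with zipfile.ZipFile(archive) as zf:
        text = zf.read(member).decode("utf-8", errors="replace")
    return needle in text


# Intended use, with a path to the archive written by the offline experiment:
# for name, size in list_archive_members(path_to_zip):
#     print(name, size, member_contains(path_to_zip, name, "asdf"))
```

Running this on the archive after each run would show whether the first run's entries survive the second run.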

Thanks for trying this.

I'm making an issue for this so we can work on a solution.

dsblank avatar Apr 17 '20 11:04 dsblank

It looks like the data from the continuing experiment overwrites the first because the step values repeat. Is it possible for you to add an offset to your steps so that they pick up where the first experiment left off?

dsblank avatar Apr 23 '20 18:04 dsblank
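The step-offset idea above could be sketched as follows. This is only an illustration of the arithmetic: resumed_steps is a hypothetical helper, the last step of the first run has to be recorded by the user somewhere, and nothing in this thread confirms that non-overlapping steps prevent the overwrite.

```python
def resumed_steps(last_logged_step, num_new_steps):
    """Yield global step numbers that continue right after the previous run.

    Hypothetical helper: `last_logged_step` is the final step logged by the
    first experiment, `num_new_steps` is how many steps the second run takes.
    """
    start = last_logged_step + 1
    yield from range(start, start + num_new_steps)


# Intended use in the continuing run (needs comet_ml and an experiment `ex`):
# for step in resumed_steps(last_logged_step=99, num_new_steps=3):
#     ex.log_metric("loss", compute_loss(), step=step)  # compute_loss is a placeholder
```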

I added an offset via ex.set_step(), but it doesn't seem to help:

  • If I upload the first experiment before starting the second and then reupload with --force-reupload, the second experiment shows up as its own entry with a new experiment_key (even though I forced it with COMET_EXPERIMENT_KEY).
  • Looking at the zip file through comet offline after finishing the second experiment, I can see that whatever I logged in the first experiment is no longer there.

gauchm avatar Apr 28 '20 08:04 gauchm

Thank you for your report. Unfortunately, there is no way of having an OfflineExistingExperiment as of today.

We added your request to our roadmap and will keep you posted when we have a solution for it.

Lothiraldan avatar Apr 30 '20 13:04 Lothiraldan

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Nov 10 '23 21:11 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Nov 15 '23 21:11 github-actions[bot]