wandb-offline-sync-hook
Add a `frequency` parameter to `TriggerWandbSyncLightningCallback`
In order to synchronize every N epochs instead of after every epoch (the default), I added a `frequency` parameter to the Lightning callback.
I did not write any tests for this feature, mainly because I am not sure how you proceed (but also out of laziness 😅). Tell me if you want me to add a test for it.
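For context, here is a minimal sketch of the frequency gating I have in mind (not the actual code in `src/wandb_osh/lightning_hooks.py`; the parameter name, the choice of Lightning hook, and calling `TriggerWandbSyncHook()` with default arguments are assumptions here):

```python
# Hypothetical sketch only -- the real callback lives in
# src/wandb_osh/lightning_hooks.py and may differ in names and hook points.
from pytorch_lightning import Callback, LightningModule, Trainer

from wandb_osh.hooks import TriggerWandbSyncHook


class FrequencyGatedSyncCallback(Callback):
    def __init__(self, sync_every_n_epochs: int = 1) -> None:
        # How often (in epochs) the wandb-osh communication file is touched.
        self.sync_every_n_epochs = sync_every_n_epochs
        self._hook = TriggerWandbSyncHook()  # assumes the default constructor

    def on_validation_epoch_end(self, trainer: Trainer, pl_module: LightningModule) -> None:
        # Only trigger a sync on every N-th epoch instead of every epoch.
        if (trainer.current_epoch + 1) % self.sync_every_n_epochs == 0:
            self._hook()
```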
Review changes with SemanticDiff.
Analyzed 3 of 3 files.
Overall, the semantic diff is 51% smaller than the GitHub diff.
| Status | Filename | Diff size |
|---|---|---|
| :heavy_check_mark: | src/wandb_osh/hooks.py | 58.82% smaller |
| :heavy_check_mark: | src/wandb_osh/lightning_hooks.py | 26.87% smaller |
| :heavy_check_mark: | src/wandb_osh/ray_hooks.py | 74.45% smaller |
I like this; it would be an alternative way of solving #83 and much simpler than implementing multiprocessing.
As for tests, I think this is fine, I'd say it's simple enough :)
I made the changes. Still without testing 😇
I am 100% confident it works for the Lightning hook (TriggerWandbSyncLightningCallback), because I use such a setup in my work. However, I have never used Ray before and can't think of an easy way to test it. Can you have a look and tell me if I implemented something stupid?
Also, about #83, I think it will definitely help but won't solve the problem entirely (I can see that when I run big sweeps with 20 or 40 runs in parallel 😅).
I just realized that the hook() should also be called after the end of the last epoch, in case `sync_every_n_epochs` doesn't divide the total number of epochs. I don't think this is implemented yet.
I mention this because the following can happen:
- A run training for, say, 100 epochs with `sync_every_n_epochs` set to 3 will miss syncing the last epoch.
- Or (probably more common), a scheduler kills the run at epoch 51 while `sync_every_n_epochs` is set to 10, or even 5 or 2. In that case a few epochs won't be uploaded.
I see an easy fix for the Lightning callback: adding the on_test_epoch_end() hook with a call to TriggerWandbSyncHook that happens 100% of the time (e.g., by treating the epoch as if it were a multiple of `sync_every_n_epochs`; that's cheesy, but it will work).
But, again, I have no clue about Ray 😅
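Something like this is what I have in mind for the Lightning side (a hedged sketch only; I use `on_train_end()` here purely for illustration, the `on_test_epoch_end()` variant from above would work the same way):

```python
# Sketch of the "one unconditional sync at the end" idea; the class name and
# the chosen Lightning hook are illustrative, not the actual implementation.
from pytorch_lightning import Callback, LightningModule, Trainer

from wandb_osh.hooks import TriggerWandbSyncHook


class FinalSyncSketch(Callback):
    def __init__(self) -> None:
        self._hook = TriggerWandbSyncHook()

    def on_train_end(self, trainer: Trainer, pl_module: LightningModule) -> None:
        # Fire one last trigger regardless of sync_every_n_epochs, so the
        # final epochs are uploaded even if they fall between sync points.
        self._hook()
```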
Oh, you're 100% right, I should have thought of that as well. A totally different avenue (which might also help with syncing hyperparameters of runs that get killed, see #100) would be for the wandb-osh syncer tool to remember which runs it was syncing and, if it doesn't get triggered for an hour, run one final sync call (and we can catch KeyboardInterrupt to make sure we also do the final round if people kill wandb-osh). Of course, that's also just another silly workaround.
Let me think about that. We might in fact do both (and perhaps add a note to the ray one that it has this disadvantage).
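Roughly what I mean, as a sketch with hypothetical helper names (`poll_triggers` is a stand-in for however the syncer discovers new trigger files, and the one-hour timeout is arbitrary):

```python
# Sketch of the "final sync after inactivity / on Ctrl-C" workaround.
# `poll_triggers` is a hypothetical stand-in, not part of wandb-osh.
import subprocess
import time
from pathlib import Path

IDLE_TIMEOUT_S = 3600  # one hour without new triggers


def final_sync(run_dirs: set[Path]) -> None:
    # One last `wandb sync <dir>` for every run we have seen so far.
    for run_dir in run_dirs:
        subprocess.run(["wandb", "sync", str(run_dir)], check=False)


def watch(poll_triggers) -> None:
    seen: set[Path] = set()
    last_trigger = time.monotonic()
    try:
        while True:
            new_dirs = list(poll_triggers())  # run dirs with fresh trigger files
            if new_dirs:
                seen.update(new_dirs)
                last_trigger = time.monotonic()
            elif seen and time.monotonic() - last_trigger > IDLE_TIMEOUT_S:
                final_sync(seen)
                seen.clear()
            time.sleep(10)
    except KeyboardInterrupt:
        # Also do one final round if the user kills wandb-osh.
        final_sync(seen)
```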
That would be a pretty robust workaround indeed.
Should I add the on_test_epoch_end() fix I mentioned, and we close this PR?
Or should I just add a warning about this problem in the docs, and we fix it in another PR?
I dug into the problem I raised in #100, and it is deeper than just not syncing the config. I will detail what I have found so far in the issue. Anyway, it would be handy for me to be able to reduce the number of triggers to investigate when the problem happens.
@klieret should I push this temporary fix? Or do you want to solve it in this PR?
Sorry for my late replies! Let me take a look now :)
Could you give me push permission on the branch? (I made some fixes and added some tests.) You only have to check a box on this PR, I think.
The box is checked; it seems to be checked by default, btw.
Hi @klieret, any update about this PR?