wandb-offline-sync-hook
Add a `frequency` parameter to `TriggerWandbSyncLightningCallback`
In order to synchronize every N epochs instead of after every epoch (the default), I added a `frequency` parameter to the Lightning callback.
I did not write any tests for this feature, mainly because I am not sure how you proceed (but also out of laziness 😅). Tell me if you want me to add a test for it.
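For context, here is a minimal sketch of the frequency gating I have in mind (not the actual code in `src/wandb_osh/lightning_hooks.py`; the parameter name, the choice of Lightning hook, and calling `TriggerWandbSyncHook()` with default arguments are assumptions here):

```python
# Hypothetical sketch only -- the real callback lives in
# src/wandb_osh/lightning_hooks.py and may differ in names and hook points.
from pytorch_lightning import Callback, LightningModule, Trainer

from wandb_osh.hooks import TriggerWandbSyncHook


class FrequencyGatedSyncCallback(Callback):
    def __init__(self, sync_every_n_epochs: int = 1) -> None:
        # How often (in epochs) the wandb-osh communication file is touched.
        self.sync_every_n_epochs = sync_every_n_epochs
        self._hook = TriggerWandbSyncHook()  # assumes the default constructor

    def on_validation_epoch_end(self, trainer: Trainer, pl_module: LightningModule) -> None:
        # Only trigger a sync on every N-th epoch instead of every epoch.
        if (trainer.current_epoch + 1) % self.sync_every_n_epochs == 0:
            self._hook()
```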
Review changes with SemanticDiff.
Analyzed 3 of 3 files.
Overall, the semantic diff is 51% smaller than the GitHub diff.
| Status | Filename | Diff size |
|---|---|---|
| :heavy_check_mark: | src/wandb_osh/hooks.py | 58.82% smaller |
| :heavy_check_mark: | src/wandb_osh/lightning_hooks.py | 26.87% smaller |
| :heavy_check_mark: | src/wandb_osh/ray_hooks.py | 74.45% smaller |
I like this; it would be an alternative way of solving #83 and much simpler than implementing multiprocessing.
As for tests, I think this is fine, I'd say it's simple enough :)
I made the changes. Still without testing 😇
I am 100% confident it works for the Lightning hook (TriggerWandbSyncLightningCallback), because I use such a setup in my work. However, I have never used Ray before and can't think of an easy way to test it. Can you have a look and tell me if I implemented something stupid?
Also, about #83, I think it will definitely help but won't solve the problem entirely (I can see that when I run big sweeps with 20 or 40 runs in parallel 😅).
I just realized that the hook() should also be called after the end of the last epoch, in case `sync_every_n_epochs` doesn't divide the total number of epochs. I don't think this is implemented yet.
I mention this because the following can happen:
- A run training for, say, 100 epochs with `sync_every_n_epochs` set to 3 will miss syncing the last epoch.
- Or (probably more common), a scheduler kills the run at epoch 51 while `sync_every_n_epochs` is set to 10, or even 5 or 2. In that case a few epochs won't be uploaded.
I see an easy fix for the Lightning callback: adding the on_test_epoch_end() hook with a call to TriggerWandbSyncHook that happens 100% of the time (e.g., by treating the epoch as if it were a multiple of `sync_every_n_epochs`; that's cheesy, but it will work).
But, again, I have no clue about Ray 😅
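Something like this is what I have in mind for the Lightning side (a hedged sketch only; I use `on_train_end()` here purely for illustration, the `on_test_epoch_end()` variant from above would work the same way):

```python
# Sketch of the "one unconditional sync at the end" idea; the class name and
# the chosen Lightning hook are illustrative, not the actual implementation.
from pytorch_lightning import Callback, LightningModule, Trainer

from wandb_osh.hooks import TriggerWandbSyncHook


class FinalSyncSketch(Callback):
    def __init__(self) -> None:
        self._hook = TriggerWandbSyncHook()

    def on_train_end(self, trainer: Trainer, pl_module: LightningModule) -> None:
        # Fire one last trigger regardless of sync_every_n_epochs, so the
        # final epochs are uploaded even if they fall between sync points.
        self._hook()
```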
Oh, you're 100% right, I should have thought of that as well. A totally different avenue (which might also help with syncing hyperparameters of runs that get killed, see #100) would be for the wandb-osh syncer tool to remember which runs it was syncing and, if it doesn't get triggered for an hour, run one final sync call (and we can catch KeyboardInterrupt to make sure we also do the final round if people kill wandb-osh). Of course, that's also just another silly workaround.
Let me think about that. We might in fact do both (and perhaps add a note to the ray one that it has this disadvantage).
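Roughly what I mean, as a sketch with hypothetical helper names (`poll_triggers` is a stand-in for however the syncer discovers new trigger files, and the one-hour timeout is arbitrary):

```python
# Sketch of the "final sync after inactivity / on Ctrl-C" workaround.
# `poll_triggers` is a hypothetical stand-in, not part of wandb-osh.
import subprocess
import time
from pathlib import Path

IDLE_TIMEOUT_S = 3600  # one hour without new triggers


def final_sync(run_dirs: set[Path]) -> None:
    # One last `wandb sync <dir>` for every run we have seen so far.
    for run_dir in run_dirs:
        subprocess.run(["wandb", "sync", str(run_dir)], check=False)


def watch(poll_triggers) -> None:
    seen: set[Path] = set()
    last_trigger = time.monotonic()
    try:
        while True:
            new_dirs = list(poll_triggers())  # run dirs with fresh trigger files
            if new_dirs:
                seen.update(new_dirs)
                last_trigger = time.monotonic()
            elif seen and time.monotonic() - last_trigger > IDLE_TIMEOUT_S:
                final_sync(seen)
                seen.clear()
            time.sleep(10)
    except KeyboardInterrupt:
        # Also do one final round if the user kills wandb-osh.
        final_sync(seen)
```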
That would be a pretty robust workaround indeed.
Should I add the on_test_epoch_end() fix I mentioned, and we close this PR?
Or should I just add a warning about this problem in the docs, and we fix it in another PR?
I dug into the problem I raised in #100, and it is deeper than just not syncing the config. I will detail what I have found so far in the issue. Anyway, it would be handy for me to be able to reduce the number of triggers to investigate when the problem happens.
@klieret should I push this temporary fix? Or do you want to solve it in this PR?
Sorry for my late replies! Let me take a look now :)
Could you give me push permission on the branch? (I made some fixes and added some tests.) You only have to check a box on this PR, I think.
The box is checked; it seems to be checked by default, btw.
Hi @klieret, any update about this PR?