FedScale

Fix async

Open fanlai0990 opened this issue 2 years ago • 1 comments

Why are these changes needed?

  1. Model testing is missing.
  2. Model accuracy behaves strangely over the course of training.

Related issue number

Checks

  • [ ] I've included any doc changes needed for https://fedscale.readthedocs.io/en/latest/
  • [ ] I've made sure the following tests are passing.
  • Testing Configurations
    • [ ] Dry Run (20 training rounds & 1 evaluation round)
    • [ ] Cifar 10 (20 training rounds & 1 evaluation round)
    • [ ] Femnist (20 training rounds & 1 evaluation round)

fanlai0990 avatar Aug 12 '22 07:08 fanlai0990

Thank you for the fix!

ewenw avatar Aug 12 '22 19:08 ewenw

Thanks a lot for your feedback! This PR is still WIP, since we noticed that the weird accuracy issue has not been fully addressed (even though I rewrote almost the entire async code). I will fix it and address your comments ASAP. Thanks for your patience!

fanlai0990 avatar Aug 15 '22 16:08 fanlai0990

Hi @AmberLJC, do you have any TensorBoard results for async handy? It would be helpful to attach them to the PR. Thanks!

ewenw avatar Aug 24 '22 17:08 ewenw

image

AmberLJC avatar Aug 24 '22 18:08 AmberLJC

Do you know how the convergence over virtual clock time compares with synchronous training with similar parameters?

ewenw avatar Aug 24 '22 20:08 ewenw

Let me try it out. I previously had some results, but I want to make a fair comparison again.

AmberLJC avatar Aug 24 '22 20:08 AmberLJC

@AmberLJC Could you please help review and test the last commit? Training gets stuck because the model (version) is missing on the executors. Thanks!

fanlai0990 avatar Sep 02 '22 03:09 fanlai0990