torchtitan
only produce tensorboard logs on rank 0 by default
Stack from ghstack (oldest at bottom):
- -> #339
-
For TensorBoard metrics, we mostly care about loss, memory, and wps/mfu. Loss is all-reduced, so it is the same on all ranks; the other metrics are likely to be very similar across ranks. So by default it suffices to do TensorBoard logging only on rank 0, and the straggler effect from those writes should be small. Users can always toggle on all-rank logging for debugging purposes.
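The rank-0-only default described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `MetricLogger`, `build_metric_logger`, `should_log_on_this_rank`, and the `rank_0_only` / `writer_factory` parameters are hypothetical names, and the rank is assumed to come from the `RANK` environment variable that `torchrun` sets.

```python
import os


def should_log_on_this_rank(rank: int, rank_0_only: bool = True) -> bool:
    """Return True if this rank should write TensorBoard events."""
    return rank == 0 or not rank_0_only


class MetricLogger:
    """Hypothetical wrapper: logging is a no-op when no writer is attached."""

    def __init__(self, writer=None):
        self.writer = writer

    def log(self, metrics: dict, step: int) -> None:
        if self.writer is None:
            return  # this rank does not log
        for name, value in metrics.items():
            self.writer.add_scalar(name, value, step)


def build_metric_logger(log_dir, rank_0_only=True, rank=None, writer_factory=None):
    """Attach a writer only on ranks that should log (rank 0 by default)."""
    if rank is None:
        # Assumption: launched with torchrun, which exports RANK.
        rank = int(os.environ.get("RANK", "0"))
    if not should_log_on_this_rank(rank, rank_0_only):
        return MetricLogger(None)
    if writer_factory is None:
        # Imported lazily so non-logging ranks never touch TensorBoard.
        from torch.utils.tensorboard import SummaryWriter
        writer_factory = SummaryWriter
    return MetricLogger(writer_factory(log_dir))
```

Passing `rank_0_only=False` in this sketch corresponds to toggling on all-rank logging for debugging.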
-
Remove the `torch` dependency in `requirements.txt`, as it cannot work alone / is not used anyway. Currently we suggest that users install the latest nightly in the README, and we do so in all the CI tests.
Not sure why the 1D compile test is failing...