
only produce tensorboard logs on rank 0 by default

Open tianyu-l opened this issue 9 months ago • 1 comment

Stack from ghstack (oldest at bottom):

  • -> #339
  1. For TensorBoard metrics, we mostly care about loss, memory, and wps/MFU. Loss is all-reduced, so it is the same on all ranks; the other metrics are likely to be very similar across ranks. By default it therefore suffices to do TensorBoard logging on rank 0 only; the straggler effect of the TensorBoard writes on rank 0 should be small. Users can always toggle on all-rank logging for debugging purposes.

  2. Remove the torch dependency from requirements.txt, since it cannot work alone and is not used anyway. The README currently suggests that users install the latest nightly, and all the CI tests do the same.
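
The rank-0-only default in item 1 can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `MetricLogger`, `build_metric_logger`, `make_writer`, and the `rank_0_only` flag name are hypothetical, and the rank is assumed to come from the `RANK` environment variable that `torchrun` sets. The writer is injected as a plain callable so the sketch runs without TensorBoard installed.

```python
import os
from typing import Callable, Dict, Optional

# Signature of a scalar sink: (tag, value, step). With TensorBoard this could be
# torch.utils.tensorboard.SummaryWriter(log_dir).add_scalar (an assumption about
# how a real writer would be plugged in, not taken from this PR).
AddScalar = Callable[[str, float, int], None]


class MetricLogger:
    """Forwards scalars to a writer, or silently drops them if none is attached."""

    def __init__(self, add_scalar: Optional[AddScalar] = None) -> None:
        self._add_scalar = add_scalar

    def log(self, metrics: Dict[str, float], step: int) -> None:
        if self._add_scalar is None:
            return  # logging is disabled on this rank
        for tag, value in metrics.items():
            self._add_scalar(tag, value, step)


def build_metric_logger(
    make_writer: Callable[[int], AddScalar],
    rank_0_only: bool = True,  # hypothetical flag name
) -> MetricLogger:
    """Attach a writer on rank 0 only, unless all-rank logging is requested."""
    rank = int(os.environ.get("RANK", "0"))  # set by torchrun; 0 if unset
    if rank_0_only and rank != 0:
        return MetricLogger(None)
    return MetricLogger(make_writer(rank))
```

Every rank can then call `logger.log({"loss": loss, "wps": wps}, step)` unconditionally; with the default, only rank 0 writes anything, and passing `rank_0_only=False` restores per-rank logs for debugging.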

tianyu-l, May 16 '24 22:05

not sure why the 1D compile test is failing...

tianyu-l, May 17 '24 22:05