Kevin Yin
The same thing holds for Smoothing and Hover: [X/X Unified/Closest]... So I guess the real request is: all the Tasks in a Project have similar properties, and I would like configurations...
After exploring further, I've been able to clarify my thoughts. In ClearML, the charts in each task in each project are considered separately from the charts in every other task....
Wandb has this feature. It is important in spotting gradient instability, cyclic behaviors, and spikes in runtime.
Why does the fp32 -> bf16 autocast run have such poor loss curves? Shouldn't fp32 master weights + bf16 model weights at least be better than bf16 + stochastic rounding?
fp16 is used for diffusion models, so the lack of a gradient scaler is a blocker there. However, https://github.com/pytorch/pytorch/pull/116054 seems relevant, which would complicate the work.
Update: My own fp16 grad scaler is 50 lines. It doesn't handle the fancy mixed-device things that the PyTorch grad scaler does. It's somewhat faster by avoiding unnecessary kernel launches...
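For context, the rough shape is something like this (a simplified single-device sketch, not my actual code; class and argument names are just illustrative):

```python
import torch

class SimpleGradScaler:
    """Minimal fp16 loss scaler: scale the loss up, unscale grads with one
    foreach kernel, and skip the optimizer step if any grad is inf/nan."""

    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def scale_loss(self, loss):
        # Scale the loss so fp16 grads don't underflow during backward.
        return loss * self.scale

    def step(self, optimizer):
        grads = [p.grad for group in optimizer.param_groups
                 for p in group["params"] if p.grad is not None]
        # Unscale all grads with a single foreach kernel launch.
        torch._foreach_mul_(grads, 1.0 / self.scale)
        # One finiteness check over all the per-tensor norms.
        norms = torch.stack(torch._foreach_norm(grads))
        overflow = not torch.isfinite(norms).all().item()
        if overflow:
            # Overflow: skip the update and back off the scale.
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            optimizer.step()
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                self.scale *= self.growth_factor
                self._good_steps = 0
        return not overflow
```

Usage is the usual pattern: `scaler.scale_loss(loss).backward()`, then `scaler.step(optimizer)`, then `optimizer.zero_grad()`.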
Currently I use `to_local()` in a no-grad context, then `torch._foreach_norm`, then `torch.stack()`, then manually square and all-reduce the norms in one go. There was some issue with calling `torch._foreach_norm` on...
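Concretely, that first part looks roughly like this (a sketch assuming fully sharded gradients, so summing the squared local norms across ranks gives the global squared norm; the helper name is mine):

```python
import torch
import torch.distributed as dist
from torch.distributed.tensor import DTensor  # torch.distributed._tensor on older builds

@torch.no_grad()
def global_grad_norm(params, process_group=None):
    # Work on local shards so the foreach kernel sees plain dense tensors.
    local_grads = [
        p.grad.to_local() if isinstance(p.grad, DTensor) else p.grad
        for p in params if p.grad is not None
    ]
    # One fused kernel for all per-tensor norms, stacked into a single tensor.
    norms = torch.stack(torch._foreach_norm(local_grads))
    # Square and sum locally, then a single all-reduce across ranks.
    total_sq = norms.square().sum()
    dist.all_reduce(total_sq, op=dist.ReduceOp.SUM, group=process_group)
    return total_sq.sqrt()
```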
I tested `_foreach_norm(DTensor)`; the prior error disappeared. It's slower than using local tensors, however. https://api.wandb.ai/links/novelaix/va6xg9g8 Interestingly, the `train/overhead` panel also shows a slowdown - but this measures time doing Python...
https://x.com/jsuarez5341/status/1938287195305005500
This patch doesn't make sense in my opinion. Optimizers have more state than `step`, and I doubt 0 is the correct step to revert to.
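A quick illustration of the first point: Adam keeps running moments per parameter, not just a step counter.

```python
import torch

p = torch.nn.Parameter(torch.randn(4))
opt = torch.optim.Adam([p])
p.grad = torch.randn(4)
opt.step()
# Per-parameter state holds the running moments as well as the step counter:
print(opt.state[p].keys())  # dict_keys(['step', 'exp_avg', 'exp_avg_sq'])
```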