Kevin Yin
The same thing holds for Smoothing and Hover: [X/X Unified/Closest]... So I guess the real request is: all the Tasks in a Project have similar properties, and I would like configurations...
After exploring further, I've been able to clarify my thoughts. In ClearML, the charts in each task in each project are considered separately from the charts in every other task....
Wandb has this feature. It is important in spotting gradient instability, cyclic behaviors, and spikes in runtime.
Why does the fp32 -> bf16 autocast run have such poor loss curves? Shouldn't fp32 master weights + bf16 model weights at least be better than bf16 + stochastic rounding?
fp16 is used for diffusion models, so the lack of a gradient scaler is a blocker there. However, https://github.com/pytorch/pytorch/pull/116054 seems relevant, which would complicate the work.
Update: My own fp16 grad scaler is 50 lines. It doesn't handle the fancy mixed-device things that the PyTorch grad scaler does. It's somewhat faster by avoiding unnecessary kernel launches...
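For context, the rough shape is something like this (a simplified single-device sketch, not my actual code; class and argument names are just illustrative):

```python
import torch

class SimpleGradScaler:
    """Minimal fp16 loss scaler: scale the loss up, unscale grads with one
    foreach kernel, and skip the optimizer step if any grad is inf/nan."""

    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def scale_loss(self, loss):
        # Scale the loss so fp16 grads don't underflow during backward.
        return loss * self.scale

    def step(self, optimizer):
        grads = [p.grad for group in optimizer.param_groups
                 for p in group["params"] if p.grad is not None]
        # Unscale all grads with a single foreach kernel launch.
        torch._foreach_mul_(grads, 1.0 / self.scale)
        # One finiteness check over all the per-tensor norms.
        norms = torch.stack(torch._foreach_norm(grads))
        overflow = not torch.isfinite(norms).all().item()
        if overflow:
            # Overflow: skip the update and back off the scale.
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            optimizer.step()
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                self.scale *= self.growth_factor
                self._good_steps = 0
        return not overflow
```

Usage is the usual pattern: `scaler.scale_loss(loss).backward()`, then `scaler.step(optimizer)`, then `optimizer.zero_grad()`.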
Currently I use `to_local()` in a no-grad context, then `torch._foreach_norm`, then `torch.stack()`, then manually square and all-reduce the norms in one go. There was some issue with calling `torch._foreach_norm` on...
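Concretely, that first part looks roughly like this (a sketch assuming fully sharded gradients, so summing the squared local norms across ranks gives the global squared norm; the helper name is mine):

```python
import torch
import torch.distributed as dist
from torch.distributed.tensor import DTensor  # torch.distributed._tensor on older builds

@torch.no_grad()
def global_grad_norm(params, process_group=None):
    # Work on local shards so the foreach kernel sees plain dense tensors.
    local_grads = [
        p.grad.to_local() if isinstance(p.grad, DTensor) else p.grad
        for p in params if p.grad is not None
    ]
    # One fused kernel for all per-tensor norms, stacked into a single tensor.
    norms = torch.stack(torch._foreach_norm(local_grads))
    # Square and sum locally, then a single all-reduce across ranks.
    total_sq = norms.square().sum()
    dist.all_reduce(total_sq, op=dist.ReduceOp.SUM, group=process_group)
    return total_sq.sqrt()
```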
I tested `_foreach_norm(DTensor)`; the prior error disappeared. It's slower than using local tensors, however. https://api.wandb.ai/links/novelaix/va6xg9g8 Interestingly, the `train/overhead` panel also shows a slowdown - but this measures time doing Python...
https://x.com/jsuarez5341/status/1938287195305005500
This patch doesn't make sense in my opinion. Optimizers have more state than `step`, and I doubt 0 is the correct step to revert to.
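A quick illustration of the first point: Adam keeps running moments per parameter, not just a step counter.

```python
import torch

p = torch.nn.Parameter(torch.randn(4))
opt = torch.optim.Adam([p])
p.grad = torch.randn(4)
opt.step()
# Per-parameter state holds the running moments as well as the step counter:
print(opt.state[p].keys())  # dict_keys(['step', 'exp_avg', 'exp_avg_sq'])
```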