Tim Moon


Merging with approval from @ptrendx and @ksivaman.

Logically, `scale` is part of the recipe and `scale_inv` is part of the data:

- You can change `scale` at any time, for whatever reason, to any value. It is...
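A minimal sketch of that split, assuming a delayed-scaling-style FP8 recipe: the recipe picks `scale` to map values into the representable range, while `scale_inv` is recorded alongside the quantized data so it can be decoded later, even after `scale` changes. (Pure-Python illustration, not Transformer Engine's actual kernels.)

```python
FP8_E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3

def quantize(values, scale):
    """Scale values toward the FP8 range; return data plus its scale_inv."""
    data = [v * scale for v in values]  # real kernels would cast to FP8 here
    return data, 1.0 / scale            # scale_inv travels with the data

def dequantize(data, scale_inv):
    """Decode using the scale_inv stored with the data, not the live scale."""
    return [d * scale_inv for d in data]

amax = 2.0
scale = FP8_E4M3_MAX / amax                    # recipe chooses the scale
data, scale_inv = quantize([0.5, -2.0], scale)
restored = dequantize(data, scale_inv)         # approximately [0.5, -2.0]
```

Because `scale_inv` is captured at quantization time, updating `scale` afterward does not invalidate already-quantized tensors.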

For better or for worse, I think "AdamW" now refers to the LR-coupled version. In addition to [PyTorch](https://github.com/pytorch/pytorch/blob/d921891f5788b37ea92eceddf7417d11e44290e6/torch/optim/_functional.py#L125) and [JAX](https://github.com/google-deepmind/optax/blob/2cdb89cc4935d8dc5c8a06344e7d50dc7a7419b0/optax/_src/alias.py#L640), I see this formulation in [Keras](https://github.com/keras-team/keras/blob/8f5592bcb61ff48c96560c8923e482db1076b54a/keras/src/optimizers/base_optimizer.py#L828C44-L829C1) (and therefore [TensorFlow](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/AdamW)), [PaddlePaddle](https://github.com/PaddlePaddle/Paddle/blob/2aebcd88c95ac8c2d917f6485c295f8837f71131/python/paddle/optimizer/adamw.py#L66),...
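For reference, a pure-Python sketch of the two conventions, with hypothetical names and an already-computed Adam update direction assumed; in the LR-coupled form the weight decay is multiplied by the learning rate, in the original decoupled form it is not:

```python
def adamw_coupled_step(param, adam_update, lr, weight_decay):
    """LR-coupled AdamW (PyTorch/JAX/Keras convention):
    param <- param - lr * (adam_update + weight_decay * param)"""
    return param - lr * (adam_update + weight_decay * param)

def adamw_decoupled_step(param, adam_update, lr, weight_decay):
    """Decoupled form from the original AdamW paper:
    param <- param - lr * adam_update - weight_decay * param"""
    return param - lr * adam_update - weight_decay * param
```

With `lr=0.1`, `weight_decay=0.01`, `param=1.0`, and `adam_update=0.5`, the coupled step shrinks the parameter by `lr * weight_decay * param = 0.001` from decay, while the decoupled step removes the full `weight_decay * param = 0.01`, so the two diverge unless the decay rate is rescaled by the learning rate.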

Can you provide a minimal reproducer? The following runs for me:

```python
import torch
import transformer_engine.pytorch as te

# Options
batch_size = 128
hidden_size = 128
dtype = torch.float32
device...
```

Most of our kernels don't handle `Float8Tensor` directly. Also, our RMSNorm kernel doesn't support FP8 input at the moment, just FP8 output. As a quick fix, you could manually cast...
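The cast-then-normalize workaround can be sketched as follows (pure Python for illustration; with Transformer Engine one would dequantize the `Float8Tensor` to a higher-precision dtype before the norm, and the exact API call is omitted here):

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Reference RMSNorm on a list of floats:
    y_i = w_i * x_i / sqrt(mean(x^2) + eps)"""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

# Hypothetical workaround: decode the FP8 data to full precision first,
# since the norm kernel here supports FP8 output but not FP8 input.
fp8_data, scale_inv = [112.0, -224.0], 1.0 / 224.0  # quantized values
x_hp = [d * scale_inv for d in fp8_data]            # manual upcast/dequantize
y = rms_norm(x_hp, weight=[1.0, 1.0])               # runs in high precision
```

The extra cast costs a memory round trip, which is why handling FP8 input directly in the kernel would be the longer-term fix.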