Tim Moon


Can you provide more information or a minimal reproducer? This error suggests that the tensor-parallel group has not been properly configured. If you are using one of [Megatron-LM's TE wrappers](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/custom_layers/transformer_engine.py),...
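
If you are building the layers directly rather than through those wrappers, here is a rough sketch of configuring the group by hand (assuming `te.Linear`'s `tp_group` / `tp_size` / `parallel_mode` arguments; the Megatron-LM wrappers pass the tensor-model-parallel group for you):

```python
import os
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te

# Launched with torchrun; every rank joins the same tensor-parallel group here.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
tp_group = dist.new_group(ranks=list(range(dist.get_world_size())))

# Column-parallel linear layer: the output dimension is sharded across the
# TP ranks, so out_features must be divisible by tp_size.
layer = te.Linear(
    1024,
    1024,
    tp_group=tp_group,
    tp_size=dist.get_world_size(tp_group),
    parallel_mode="column",
)
```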

Great debugging. It's tricky that [round-to-nearest (`rn`) rounding](https://docs.nvidia.com/cuda/floating-point/index.html#rounding-modes) is irreversible unless we store an extra bit, which seems excessive given that these errors are just at the level of machine epsilon....
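
To make the irreversibility concrete, here is a small stand-alone illustration (PyTorch's fp32-to-bf16 cast rounds to nearest, ties to even):

```python
import torch

# Two fp32 values that sit exactly halfway between neighboring bf16 values.
a = torch.tensor([1.00390625])   # bits 0x3F808000
b = torch.tensor([0.998046875])  # bits 0x3F7F8000

# Round-to-nearest-even maps both onto the same bf16 value ...
print(a.to(torch.bfloat16).item(), b.to(torch.bfloat16).item())  # 1.0 1.0

# ... and both leave the identical 16-bit remainder (0x8000), so the pair
# (bf16, remainder) can no longer distinguish a from b.
print(hex(a.view(torch.int32).item() & 0xFFFF),
      hex(b.view(torch.int32).item() & 0xFFFF))  # 0x8000 0x8000
```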

Both `_bf16_rem_to_fp32` and the Adam kernel use "round to nearest, ties away from zero", so you should get bit-wise exact results when saving/loading state dicts. However, direct type casts (e.g....
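
This is not the actual kernel, just a NumPy sketch of why a bf16 + 16-bit-remainder split is exactly reversible when the split rounds ties away from zero: the remainder alone tells you whether the bf16 part was rounded up, so the rounding can be undone bit-for-bit.

```python
import numpy as np

def split_fp32(x):
    """Split fp32 into a bf16 bit pattern (round to nearest, ties away from
    zero) plus the 16 low-order bits of the original fp32 value."""
    bits = x.view(np.uint32)
    rem = (bits & 0xFFFF).astype(np.uint16)
    top = (bits >> 16).astype(np.uint16)
    # Dropped bits >= half a ULP (including the exact tie) round the bf16
    # magnitude up by one.
    top = np.where(rem >= 0x8000, top + np.uint16(1), top)
    return top, rem

def merge_fp32(top, rem):
    """Reconstruct the original fp32 bit pattern exactly."""
    # The remainder tells us whether the split rounded up, so undo it.
    top = np.where(rem >= 0x8000, top - np.uint16(1), top)
    bits = (top.astype(np.uint32) << 16) | rem.astype(np.uint32)
    return bits.view(np.float32)

x = np.random.randn(1 << 16).astype(np.float32)
top, rem = split_fp32(x)
assert np.array_equal(merge_fp32(top, rem).view(np.uint32), x.view(np.uint32))
```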

We should make sure your system is configured properly and that the distributed job is launched correctly. It's odd that `fsdp.py` didn't print out the world size after initialization: https://github.com/NVIDIA/TransformerEngine/blob/8e039fdcd98fc56582d81e373880c1509c2b8f73/examples/pytorch/fsdp/fsdp.py#L207...
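
As a first step, a tiny stand-alone script (the file name `check_dist.py` is just a placeholder) can confirm that the launcher and NCCL setup are sane before involving FSDP at all:

```python
# Run with: torchrun --nproc_per_node=<num_gpus> check_dist.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
print(f"rank {dist.get_rank()} of {dist.get_world_size()} "
      f"(local rank {local_rank}, device {torch.cuda.current_device()})")
dist.destroy_process_group()
```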

Interesting, so we need to figure out why the toy script worked while the FSDP script failed somewhere before: https://github.com/NVIDIA/TransformerEngine/blob/8e039fdcd98fc56582d81e373880c1509c2b8f73/examples/pytorch/fsdp/fsdp.py#L205-L207 Differences I can see: - `python -m torch.distributed.launch` vs `torchrun` -...
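
The launcher difference matters because the two tools hand the local rank to the script differently: `torch.distributed.launch` passes a `--local_rank` (newer releases: `--local-rank`) command-line flag, while `torchrun` only sets the `LOCAL_RANK` environment variable. A sketch that tolerates both conventions:

```python
import argparse
import os

# Accept the flag from torch.distributed.launch and fall back to torchrun's
# LOCAL_RANK environment variable.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", "--local-rank", type=int, default=None)
args, _ = parser.parse_known_args()

local_rank = (args.local_rank if args.local_rank is not None
              else int(os.environ.get("LOCAL_RANK", 0)))
print(f"local rank: {local_rank}")
```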

Adding to this, FSDP support should just be a matter of implementing `fsdp_pre_all_gather` and `fsdp_post_all_gather` methods in `Float8Tensor`, at least in principle.
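
Very roughly, and with the caveat that the exact hook signatures have changed across PyTorch releases, the shape of those two methods looks something like this toy (non-TE) sketch, where `_data` and `_scale_inv` are hypothetical names for the uint8 payload and its dequantization scale:

```python
import torch

class ToyFp8Tensor:
    """Toy stand-in for Float8Tensor; attribute names are hypothetical."""

    def __init__(self, data: torch.Tensor, scale_inv: torch.Tensor):
        self._data = data            # uint8 payload
        self._scale_inv = scale_inv  # dequantization scale

    def fsdp_pre_all_gather(self, mesh):
        # Ask FSDP to all-gather only the raw uint8 payload and carry the
        # scale along as opaque metadata.
        return (self._data,), (self._scale_inv,)

    def fsdp_post_all_gather(self, all_gather_outputs, metadata, param_dtype,
                             *, out=None):
        (data,) = all_gather_outputs
        (scale_inv,) = metadata
        if out is not None:
            # FSDP may hand back a preallocated output to fill in place.
            out._data.copy_(data)
            return
        # Rebuild the quantized tensor from the gathered payload.
        return ToyFp8Tensor(data, scale_inv), (data,)
```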

- CMake is unable to find a C++ compiler in the usual places (e.g. `/usr/bin/c++`). Try setting `CXX` in the environment to the path of your compiler. We usually build...