Tim Moon


Can you provide more information or a minimal reproducer? This error suggests that the tensor-parallel group has not been properly configured. If you are using one of [Megatron-LM's TE wrappers](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/custom_layers/transformer_engine.py),...
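
If you are building the layers directly rather than through those wrappers, here is a rough sketch of configuring the group by hand (assuming `te.Linear`'s `tp_group` / `tp_size` / `parallel_mode` arguments; the Megatron-LM wrappers pass the tensor-model-parallel group for you):

```python
import os
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te

# Launched with torchrun; every rank joins the same tensor-parallel group here.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
tp_group = dist.new_group(ranks=list(range(dist.get_world_size())))

# Column-parallel linear layer: the output dimension is sharded across the
# TP ranks, so out_features must be divisible by tp_size.
layer = te.Linear(
    1024,
    1024,
    tp_group=tp_group,
    tp_size=dist.get_world_size(tp_group),
    parallel_mode="column",
)
```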

Great debugging. It's tricky that [round-to-nearest (`rn`) rounding](https://docs.nvidia.com/cuda/floating-point/index.html#rounding-modes) is irreversible unless we store an extra bit, which seems excessive given that these errors are just at the level of machine epsilon....
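
To make the irreversibility concrete, here is a small stand-alone illustration (PyTorch's fp32-to-bf16 cast rounds to nearest, ties to even):

```python
import torch

# Two fp32 values that sit exactly halfway between neighboring bf16 values.
a = torch.tensor([1.00390625])   # bits 0x3F808000
b = torch.tensor([0.998046875])  # bits 0x3F7F8000

# Round-to-nearest-even maps both onto the same bf16 value ...
print(a.to(torch.bfloat16).item(), b.to(torch.bfloat16).item())  # 1.0 1.0

# ... and both leave the identical 16-bit remainder (0x8000), so the pair
# (bf16, remainder) can no longer distinguish a from b.
print(hex(a.view(torch.int32).item() & 0xFFFF),
      hex(b.view(torch.int32).item() & 0xFFFF))  # 0x8000 0x8000
```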

Both `_bf16_rem_to_fp32` and the Adam kernel use "round to nearest, ties away from zero", so you should get bit-wise exact results when saving/loading state dicts. However, direct type casts (e.g....
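
This is not the actual kernel, just a NumPy sketch of why a bf16 + 16-bit-remainder split is exactly reversible when the split rounds ties away from zero: the remainder alone tells you whether the bf16 part was rounded up, so the rounding can be undone bit-for-bit.

```python
import numpy as np

def split_fp32(x):
    """Split fp32 into a bf16 bit pattern (round to nearest, ties away from
    zero) plus the 16 low-order bits of the original fp32 value."""
    bits = x.view(np.uint32)
    rem = (bits & 0xFFFF).astype(np.uint16)
    top = (bits >> 16).astype(np.uint16)
    # Dropped bits >= half a ULP (including the exact tie) round the bf16
    # magnitude up by one.
    top = np.where(rem >= 0x8000, top + np.uint16(1), top)
    return top, rem

def merge_fp32(top, rem):
    """Reconstruct the original fp32 bit pattern exactly."""
    # The remainder tells us whether the split rounded up, so undo it.
    top = np.where(rem >= 0x8000, top - np.uint16(1), top)
    bits = (top.astype(np.uint32) << 16) | rem.astype(np.uint32)
    return bits.view(np.float32)

x = np.random.randn(1 << 16).astype(np.float32)
top, rem = split_fp32(x)
assert np.array_equal(merge_fp32(top, rem).view(np.uint32), x.view(np.uint32))
```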

We should make sure your system is configured properly and that the distributed job is launched correctly. It's odd that `fsdp.py` didn't print out the world size after initialization: https://github.com/NVIDIA/TransformerEngine/blob/8e039fdcd98fc56582d81e373880c1509c2b8f73/examples/pytorch/fsdp/fsdp.py#L207...
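
As a first step, a tiny stand-alone script (the file name `check_dist.py` is just a placeholder) can confirm that the launcher and NCCL setup are sane before involving FSDP at all:

```python
# Run with: torchrun --nproc_per_node=<num_gpus> check_dist.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
print(f"rank {dist.get_rank()} of {dist.get_world_size()} "
      f"(local rank {local_rank}, device {torch.cuda.current_device()})")
dist.destroy_process_group()
```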

Interesting, so we need to figure out why the toy script worked while the FSDP script failed somewhere before: https://github.com/NVIDIA/TransformerEngine/blob/8e039fdcd98fc56582d81e373880c1509c2b8f73/examples/pytorch/fsdp/fsdp.py#L205-L207 Differences I can see: - `python -m torch.distributed.launch` vs `torchrun` -...
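
The launcher difference matters because the two tools hand the local rank to the script differently: `torch.distributed.launch` passes a `--local_rank` (newer releases: `--local-rank`) command-line flag, while `torchrun` only sets the `LOCAL_RANK` environment variable. A sketch that tolerates both conventions:

```python
import argparse
import os

# Accept the flag from torch.distributed.launch and fall back to torchrun's
# LOCAL_RANK environment variable.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", "--local-rank", type=int, default=None)
args, _ = parser.parse_known_args()

local_rank = (args.local_rank if args.local_rank is not None
              else int(os.environ.get("LOCAL_RANK", 0)))
print(f"local rank: {local_rank}")
```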

Adding to this, FSDP support should just be a matter of implementing `fsdp_pre_all_gather` and `fsdp_post_all_gather` methods in `Float8Tensor`, at least in principle.
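
Very roughly, and with the caveat that the exact hook signatures have changed across PyTorch releases, the shape of those two methods looks something like this toy (non-TE) sketch, where `_data` and `_scale_inv` are hypothetical names for the uint8 payload and its dequantization scale:

```python
import torch

class ToyFp8Tensor:
    """Toy stand-in for Float8Tensor; attribute names are hypothetical."""

    def __init__(self, data: torch.Tensor, scale_inv: torch.Tensor):
        self._data = data            # uint8 payload
        self._scale_inv = scale_inv  # dequantization scale

    def fsdp_pre_all_gather(self, mesh):
        # Ask FSDP to all-gather only the raw uint8 payload and carry the
        # scale along as opaque metadata.
        return (self._data,), (self._scale_inv,)

    def fsdp_post_all_gather(self, all_gather_outputs, metadata, param_dtype,
                             *, out=None):
        (data,) = all_gather_outputs
        (scale_inv,) = metadata
        if out is not None:
            # FSDP may hand back a preallocated output to fill in place.
            out._data.copy_(data)
            return
        # Rebuild the quantized tensor from the gathered payload.
        return ToyFp8Tensor(data, scale_inv), (data,)
```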

- CMake is unable to find a C++ compiler in the usual places (e.g. `/usr/bin/c++`). Try setting `CXX` in the environment to the path of your compiler. We usually build...