Junjie ZHANG

Results: 6 issues by Junjie ZHANG

I noticed that [CP grad is also reduced in DDP](https://github.com/NVIDIA/Megatron-LM/blob/1585be2ab23de84cd5e2fa6d3973053c32eabf48/megatron/core/distributed/distributed_data_parallel.py#L453), indicating that [grads are first reduced across micro-batches and then across the CP group](https://github.com/NVIDIA/Megatron-LM/blob/1585be2ab23de84cd5e2fa6d3973053c32eabf48/megatron/core/pipeline_parallel/schedules.py#L490), thereby minimizing communication cost. However, the reduction...
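To illustrate why that ordering matters, here is a minimal sketch (not Megatron-LM code; `all_reduce` is a hypothetical stand-in for `torch.distributed.all_reduce`) comparing one collective per micro-batch against accumulating locally and reducing once:

```python
def all_reduce(values, comm_counter):
    """Sum the per-rank values (one simulated collective call)."""
    comm_counter[0] += 1
    total = sum(values)
    return [total] * len(values)

# 2 CP ranks, 3 micro-batches; grads[rank][mb] is that rank's gradient.
grads = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
ranks, mbs = 2, 3

# Strategy A: all-reduce after every micro-batch (mbs collective calls).
comms_a = [0]
acc_a = [0.0] * ranks
for mb in range(mbs):
    reduced = all_reduce([grads[r][mb] for r in range(ranks)], comms_a)
    for r in range(ranks):
        acc_a[r] += reduced[r]

# Strategy B (the ordering the linked schedule uses): accumulate locally
# over all micro-batches, then one all-reduce across the CP group.
comms_b = [0]
local = [sum(grads[r]) for r in range(ranks)]
acc_b = all_reduce(local, comms_b)

assert acc_a == acc_b          # same final gradient on every rank
print(comms_a[0], comms_b[0])  # 3 vs 1 collective calls
```

Because summation commutes with the all-reduce, both strategies produce identical gradients, but reducing after accumulation issues one collective per training step instead of one per micro-batch.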

The conflict is that Figure 3 indicates the input of $$TRM_k$$ is the tokens (k:T-k), while formula 22 indicates the input of $$TRM_k$$ is the T-k tokens (1:T-k)....
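A quick arithmetic check makes the mismatch concrete (T and k here are example values; inclusive 1-based ranges are assumed, following the paper's notation):

```python
# Length of each claimed input range for TRM_k, inclusive 1-based indexing.
T, k = 10, 3

len_fig = (T - k) - k + 1   # tokens (k : T-k)  -> T - 2k + 1 tokens
len_eq  = (T - k) - 1 + 1   # tokens (1 : T-k)  -> T - k tokens

print(len_fig, len_eq)
assert len_fig != len_eq    # the two readings disagree whenever k > 1
```

The two ranges differ in length by k - 1 tokens, so the figure and the formula can only agree when k = 1.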

stale

Hi, as discussed in https://github.com/pytorch/torchtitan/issues/903. This PR adds support for training a Llama model directly from HF using `AutoModelForCausalLM`, and for loading safetensors (HF weights) in an online-sharding manner. ...
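A minimal sketch of the online-sharding idea described above (not the PR's actual code; `shard_for_rank` and the toy weight are hypothetical, and the real PR operates on safetensors files and torch tensors): each rank materializes only its own slice of a full HF weight rather than the whole tensor.

```python
def shard_for_rank(full_weight, rank, world_size):
    """Return the contiguous row-slice of `full_weight` owned by `rank`
    (sharding along dim 0, ceil-divided so every row is covered)."""
    rows = len(full_weight)
    per_rank = (rows + world_size - 1) // world_size
    start = rank * per_rank
    return full_weight[start:start + per_rank]

# A toy 4x2 "weight matrix" sharded across 2 ranks along dim 0.
weight = [[0, 1], [2, 3], [4, 5], [6, 7]]
shards = [shard_for_rank(weight, r, 2) for r in range(2)]
print(shards[0])  # [[0, 1], [2, 3]]
print(shards[1])  # [[4, 5], [6, 7]]
```

With safetensors this pattern is attractive because the format supports reading tensor slices lazily, so a rank never needs the full checkpoint in host memory.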

CLA Signed

Hi! We are developing a novel training framework for Reinforcement Learning (RL) based on TorchTitan. Recently, we developed a feature to support training directly from Hugging Face...

huggingface integration
community help wanted

# Motivation

A possible way to remove the record-stream call in normal mode by holding a reference instead. See issue #455.

# Test results

Tested on a downstream training task; no problems were encountered....

Due to the behavior of the CUDACachingAllocator, `record_stream` leads to delayed memory frees, which has a significant impact on peak memory. (See [FSDP1's issue due to record_stream](https://dev-discuss.pytorch.org/t/fsdp-cudacachingallocator-an-outsider-newb-perspective/1486).) In PyTorch...
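The hold-a-reference alternative can be sketched in pure Python (no CUDA involved; `Event` and the byte buffer are stand-ins for `torch.cuda.Event` and an allocator-managed tensor, and `DeferredFreeQueue` is a hypothetical name):

```python
class Event:
    """Mock of a CUDA event recorded after the consumer stream's last use."""
    def __init__(self):
        self.done = False
    def query(self):          # mimics torch.cuda.Event.query()
        return self.done

class DeferredFreeQueue:
    """Keep buffers alive until their event completes, then drop the
    reference so the allocator can reuse the memory right away,
    avoiding the allocator-side delay record_stream can introduce."""
    def __init__(self):
        self._pending = []    # list of (event, buffer)
    def defer(self, event, buffer):
        self._pending.append((event, buffer))
    def flush(self):
        # Drop references whose event has completed.
        self._pending = [(e, b) for e, b in self._pending if not e.query()]

q = DeferredFreeQueue()
ev, buf = Event(), bytearray(1024)   # stand-in for a CUDA buffer
q.defer(ev, buf)

q.flush()
print(len(q._pending))  # 1 -> event not done, buffer kept alive

ev.done = True           # consumer stream finished with the buffer
q.flush()
print(len(q._pending))  # 0 -> reference dropped, memory reusable
```

The key difference from `record_stream` is that the producer, not the caching allocator, decides exactly when the block becomes reusable, which keeps the memory peak predictable.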