Junjie ZHANG

Results: 6 issues by Junjie ZHANG

I noticed that [CP grad is also reduced in DDP](https://github.com/NVIDIA/Megatron-LM/blob/1585be2ab23de84cd5e2fa6d3973053c32eabf48/megatron/core/distributed/distributed_data_parallel.py#L453), indicating that [grads are first reduced across micro-batches and then across the CP group](https://github.com/NVIDIA/Megatron-LM/blob/1585be2ab23de84cd5e2fa6d3973053c32eabf48/megatron/core/pipeline_parallel/schedules.py#L490), thereby minimizing communication cost. However, the reduction...
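To illustrate why that ordering matters, here is a minimal sketch (not Megatron-LM code; `all_reduce` is a hypothetical stand-in for `torch.distributed.all_reduce`) comparing one collective per micro-batch against accumulating locally and reducing once:

```python
def all_reduce(values, comm_counter):
    """Sum the per-rank values (one simulated collective call)."""
    comm_counter[0] += 1
    total = sum(values)
    return [total] * len(values)

# 2 CP ranks, 3 micro-batches; grads[rank][mb] is that rank's gradient.
grads = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
ranks, mbs = 2, 3

# Strategy A: all-reduce after every micro-batch (mbs collective calls).
comms_a = [0]
acc_a = [0.0] * ranks
for mb in range(mbs):
    reduced = all_reduce([grads[r][mb] for r in range(ranks)], comms_a)
    for r in range(ranks):
        acc_a[r] += reduced[r]

# Strategy B (the ordering the linked schedule uses): accumulate locally
# over all micro-batches, then one all-reduce across the CP group.
comms_b = [0]
local = [sum(grads[r]) for r in range(ranks)]
acc_b = all_reduce(local, comms_b)

assert acc_a == acc_b          # same final gradient on every rank
print(comms_a[0], comms_b[0])  # 3 vs 1 collective calls
```

Because summation commutes with the all-reduce, both strategies produce identical gradients, but reducing after accumulation issues one collective per training step instead of one per micro-batch.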

The conflict is that Figure 3 indicates the input of $$TRM_k$$ is the tokens (k:T-k), while formula 22 indicates the input of $$TRM_k$$ is the T-k tokens (1:T-k)....
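A quick arithmetic check makes the mismatch concrete (T and k here are example values; inclusive 1-based ranges are assumed, following the paper's notation):

```python
# Length of each claimed input range for TRM_k, inclusive 1-based indexing.
T, k = 10, 3

len_fig = (T - k) - k + 1   # tokens (k : T-k)  -> T - 2k + 1 tokens
len_eq  = (T - k) - 1 + 1   # tokens (1 : T-k)  -> T - k tokens

print(len_fig, len_eq)
assert len_fig != len_eq    # the two readings disagree whenever k > 1
```

The two ranges differ in length by k - 1 tokens, so the figure and the formula can only agree when k = 1.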

stale

Hi, as discussed in https://github.com/pytorch/torchtitan/issues/903. This PR adds support for training a Llama model directly from HF using `AutoModelForCausalLM`, and for loading safetensors (HF weights) in an online-sharding manner. ...
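A minimal sketch of the online-sharding idea described above (not the PR's actual code; `shard_for_rank` and the toy weight are hypothetical, and the real PR operates on safetensors files and torch tensors): each rank materializes only its own slice of a full HF weight rather than the whole tensor.

```python
def shard_for_rank(full_weight, rank, world_size):
    """Return the contiguous row-slice of `full_weight` owned by `rank`
    (sharding along dim 0, ceil-divided so every row is covered)."""
    rows = len(full_weight)
    per_rank = (rows + world_size - 1) // world_size
    start = rank * per_rank
    return full_weight[start:start + per_rank]

# A toy 4x2 "weight matrix" sharded across 2 ranks along dim 0.
weight = [[0, 1], [2, 3], [4, 5], [6, 7]]
shards = [shard_for_rank(weight, r, 2) for r in range(2)]
print(shards[0])  # [[0, 1], [2, 3]]
print(shards[1])  # [[4, 5], [6, 7]]
```

With safetensors this pattern is attractive because the format supports reading tensor slices lazily, so a rank never needs the full checkpoint in host memory.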

CLA Signed

Hi! We are developing a novel training framework for Reinforcement Learning (RL) based on TorchTitan. Recently, we developed a feature to support training directly from Hugging Face...

huggingface integration
community help wanted

# Motivation

A possible way to remove the record-stream call in normal mode by holding a reference instead. See issue #455.

# Test results

Tested on a downstream training task; no problems were encountered....

Due to the behavior of the CUDACachingAllocator, `record_stream` leads to delayed memory frees, which has a significant impact on peak memory. (See [FSDP1's issue due to record_stream](https://dev-discuss.pytorch.org/t/fsdp-cudacachingallocator-an-outsider-newb-perspective/1486).) In PyTorch...
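The hold-a-reference alternative can be sketched in pure Python (no CUDA involved; `Event` and the byte buffer are stand-ins for `torch.cuda.Event` and an allocator-managed tensor, and `DeferredFreeQueue` is a hypothetical name):

```python
class Event:
    """Mock of a CUDA event recorded after the consumer stream's last use."""
    def __init__(self):
        self.done = False
    def query(self):          # mimics torch.cuda.Event.query()
        return self.done

class DeferredFreeQueue:
    """Keep buffers alive until their event completes, then drop the
    reference so the allocator can reuse the memory right away,
    avoiding the allocator-side delay record_stream can introduce."""
    def __init__(self):
        self._pending = []    # list of (event, buffer)
    def defer(self, event, buffer):
        self._pending.append((event, buffer))
    def flush(self):
        # Drop references whose event has completed.
        self._pending = [(e, b) for e, b in self._pending if not e.query()]

q = DeferredFreeQueue()
ev, buf = Event(), bytearray(1024)   # stand-in for a CUDA buffer
q.defer(ev, buf)

q.flush()
print(len(q._pending))  # 1 -> event not done, buffer kept alive

ev.done = True           # consumer stream finished with the buffer
q.flush()
print(len(q._pending))  # 0 -> reference dropped, memory reusable
```

The key difference from `record_stream` is that the producer, not the caching allocator, decides exactly when the block becomes reusable, which keeps the memory peak predictable.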