Junjie Mao
> Thanks for the detailed testing—super helpful! These errors match known PyTorch issues with Compiled Autograd + distributed/mixed precision:
>
> 1. Eager/bfloat16: This is a known PyTorch bug in...
@stas00 Following your suggestion, I just created this issue to focus discussion on the DeviceMesh topic. Please review and feel free to comment.
@stas00 Thanks for the comments! That helps me understand the problem better. Indeed, `DeviceMesh` does not provide the flexibility our usage requires, so we must either extend it or base ourselves...
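For context, here is a minimal sketch of the `DeviceMesh` API under discussion, assuming a recent PyTorch where `torch.distributed.device_mesh` is available; the dimension names and sizes are purely illustrative:

```python
from torch.distributed.device_mesh import init_device_mesh

# Illustrative only: a 2x4 mesh with named data-parallel and
# tensor-parallel dimensions (8 ranks total, launched via torchrun).
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

# Each named dimension exposes its own sub-mesh / process group.
# This fixed, declared-up-front layout is what is being contrasted
# with mpu's more mutable set_xx-style APIs.
dp_group = mesh["dp"].get_group()
tp_group = mesh["tp"].get_group()
```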
> > mpu provides a series of `set_xx` APIs for setting world sizes or ranks without touching any created groups. I was wondering what their primary use cases are.
>
> ...
@stas00 We can take a closer look at refreshing process group management in the next couple of months. May I know if you have any detailed expectations for the refreshed...
@tohtana Any idea whether this two-grad phenomenon is expected? If so, should we add a `None` check at the beginning of `_backward_prologue_per_tensor`?
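For illustration, a minimal sketch of the proposed guard, assuming a hypothetical `_backward_prologue_per_tensor(self, grad)` signature and a hypothetical scaling step (neither is DeepSpeed's actual code):

```python
def _backward_prologue_per_tensor(self, grad):
    # Hypothetical sketch: bail out early when no gradient flowed
    # back for this output tensor, instead of trying to scale None.
    if grad is None:
        return grad
    # Hypothetical scaling step standing in for whatever the real
    # prologue does with a non-None gradient.
    return grad * self.loss_scale
```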
> Thank you for reporting, [@eternalNight](https://github.com/eternalNight)! I didn't expect the case. Do you think we can simply skip the scaling when the given value is None?

@tohtana I'll investigate why...
@tohtana Here's the story:

1. Llama2 returns a `CausalLMOutputWithPast` (which extends `dict`) that contains a loss tensor (of size 1) and a logits tensor. DeepSpeed registers the backward hook on...
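A self-contained sketch of the two-hook situation described above; the output dict and the per-tensor hook registration are illustrative stand-ins, not DeepSpeed's actual engine code:

```python
import torch

# Illustrative model output: a dict-like object holding both a scalar
# loss and the logits it was computed from, as with
# CausalLMOutputWithPast.
logits = torch.randn(2, 4, requires_grad=True)
loss = logits.sum()
outputs = {"loss": loss, "logits": logits}

# Register a hook on every tensor in the output, as an engine walking
# the output dict might do.
for name, t in outputs.items():
    if torch.is_tensor(t) and t.requires_grad:
        t.register_hook(lambda g, name=name: print(name, g.shape))

# Backward fires both hooks: once for loss and once for logits, so
# the per-tensor prologue runs twice for a single backward pass.
loss.backward()
```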
The `self._parameters['cls_token'] rank mismatch` reason for the failure could be misleading. Even with a model that does compile (e.g., https://gist.github.com/eternalNight/89ad0639abba0d51ca7777a91d0b07a0 **with lines 39-40 commented out**), guard `check` succeeds but `check_verbose`...
> Hi there, so is the question fixed, or is there any instruction to avoid it?

I haven't found an opportunity to dig into the torch graph guard logic for the root cause yet...