Junjie Mao

22 comments by Junjie Mao

> Thanks for the detailed testing—super helpful! These errors match known PyTorch issues with Compiled Autograd + distributed/mixed precision:
>
> 1. Eager/bfloat16: This is a known PyTorch bug in...

@stas00 Following your suggestion, I just created this issue to focus discussion on the DeviceMesh topic. Please review and feel free to comment.

@stas00 Thanks for the comments! That helps me understand the problem better. Indeed, `DeviceMesh` does not provide the flexibility our usage needs, and we must either extend it or base ourselves...

> > mpu provides a series of set_xx APIs for setting world sizes or ranks without touching any created groups. I was wondering what are their primary use cases.
>
> ...

@stas00 We can take a closer look at refreshing process group management in the next couple of months. May I know if you have any detailed expectations for the refreshed...

@tohtana Any idea if this two-grad phenomenon is expected? If so, should we add a None check at the beginning of `_backward_prologue_per_tensor`?

> Thank you for reporting, [@eternalNight](https://github.com/eternalNight)! I didn't expect the case. Do you think we can simply skip the scaling when the given value is None?

@tohtana I'll investigate why...
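The skip-when-None idea discussed above could look roughly like this. This is a minimal sketch with illustrative names — the class, `loss_scale` attribute, and method shape are assumptions for demonstration, not DeepSpeed's actual implementation:

```python
class ScalerSketch:
    """Toy stand-in for an engine that scales gradients in its
    backward prologue (illustrative only, not DeepSpeed's class)."""

    def __init__(self, loss_scale):
        self.loss_scale = loss_scale

    def _backward_prologue_per_tensor(self, grad):
        # The hook may fire with grad=None (e.g. for a second output
        # tensor that receives no gradient); skip scaling in that case.
        if grad is None:
            return None
        return grad * self.loss_scale
```

With this guard, `ScalerSketch(2.0)._backward_prologue_per_tensor(None)` simply returns `None` instead of raising on the multiplication.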

@tohtana Here's the story:

1. Llama2 returns a `CausalLMOutputWithPast` (which extends `dict`) containing a loss tensor (of size 1) and a logits tensor. DeepSpeed registers the backward hook on...
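The two-grad phenomenon can be reproduced in isolation: if a hook is attached to every tensor in a dict-like model output, both the loss and the logits get one, and a single `backward()` call fires both. A minimal sketch (the output dict and hook bodies are illustrative stand-ins, not DeepSpeed's registration code):

```python
import torch

# Dict-like model output holding a scalar "loss" and a "logits"
# tensor, mimicking the shape of CausalLMOutputWithPast.
x = torch.randn(2, 4, requires_grad=True)
logits = x * 1.0        # non-leaf tensor standing in for logits
loss = logits.sum()     # scalar loss derived from logits

output = {"loss": loss, "logits": logits}

fired = []
for name, t in output.items():
    # Register one backward hook per output tensor, recording each call.
    t.register_hook(lambda grad, name=name: fired.append(name))

loss.backward()
# Both hooks fire during a single backward pass: once for the loss
# gradient and once for the logits gradient.
```

This matches the observation above: hooking every tensor in the output container yields one gradient callback per tensor, not one per backward pass.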

The `self._parameters['cls_token'] rank mismatch` reason given for the failure could be misleading. Even with a model that does compile (e.g., https://gist.github.com/eternalNight/89ad0639abba0d51ca7777a91d0b07a0 **with lines 39-40 commented out**), the guard `check` succeeds but `check_verbose`...

> Hi there, so is the question fixed or any instruction to avoid?

I haven't found an opportunity to dig into the torch graph guard logic for the root cause yet....