inkcherry
Hi @hijkzzz, are you referring to the saved weights being identical between the following two cases? 1. Training with AutoTP and saving via deepspeed.zero.GatheredParameters; 2. Training...
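For reference, a minimal sketch of the GatheredParameters saving pattern mentioned above (save_full_weights and the output path are hypothetical names; this assumes a ZeRO-3-partitioned model inside an initialized distributed run):

```python
import torch
import deepspeed

def save_full_weights(model, path="model.pt"):
    """Gather ZeRO-partitioned parameters and save the full weights on rank 0."""
    state_dict = {}
    for name, param in model.named_parameters():
        # GatheredParameters temporarily reconstructs the full parameter from
        # its partitions; modifier_rank=None means read-only access on all ranks.
        with deepspeed.zero.GatheredParameters(param, modifier_rank=None):
            if torch.distributed.get_rank() == 0:
                state_dict[name] = param.detach().cpu().clone()
    if torch.distributed.get_rank() == 0:
        torch.save(state_dict, path)
```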
In Qwen2.5-3B, the KV-head count is 2. If you want to set kv_head > 2, you would need to replicate the KV heads, and if you only run inference, the...
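A minimal sketch of what replicating KV heads could look like (replicate_kv_heads is a hypothetical helper; it assumes the usual GQA projection layout of [num_kv_heads * head_dim, hidden_size]):

```python
import torch

def replicate_kv_heads(kv_proj_weight, num_kv_heads, head_dim, repeats):
    # kv_proj_weight: [num_kv_heads * head_dim, hidden_size]
    hidden_size = kv_proj_weight.shape[1]
    w = kv_proj_weight.view(num_kv_heads, head_dim, hidden_size)
    # Duplicate each KV head `repeats` times so the new head count
    # (num_kv_heads * repeats) is divisible by the tensor-parallel degree.
    w = w.repeat_interleave(repeats, dim=0)
    return w.reshape(num_kv_heads * repeats * head_dim, hidden_size)

# e.g. Qwen2.5-3B (2 KV heads, head_dim 128): repeats=2 gives 4 heads for TP=4
```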
This class ((de_parallel(model)).eval()) includes a torch.distributed group as a member, and deepcopy cannot copy that object. Maybe you can either avoid using deepcopy or exclude this variable from being deepcopied.
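A minimal sketch of the second option, excluding the process group from deepcopy by overriding __deepcopy__ (the wrapper class and attribute name here are hypothetical):

```python
import copy

class ModelWrapper:
    """Hypothetical wrapper that holds a non-copyable process group."""
    def __init__(self, model, process_group):
        self.model = model
        self.process_group = process_group  # torch.distributed group; not deep-copyable

    def __deepcopy__(self, memo):
        new = self.__class__.__new__(self.__class__)
        memo[id(self)] = new
        for k, v in self.__dict__.items():
            if k == "process_group":
                new.__dict__[k] = v  # share the group instead of copying it
            else:
                new.__dict__[k] = copy.deepcopy(v, memo)
        return new
```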
I have also encountered this problem: "In the two configurations below, the optimizer selection logic opts for the 'BF16_Optimizer' in both cases. Despite the differing 'config' settings, it seems...
Same issue in ZeRO-3 training; it was likely related to https://github.com/microsoft/DeepSpeed/pull/6675
Hi @tjruwase, I think the call to no_sync does not originate from the client code. As described in https://github.com/huggingface/transformers/issues/34984, it seems that no_sync is forced to be called...
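For context, a minimal sketch of the standard DDP gradient-accumulation pattern in which no_sync is invoked (accumulation_context is a hypothetical helper; note that DeepSpeed engines handle gradient synchronization themselves, which is why a forced no_sync from the Trainer is problematic here):

```python
import contextlib

def accumulation_context(model, step, accumulation_steps):
    # Skip the gradient all-reduce on non-boundary micro-steps; model.no_sync()
    # is the DistributedDataParallel context manager for exactly this purpose.
    if (step + 1) % accumulation_steps != 0:
        return model.no_sync()
    return contextlib.nullcontext()
```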
I think this issue can be fixed by picking up https://github.com/huggingface/transformers/pull/35157