inkcherry
Hi @hijkzzz, are you referring to the saved weights being identical between the following two cases? 1. Training with AutoTP and saving via deepspeed.zero.GatheredParameters; 2. Training...
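For reference, a minimal sketch of the GatheredParameters saving pattern mentioned above (save_full_weights and the output path are hypothetical names; this assumes a ZeRO-3-partitioned model inside an initialized distributed run):

```python
import torch
import deepspeed

def save_full_weights(model, path="model.pt"):
    """Gather ZeRO-partitioned parameters and save the full weights on rank 0."""
    state_dict = {}
    for name, param in model.named_parameters():
        # GatheredParameters temporarily reconstructs the full parameter from
        # its partitions; modifier_rank=None means read-only access on all ranks.
        with deepspeed.zero.GatheredParameters(param, modifier_rank=None):
            if torch.distributed.get_rank() == 0:
                state_dict[name] = param.detach().cpu().clone()
    if torch.distributed.get_rank() == 0:
        torch.save(state_dict, path)
```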
In Qwen2.5-3B, the KV-head count is 2. If you want to set kv_head > 2, you would need to replicate the KV heads, and if you only run inference, the...
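A minimal sketch of what replicating KV heads could look like (replicate_kv_heads is a hypothetical helper; it assumes the usual GQA projection layout of [num_kv_heads * head_dim, hidden_size]):

```python
import torch

def replicate_kv_heads(kv_proj_weight, num_kv_heads, head_dim, repeats):
    # kv_proj_weight: [num_kv_heads * head_dim, hidden_size]
    hidden_size = kv_proj_weight.shape[1]
    w = kv_proj_weight.view(num_kv_heads, head_dim, hidden_size)
    # Duplicate each KV head `repeats` times so the new head count
    # (num_kv_heads * repeats) is divisible by the tensor-parallel degree.
    w = w.repeat_interleave(repeats, dim=0)
    return w.reshape(num_kv_heads * repeats * head_dim, hidden_size)

# e.g. Qwen2.5-3B (2 KV heads, head_dim 128): repeats=2 gives 4 heads for TP=4
```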
This class ((de_parallel(model)).eval()) includes a torch.distributed group as a member, and deepcopy cannot copy that object. Maybe you can either avoid using deepcopy or exclude this variable from being deepcopied.
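A minimal sketch of the second option, excluding the process group from deepcopy by overriding __deepcopy__ (the wrapper class and attribute name here are hypothetical):

```python
import copy

class ModelWrapper:
    """Hypothetical wrapper that holds a non-copyable process group."""
    def __init__(self, model, process_group):
        self.model = model
        self.process_group = process_group  # torch.distributed group; not deep-copyable

    def __deepcopy__(self, memo):
        new = self.__class__.__new__(self.__class__)
        memo[id(self)] = new
        for k, v in self.__dict__.items():
            if k == "process_group":
                new.__dict__[k] = v  # share the group instead of copying it
            else:
                new.__dict__[k] = copy.deepcopy(v, memo)
        return new
```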
I have also encountered this problem: "In the two configurations below, the optimizer selection logic opts for the 'BF16_Optimizer' in both cases. Despite the differing 'config' settings, it seems...
Same issue in ZeRO-3 training; it was likely related to https://github.com/microsoft/DeepSpeed/pull/6675
Hi @tjruwase, I think the call to no_sync does not originate from the client code. As described in https://github.com/huggingface/transformers/issues/34984, it seems that no_sync is forced to be called...
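For context, a minimal sketch of the standard DDP gradient-accumulation pattern in which no_sync is invoked (accumulation_context is a hypothetical helper; note that DeepSpeed engines handle gradient synchronization themselves, which is why a forced no_sync from the Trainer is problematic here):

```python
import contextlib

def accumulation_context(model, step, accumulation_steps):
    # Skip the gradient all-reduce on non-boundary micro-steps; model.no_sync()
    # is the DistributedDataParallel context manager for exactly this purpose.
    if (step + 1) % accumulation_steps != 0:
        return model.no_sync()
    return contextlib.nullcontext()
```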
I think this issue can be fixed by picking up https://github.com/huggingface/transformers/pull/35157