Stas Bekman
I think `state_dict` should be re-cloned right after this line: https://github.com/huggingface/transformers/blob/84a6570e7bce91ba7d18c0782186241c5f1fde75/src/transformers/trainer.py#L2872 Please check whether I got to the right code branch - I'm doing this by reading the code, so...
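For readers following along, here is a minimal sketch of what "re-cloning" the `state_dict` means at that point (the helper name and placement are mine, not the actual Trainer code): detach and clone every tensor so the saved copy no longer aliases the buffers the sharded engine keeps mutating in place.

```python
import torch

def _clone_state_dict(state_dict):
    # Detach + clone each tensor so later in-place updates to the live
    # (possibly flat/sharded) parameter buffers can't corrupt what we save.
    return {
        k: v.detach().clone() if isinstance(v, torch.Tensor) else v
        for k, v in state_dict.items()
    }
```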
Excellent, but we can't do that in `save_pretrained` since we don't want everybody paying a penalty because of a special case. So let's go up the call stack and find...
Excellent. That is the right place, @ArvinZhuang. But since the issue comes from DeepSpeed, let's see if perhaps the cause can be removed there in the first place, since if...
Please note the discussion continues here: https://github.com/microsoft/DeepSpeed/issues/3303#issuecomment-1516798523 We understand the cause of the problem well - it's explained at https://github.com/microsoft/DeepSpeed/issues/3303#issuecomment-1516801635 This impacts only z1/z2 models that are sharded. Apparently, FSDP has...
I was just relaying a report from someone else who hit the same problem with FSDP. Perhaps it depends on circumstances. But it doesn't matter who else has this problem. This...
Please reread the comment you quoted - it says `clone` and then optionally move to CPU. Your code is missing the key operation.
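To make the ordering concrete, here is a hypothetical before/after (not the actual code under review, and it assumes all values are tensors): `clone()` is the key operation, the CPU move is optional.

```python
# Missing the key operation the comment refers to:
saved = {k: v.cpu() for k, v in state_dict.items()}

# As described in the quoted comment: clone first (required),
# then optionally move to CPU to free GPU memory.
saved = {k: v.detach().clone().cpu() for k, v in state_dict.items()}
```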
I suspect the problem comes from `enable_full_determinism` doing this: https://github.com/huggingface/transformers/blob/6587125c0a60f5d5cc207fe1e7fc30d5a0c44a6a/src/transformers/trainer_utils.py#L71 This setting leads to hangs with torch>=1.13, and it's still broken in the current torch==2.0 (and in the nightly too). See https://github.com/NVIDIA/nccl/issues/750...
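A quick way to check whether this is the culprit is an A/B run toggling that code path off. A minimal sketch, assuming the `full_determinism` flag in `TrainingArguments` is what gates `enable_full_determinism` (the `output_dir` and seed are placeholders):

```python
from transformers import TrainingArguments

# Run that reportedly hangs: full determinism enabled
args_repro = TrainingArguments(output_dir="tmp", full_determinism=True, seed=42)

# Control run: plain seeding only, enable_full_determinism is not called
args_control = TrainingArguments(output_dir="tmp", full_determinism=False, seed=42)
```

If the hang disappears in the control run, that points at the setting above rather than at DeepSpeed itself.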
Indeed, it took too long to make it into the 2.0.1 cut-off. It should be part of the nightly build soon: https://pytorch.org/get-started/locally/
Hi @Raibows, you're giving me no reproduction, so there is nothing I can do here as I have no idea what you did. There is no need for a tag; DeepSpeed's...
I totally believe you that this is the case. But I don't have access to your computer, so if there is a bug I need to be able to reproduce...