Wang, Yi

Results 69 comments of Wang, Yi

Found that DeepSpeed ZeRO-3 + prompt tuning hangs after saving a checkpoint. Fixed by https://github.com/huggingface/transformers/pull/29980. I will port a similar change to the Optimum Habana trainer.

@regisss `bool` arguments do not work with argparse, see https://stackoverflow.com/questions/15008758/parsing-boolean-values-with-argparse. I picked one of the solutions there to fix it.
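
For reference, a minimal sketch of the kind of workaround discussed in that thread (the `str2bool` helper name and the `--use_cache` flag are illustrative, not the exact code in the fix):

```python
import argparse

def str2bool(value):
    """Convert common string spellings of a boolean into a real bool."""
    if isinstance(value, bool):
        return value
    if value.lower() in ("yes", "true", "t", "y", "1"):
        return True
    if value.lower() in ("no", "false", "f", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"Boolean value expected, got {value!r}")

parser = argparse.ArgumentParser()
# `type=bool` would treat any non-empty string (even "False") as True,
# so use an explicit converter instead.
parser.add_argument("--use_cache", type=str2bool, default=True)
args = parser.parse_args(["--use_cache", "False"])
print(args.use_cache)  # False
```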

The generated wav file doesn't sound right. This issue is fixed by https://github.com/huggingface/optimum-habana/pull/1034.

> Prompt tuning runs a prompt_encoder forward pass inside the model's save_pretrained. Having only rank 0 call the prompt_encoder forward is not enough under DeepSpeed ZeRO-3; all ranks should call the prompt_encoder forward. thanks for...
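
A minimal sketch of why all ranks must participate, assuming a standalone `prompt_encoder` module whose parameters are partitioned by DeepSpeed ZeRO-3 (the helper name and file path are illustrative, not the code in the PR):

```python
import torch
import torch.distributed as dist
import deepspeed

def export_prompt_embeddings(prompt_encoder, prompt_tokens, path="prompt_embeddings.pt"):
    # Under ZeRO-3 the prompt_encoder weights are sharded across all ranks, so the
    # forward pass (which first gathers the full parameters) is a collective operation.
    # If only rank 0 enters it, the other ranks never join the gather and everyone hangs.
    with deepspeed.zero.GatheredParameters(list(prompt_encoder.parameters())):
        embeddings = prompt_encoder(prompt_tokens)
    # Only rank 0 actually needs to write the result to disk.
    if dist.get_rank() == 0:
        torch.save(embeddings.cpu(), path)
```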

> Again though: this is deepspeed specific based on model sharding/splitting, so we should only modify the code for deepspeed specifically, unless there is a reason not to. (Aka, deepspeed...

@muellerzr I have updated the PR, could you revisit it?

> Thanks for adding this, but having this fix indicates that there's likely something wrong in how we control our saving logic more generally. > > Having to have lots...

> @sywangyi Thanks for the explanation. I understand the intended logic. My previous comment still stands: we shouldn't need to condition so much of the `_save` logic on the `should_save`...

Also, in `save_pretrained` (https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_utils.py#L2503 or https://github.com/huggingface/peft/blob/main/src/peft/peft_model.py#L294), `is_main_process` is set to `should_save`, and `should_save` is controlled by https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py#L2322; it is not the same as `process_index == 0`. If you look at the `_save_tpu` logic, it's what I would...
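
To make the distinction concrete, a simplified sketch (not the exact transformers code) of the condition that `save_pretrained` receives from the trainer:

```python
from transformers import Trainer

def save_model(trainer: Trainer, output_dir: str):
    """Simplified view of who is allowed to write checkpoint files."""
    # `should_save` is a TrainingArguments property that folds in the save strategy
    # (e.g. save_on_each_node, local vs. global main process), so it is not simply
    # a raw `trainer.args.process_index == 0` check.
    trainer.model.save_pretrained(
        output_dir,
        is_main_process=trainer.args.should_save,
    )
```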