
Fine-tuning 20B model doesn't seem to work

Open abar-75 opened this issue 2 years ago • 17 comments

Hi,

I'm trying to fine-tune the 20B model. I tried the current version of the code as well as this one. I am using Docker, and I have tried several images from the past year (from the most recent ones back to the one labeled as "release"). I tried both the slim and full weights.

I tried nodes with 8 and 16 A100-40GB GPUs, so I don't think it is a memory issue. I am using the 20B.yml config file and adding:

    {
      "finetune": true,
      "no_load_optim": true,
      "no_load_rng": true,
      "iteration": 0
    }

With the newer Docker images, I get an error that says "Empty ds_version in checkpoint". I guess this is related to this issue.

However, when I use the older Docker images (with both the new and legacy versions of the code), I get an error that says AttributeError: 'NoneType' object has no attribute 'dp_process_group'. I guess this is related to this issue. As someone said at the time, "this is an error with deepspeed trying to load zero optimizer states if you specify one in your config, even if we set load_optim to false." Setting the ZeRO stage to 0, the model loads but crashes later (similar to this issue).

Do you have an idea? Thank you!

abar-75 avatar Jan 10 '23 01:01 abar-75
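For anyone debugging the first error, a purely diagnostic sketch for checking what version metadata a checkpoint shard actually carries. The shard path is a placeholder for one of the 20B checkpoint files, and the ds_version key is inferred from the error message rather than confirmed in this thread:

    import torch

    # Load one checkpoint shard on CPU; the path is a placeholder.
    sd = torch.load(
        "checkpoints/global_step150000/mp_rank_00_model_states.pt",
        map_location="cpu",
    )

    # An empty or missing value here would be consistent with the
    # "Empty ds_version in checkpoint" error (assuming the shard is a plain dict).
    print(sd.get("ds_version"))

    # Inspect what top-level metadata the shard actually stores.
    print(sorted(sd.keys()))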

Hey, have you found any progress with this?

FayZ676 avatar Jan 13 '23 21:01 FayZ676

No, still having the issue

abar-75 avatar Jan 14 '23 04:01 abar-75

I'm sorry you've been having trouble with this. We are aware of the issue but do not have the personnel to prioritize a patch right now. In the meantime, I recommend using the HuggingFace transformers library for fine-tuning the model.

If you are interested in developing and contributing a patch, we would be ecstatic to merge it into main to prevent others from struggling with this.

StellaAthena avatar Jan 15 '23 23:01 StellaAthena
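A minimal sketch of that suggestion: fine-tuning GPT-NeoX-20B through the HuggingFace transformers Trainer. The dataset, output path, and hyperparameters below are placeholders rather than values from this thread, and a model this size will still need something like DeepSpeed ZeRO behind the Trainer to fit in memory:

    from transformers import (
        AutoTokenizer,
        GPTNeoXForCausalLM,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
    model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b")

    args = TrainingArguments(
        output_dir="neox-20b-finetuned",   # placeholder path
        per_device_train_batch_size=1,     # 20B parameters force tiny micro-batches
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        num_train_epochs=1,
    )

    # train_dataset is a placeholder for your own tokenized dataset.
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    trainer.train()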

@StellaAthena I've also been experiencing problems trying to fine-tune the 20B model; I'll try using HF. Thanks

FayZ676 avatar Jan 16 '23 20:01 FayZ676

Did anyone get past this? Maybe there is a memory leak somewhere; even with flash attention, I am getting OOM. This seems abnormal.

kyleliang919 avatar Jan 19 '23 01:01 kyleliang919

@abar-75 can you provide a more thorough stack trace for where the "Empty ds_version in checkpoint" error is coming from? I cannot reproduce it. When you say the "current version" of the code, are you referring to the main branch as it exists right now (or, rather, as it existed when you opened this issue)? I can reproduce issue #732 that you linked to, but only with the deepspeed_main branch that is part of PR #663 and hasn't been finalized. Is that what you were using?

dashstander avatar Jan 21 '23 00:01 dashstander

I did some initial probing and found that the Megatron attention somehow has a huge memory footprint. Some of the extra and seemingly unnecessary shape transforms could be causing this, though I am not completely sure which one.

kyleliang919 avatar Jan 21 '23 00:01 kyleliang919

@kyleliang919 do you think that is related to the issue described at the top of this thread by @abar-75? I don't see the connection.

dashstander avatar Jan 21 '23 01:01 dashstander

oh sorry, I think I commented on the wrong issue. Please ignore my comments.

kyleliang919 avatar Jan 21 '23 01:01 kyleliang919

@abar-75, is there a reason you can't load the model with ZeRO stage 1?

Quentin-Anthony avatar Jan 24 '23 00:01 Quentin-Anthony
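For context, the relevant section of the NeoX config would look roughly like this with stage 1. The key names follow DeepSpeed's zero_optimization schema; the options besides stage are illustrative, not values from this thread:

    "zero_optimization": {
      "stage": 1,
      "allgather_partitions": true,
      "reduce_scatter": true,
      "overlap_comm": true,
      "contiguous_gradients": true
    }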

I believe the problem is that the model's modules are all frozen and have requires_grad set to False. You can verify this with:

# Print every parameter name and whether it will receive gradients;
# all False means the model is fully frozen.
for name, param in model.named_parameters(recurse=True):
    print(f"{name}: {param.requires_grad}")

winglian avatar Apr 20 '23 16:04 winglian
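If that loop prints False everywhere, a minimal follow-up sketch is to re-enable gradients before training; this only undoes the freeze and is not a confirmed root-cause fix from this thread:

    # Re-enable gradient computation for every parameter before fine-tuning.
    for param in model.parameters():
        param.requires_grad = True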

@abar-75 can you try out the code above?

StellaAthena avatar Apr 23 '23 15:04 StellaAthena

@StellaAthena I found the same issue too; there is no problem with frozen model parameters in my case. Can I fix this by converting the NeoX model to an HF model and then converting the HF model back to a NeoX checkpoint?

taegyeongeo avatar Apr 26 '23 22:04 taegyeongeo
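One way to sanity-check the HF half of such a round trip before converting back is to load the converted checkpoint and confirm it still generates sensible text. The converted-checkpoint path is a placeholder, and the NeoX-to-HF conversion step itself (a script in the gpt-neox repo's tools directory) is assumed to have already run:

    import torch
    from transformers import AutoTokenizer, GPTNeoXForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

    # Placeholder path to the checkpoint produced by the NeoX -> HF conversion.
    model = GPTNeoXForCausalLM.from_pretrained("path/to/converted-hf-checkpoint")

    inputs = tokenizer("EleutherAI's GPT-NeoX-20B is", return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=20)

    # Garbage output here would point at the conversion, not the fine-tuning.
    print(tokenizer.decode(out[0]))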

@taegyeongeo this thread has a couple issues mentioned. Which one are you experiencing?

StellaAthena avatar Apr 27 '23 18:04 StellaAthena

@StellaAthena The fine-tuning one. I have now solved this problem on a 6B model, and I want to try to contribute a fix, but I don't have enough resources for training models.

So, could I get a compute instance for testing the code?

taegyeongeo avatar Apr 28 '23 10:04 taegyeongeo

@taegyeongeo what ended up being the solution to get the 6B model trainable?

winglian avatar May 05 '23 17:05 winglian

Have you solved this issue? I met the same problem.

WaveLi123 avatar May 09 '23 15:05 WaveLi123