
FSDP Finetuned Model-optimizer and tokenizer

Open waterluck opened this issue 9 months ago • 3 comments

Thanks for the tutorials! I have several small questions about model fine-tuning and usage.

When doing full-parameter fine-tuning using FSDP only:

Q1: Should save_optimizer be set to True or not? I first set it to True and found the checkpoints become very large: fine-tuning on 10K PAWS-X samples produced __0_0.distcp ~ __3_0.distcp at 9.4GB each, plus 2 extra optimizer-xxx.pt files like optimizer-llama-2-7b-0.pt at 25GB each. When I set it to False, I got four __X_0.distcp files at 3.14GB each. I'm unsure whether it's normal for them to be that large, and whether save_optimizer is necessary.
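
(For reference, my rough mental model of what gets written in each case, sketched with torch.distributed.checkpoint rather than the actual llama-recipes checkpoint handler; model, optimizer and rank here stand in for the objects from the training loop.)

```python
import torch
import torch.distributed.checkpoint as dist_cp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType


def save_fsdp_checkpoint(model, optimizer, rank, save_dir, save_optimizer=False):
    """Rough sketch of an FSDP sharded save, not the actual llama-recipes code."""
    # Each rank writes its own shard of the model; FileSystemWriter produces
    # the __<rank>_0.distcp files mentioned above.
    with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
        dist_cp.save_state_dict(
            state_dict={"model": model.state_dict()},
            storage_writer=dist_cp.FileSystemWriter(f"{save_dir}/model"),
        )

    if save_optimizer:
        # AdamW keeps extra moment tensors per parameter, so the optimizer
        # state is several times the size of the half-precision weights on disk.
        optim_state = FSDP.full_optim_state_dict(model, optimizer)  # gathered on rank 0
        if rank == 0:
            torch.save(optim_state, f"{save_dir}/optimizer.pt")
```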

Q2: Do the llama2-xB-hf and llama2-xB-hf-chat series models use the same tokenizer? There is no tokenizer.model file in the fine-tuned model output, and I noticed that the tokenizer files of these two models have the same size in the official repository. I want to know whether their tokenizers are identical, especially tokenizer.model. Also, can we use the fast tokenizer with llama2?

Q3: When doing SFT on llama for a classification task with a single target label, is there any downside to not training on the input, i.e. setting the input tokens' labels to -100?
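
(For reference, roughly the masking I mean; a minimal sketch rather than the exact llama-recipes dataset code, with a made-up prompt/target pair.)

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Hypothetical classification-style sample.
prompt = "Are these two sentences paraphrases? Sentence 1: ... Sentence 2: ... Answer:"
target = " yes"

prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
target_ids = tokenizer.encode(target, add_special_tokens=False) + [tokenizer.eos_token_id]

input_ids = prompt_ids + target_ids
# With training on the input disabled, the prompt positions are labeled -100,
# the default ignore_index of torch.nn.CrossEntropyLoss, so only the target
# label token(s) contribute to the loss.
labels = [-100] * len(prompt_ids) + target_ids
```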

Thanks for taking a look at these questions.

waterluck avatar Apr 30 '24 19:04 waterluck

Hi @waterluck

Q1: What looks a bit odd to me is that the __0_X.distcp files get bigger when you store the optimizer as well; I will need to look into whether that is correct or an error. Whether saving the optimizer is necessary depends on your use case. If you want to continue training from that point, the optimizer state can be useful because you are not doing a cold start: it contains the current optimization direction on the surface that's being optimized. If you are not planning to continue training, you can skip saving it.

Q2: The tokenizers for these models are equivalent, as the chat variant is a fine-tuned version of the base model. If you use AutoTokenizer, it will automatically select the fast tokenizer if one is available.
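
A quick way to convince yourself (assuming the usual Hugging Face hub ids for the two models):

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
chat = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

print(type(base).__name__)                    # LlamaTokenizerFast -> fast tokenizer was picked
print(base.get_vocab() == chat.get_vocab())   # True: same vocabulary
print(base.special_tokens_map == chat.special_tokens_map)
```

Since they match, loading the tokenizer from the original base model folder alongside your fine-tuned weights should work.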

Hope that helps.

mreso avatar May 02 '24 23:05 mreso

Hi @mreso, thanks for the confirmation! Also, regarding the whole fine-tuning process: I noticed that when I run it several times with exactly the same parameter settings, the loss at each epoch differs a lot. I checked that all the parameters are the same, and I didn't change the random seed (which I think is fixed to 42). Is this expected, or are there other steps in the code that can introduce randomness?

[attached screenshot: per-epoch training loss across repeated runs]

waterluck avatar May 03 '24 07:05 waterluck

Some ops use non-deterministic algorithms, so some fluctuation is expected. See https://pytorch.org/docs/stable/notes/randomness.html for how to disable non-deterministic behavior, but beware that this will have an impact on your training performance.
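
Something along these lines (an untested sketch following the linked notes) turns on deterministic mode:

```python
import os
import torch

# Must be set before the first CUDA call for deterministic cuBLAS GEMMs.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.manual_seed(42)
torch.use_deterministic_algorithms(True)  # ops without a deterministic implementation will raise
torch.backends.cudnn.benchmark = False
```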

mreso avatar May 03 '24 19:05 mreso

Great! Thanks for your answer, it helps a lot.

waterluck avatar May 04 '24 06:05 waterluck