FSDP Finetuned Model-optimizer and tokenizer
Thanks for the tutorials! I have a few small questions about model fine-tuning and usage.
When doing full-parameter fine-tuning using FSDP only:
Q1: Should we set `save_optimizer` to True or not?
I first set it to True and found the checkpoints become very large. Fine-tuning on 10K PAWS-X data samples, I got `__0_0.distcp` ~ `__3_0.distcp` with each file 9.4 GB, plus 2 extra optimizer `.pt` files like `optimizer-llama-2-7b-0.pt` at 25 GB each.
When I set it to False, I got four `.distcp` files of 3.14 GB each.
I'm unsure whether it's normal for them to be that large, and whether `save_optimizer` is necessary.
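As a rough sanity check on those sizes, a back-of-the-envelope calculation (assuming roughly 6.7B parameters for Llama-2-7B, bf16 weights, AdamW keeping two fp32 moment tensors per parameter, and 4 GPUs; the actual checkpoint layout may differ) lands in the same ballpark:

```python
# Rough size estimate; all figures below are assumptions, not what the
# checkpointing code actually writes.
params = 6.7e9                              # ~Llama-2-7B parameter count
weights_per_rank = params * 2 / 4           # bf16 weights sharded over 4 ranks
optimizer_state_total = params * (4 + 4)    # AdamW exp_avg + exp_avg_sq in fp32

print(f"weights per rank:      ~{weights_per_rank / 1e9:.2f} GB")       # ~3.35 GB, close to the 3.14 GB shards
print(f"total optimizer state: ~{optimizer_state_total / 1e9:.1f} GB")  # ~54 GB, same order as the optimizer .pt files
```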
Q2: Do the `llama2-xB-hf` and `llama2-xB-hf-chat` series models use the same tokenizer?
There is no `tokenizer.model` file in the fine-tuned model output, and I noticed that the tokenizer files of these two models appear to have the same size in the official repository. I want to know whether their tokenizers are identical, especially the `tokenizer.model` shipped with the model files.
Also, can we use the fast tokenizer with Llama 2?
Q3: When doing SFT on Llama for a classification task with a single target label, does it make any difference not to train on the input, i.e. setting the input token labels to `-100`?
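For reference, a minimal sketch of what "not training on the input" means, i.e. masking the prompt tokens with `-100` so the loss is only computed on the label (the prompt/label split here is hypothetical; the actual dataset code in llama-recipes may build the labels differently):

```python
from transformers import AutoTokenizer

# Assumes access to the gated meta-llama repo; any Llama-2 tokenizer works the same way.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

prompt = "Classify the sentence pair as paraphrase or not.\nSentence pair: ...\nLabel:"
target = " paraphrase"

prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
target_ids = tokenizer(target, add_special_tokens=False).input_ids

input_ids = prompt_ids + target_ids
# Positions set to -100 are ignored by PyTorch's cross-entropy loss,
# so only the label tokens contribute to the training loss.
labels = [-100] * len(prompt_ids) + target_ids
```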
Thanks if you can take a look at these questions.
Hi @waterluck
Q1: What looks a bit weird to me is that the `__0_X.distcp` files get bigger when you store the optimizer as well. I will need to look into this to confirm whether that is correct or an error. Whether saving the optimizer is necessary depends on your use case: if you want to continue training from that checkpoint, the optimizer state is useful because you are not doing a cold start; it contains the current optimization direction on the surface that is being optimized. If you are not planning to continue training, you can skip saving it.

Q2: The tokenizers for these models are equivalent, as the chat variant is a fine-tuned version of the base model. If you use AutoTokenizer, it will automatically select the fast tokenizer if one is available.
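If you want to double-check the tokenizer equivalence yourself, a quick sketch (the model IDs assume you have access to the gated meta-llama repositories):

```python
from transformers import AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", use_fast=True)
chat_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", use_fast=True)

sample = "Fine-tuning Llama 2 on PAWS-X"
print(base_tok.get_vocab() == chat_tok.get_vocab())              # vocabularies match
print(base_tok(sample).input_ids == chat_tok(sample).input_ids)  # identical tokenization
print(type(base_tok).__name__)                                   # e.g. LlamaTokenizerFast
```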
Hope that helps.
Hi @mreso, thanks for the confirmation! Also, regarding the whole fine-tuning process: I noticed that when I run it several times with exactly the same parameter settings, the loss at each epoch differs a lot. I checked that all the parameters are the same, and I didn't change the random seed (which I think is fixed to 42). Is this expected, or are there other steps in the code that can introduce randomness?
Some ops use non-deterministic algorithms, so some fluctuation is expected. See https://pytorch.org/docs/stable/notes/randomness.html for how to disable non-deterministic behavior, but beware that this will have an impact on your training performance.
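As a sketch of what pinning down the remaining randomness can look like (the exact knobs are described in the linked notes; `CUBLAS_WORKSPACE_CONFIG` is only required for certain cuBLAS ops, and deterministic mode can slow training or raise an error for ops without a deterministic implementation):

```python
import os
import random

import numpy as np
import torch

def make_deterministic(seed: int = 42) -> None:
    # Seed every RNG that can influence the training run.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Force deterministic kernels where available (may reduce throughput).
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
```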
Great! Thanks for your answer, it helps a lot.