How to train V2?
The train_dolly.py file seems to still be on V1, since DEFAULT_TRAINING_DATASET is set to tatsu-lab/alpaca. What needs to be changed to train V2?
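For reference, I'd guess the change is roughly the following, though the file it lives in and the new dataset ID are my assumptions, not anything confirmed in the repo:

# Guessing at the minimal change (file location assumed, e.g. training/consts.py):
# V1 default
# DEFAULT_TRAINING_DATASET = "tatsu-lab/alpaca"
# V2 would presumably point at the open instruction dataset released with Dolly 2.0
DEFAULT_TRAINING_DATASET = "databricks/databricks-dolly-15k"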
@srowen any update on dolly 2.0 fine tuning?
@matthayes is working on it
@srowen thanks for the reply. Do you think fine-tuning Pythia 12B for text generation and then fine-tuning it again on an instruction dataset would give better domain-specific answers than taking the pretrained Pythia 12B and fine-tuning it only on a custom instruction dataset? What would you suggest for building a domain-specific LLM?
PS - only looking for completely open source solutions.
I would do the latter, no real point in fine tuning twice
@srowen Thank you for the suggestion, but won't it be biased towards the knowledge it already has from pretraining? For example, if someone asks "what is AI?", there is a specific definition it already knows, but the instruction data I am fine-tuning it on has a different definition of AI. What can I expect in the output: the old definition, or the definition from my instruction-tuned custom data?
It will be a mix of the two. I don't think you actually want the model to unlearn everything, even if you want certain facts to take precedence. As far as I know it's a matter of fine-tuning longer on your dataset to push it to answer like your examples.
Got it. My data is very specific to facts and discussions. I didn't want bias from its past knowledge of facts that may be true in general but not factual according to my analysis, so I thought of fine-tuning the underlying model as well. But I got your point. Looking forward to the Dolly 2.0 fine-tuning code. 😊 Thanks a lot for the help.
Thanks for your patience. V2 training code was added in #88.
If you have a very-out-of-domain internal corpus, I'd think you'd want to first tune Pythia w/ your docs and _then_ instruction-tune w/ the dolly dataset.
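Roughly, stage 1 would just be ordinary causal-LM fine-tuning on your internal docs, and stage 2 is the usual instruction tuning. A minimal sketch of stage 1 with Hugging Face Transformers, assuming a plain-text corpus file; the paths, hyperparameters, and single-process Trainer setup below are placeholders, not the dolly repo's code (in practice you'd shard a 12B model with deepspeed, as the repo does):

# Minimal sketch: stage 1 = continued causal-LM pretraining of Pythia on a domain corpus.
# Everything below (paths, hyperparameters) is illustrative.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "EleutherAI/pythia-12b"  # same base model the dolly-v2-12b checkpoint starts from
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # Pythia's tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base)

# Load a plain-text internal corpus (hypothetical file) and tokenize it.
corpus = load_dataset("text", data_files={"train": "internal_docs.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="pythia-12b-domain",
        num_train_epochs=1,
        per_device_train_batch_size=1,
        learning_rate=1e-5,
    ),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("pythia-12b-domain")

# Stage 2: pass this checkpoint as --input-model to the dolly trainer and
# instruction-tune on databricks-dolly-15k (or your own instruction data).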
what did you end up doing?
I have a similar use case. How can I fine-tune Dolly v2 on a custom domain instruction-following dataset?
Can we fine-tune/train on a custom dataset like this?
deepspeed \
--module training.trainer \
--deepspeed $deepspeed_config \
--training-dataset $PATH_TO_CUSTOM_DOMAIN_SPECIFIC_DATA \
--epochs 1 \
--local-output-dir $local_output_dir \
--dbfs-output-dir "" \
--per-device-train-batch-size 3 \
--per-device-eval-batch-size 3 \
--lr 1e-5 \
--warmup-steps 50 \
--input-model $input_model \
--logging-steps 1000 \
--test-size 1000
Would pointing --training-dataset $PATH_TO_CUSTOM_DOMAIN_SPECIFIC_DATA at the new dataset like this work (and --epochs as per choice)?
Right now you have to modify the code to set a different path to the training data, but yes.
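In case it helps, a rough sketch of what that modification could look like, assuming your custom data is a local JSON-lines file with the same instruction/context/response fields as databricks-dolly-15k; the function name and path here are illustrative, not the repo's actual code:

# Illustrative only: load a local JSONL file instead of the default Hugging Face dataset.
from datasets import load_dataset

def load_custom_training_dataset(path="/dbfs/dolly/custom_instructions.jsonl"):
    # the "json" builder also handles JSON-lines files with one record per line
    dataset = load_dataset("json", data_files=path)["train"]
    # sanity-check that the columns the prompt formatting expects are present
    missing = {"instruction", "context", "response"} - set(dataset.column_names)
    if missing:
        raise ValueError(f"dataset is missing columns: {missing}")
    return dataset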