torchtune
Can I fine-tune "dolphin-2.2.1-mistral-7b.Q2_K.gguf" with torchtune using a CPU?
I'm trying to fine-tune dolphin-2.2.1-mistral-7b.Q2_K.gguf on my own dataset, "fichier.txt". Can I do it, and how? I also find torchtune a bit confusing: can someone explain how to use it and where I should put my "fichier.txt"?
@walidbet18 thanks for opening this issue! Unfortunately, we don't currently support fine-tuning GGUF models. Is the model you mentioned available in native PyTorch format? I'm guessing this is a specific fine-tune of Mistral 7B that you want to tune further?
For using local files, I believe @RdoubleA has some documentation he'll share shortly.
@kartikayk thanks for the quick response! I'm currently looking for a way to convert the model from .gguf to a native PyTorch format, and yes, I'm trying to use Mistral 7B. Could you explain how to use torchtune for fine-tuning: which configs I have to change, how I load my dataset (which is in .txt format), and how I launch the model after fine-tuning? I followed the tutorial, but I just want to make sure that what I understood is the right way to do it.
Sounds good! Just to confirm: you're looking to fine-tune the Mistral 7B model, not specifically dolphin-2.2.1-mistral-7b.Q2_K.gguf, is that right? Or do you care about dolphin-2.2.1-mistral-7b.Q2_K.gguf specifically?
I want to try with Mistral 7B and see if it works well, then I'll figure it out with dolphin-2.2.1-mistral-7b.Q2_K.gguf.
Ah ok, then you should be able to use the commands from the README and the tutorials directly; just replace the config with one of the configs in the mistral folder. So something like:
tune run lora_finetune_single_device --config mistral/7B_lora_single_device
or
tune run lora_finetune_single_device --config mistral/7B_qlora_single_device
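(Both commands assume the Mistral weights have already been downloaded locally, roughly along the lines of the download step in the README; the output path and token below are placeholders you'd adjust.)

tune download mistralai/Mistral-7B-v0.1 --output-dir /tmp/Mistral-7B-v0.1 --hf-token <HF_TOKEN>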
@RdoubleA can share instructions on how to use a custom dataset.
Hi @walidbet18, you can load your local text file dataset by specifying source="text" and data_files="fichier.txt" in any of our dataset classes or builders.
# if you're using chat data
chat_dataset(source="text", data_files="fichier.txt", ...)
We use Hugging Face's load_dataset utility, so any keyword arguments you want to use for load_dataset can also be passed into any of our builders. So anything from their docs on text datasets, for example, can apply.
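As a quick sanity check, you can load the file with that same Hugging Face utility directly and inspect a row before wiring it into a config (the path is just your example file):

# Verify the file loads the way you expect; this is what torchtune calls
# under the hood when you pass source="text"
from datasets import load_dataset

ds = load_dataset("text", data_files="fichier.txt")
print(ds["train"][0])  # e.g. {'text': '<first line of fichier.txt>'}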
Thank you for the answers. I have another question, sorry for bothering you: I still don't get how to load my own dataset. For example, I have one file, "file.txt". Can I get a step-by-step explanation of how to load my own dataset (e.g. do I have to put it in the YAML config file)?
+1, I am still not finding any documentation for loading a local .csv / .txt dataset.
batch_size: 4
checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Mistral-7b-v0.1
  checkpoint_files:
    - pytorch_model-00001-of-00002.bin
    - pytorch_model-00002-of-00002.bin
  model_type: MISTRAL
  output_dir: /tmp/Mistral-7b-v0.1
  recipe_checkpoint: null
compile: false
dataset:
  _component_: torchtune.datasets.instruct_dataset
  train_on_input: true
device: cpu
dtype: bf16
enable_activation_checkpointing: true
epochs: 3
gradient_accumulation_steps: 4
log_every_n_steps: null
loss:
  _component_: torch.nn.CrossEntropyLoss
lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 100
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: /tmp/Mistral-7b-v0.1
model:
  _component_: torchtune.models.mistral.lora_mistral_7b
  apply_lora_to_mlp: true
  apply_lora_to_output: true
  lora_alpha: 16
  lora_attn_modules:
    - q_proj
    - k_proj
    - v_proj
  lora_rank: 64
optimizer:
  _component_: torch.optim.AdamW
  lr: 2.0e-05
output_dir: /tmp/Mistral-7b-v0.1
profiler:
  _component_: torchtune.utils.profiler
  enabled: false
  output_dir: /tmp/alpaca-llama2-finetune/torchtune_perf_tracing.json
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.mistral.mistral_tokenizer
  path: /tmp/Mistral-7b-v0.1/tokenizer.model
DEBUG:torchtune.utils.logging:Setting manual seed to local seed 3181919543. Local seed is seed + rank = 3181919543 + 0
Writing logs to /tmp/Mistral-7b-v0.1/log_1713950814.txt
Killed
Ah yes you need to specify it in the config, sorry I wasn't clear earlier. Make sure you setup your dataset component like so:
dataset:
_component_: torchtune.datasets.instruct_dataset
source: text
data_files: fichier.txt
# other params
...
Instruct datasets require columnar data and an InstructTemplate to format the columns into the prompt. @walidbet18 Do you mind sharing what the structure of your data looks like? If it's just unstructured text, then you might need to do a different approach.
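For reference, here is a minimal sketch of what a custom template for columnar data could look like, assuming hypothetical "question"/"answer" columns and that your template follows the same classmethod interface as the built-in templates (e.g. AlpacaInstructTemplate):

from typing import Any, Dict, Mapping, Optional

from torchtune.data import InstructTemplate


class MyQATemplate(InstructTemplate):
    # Hypothetical prompt template for a dataset with "question"/"answer" columns
    template = "Question: {question}\n\nAnswer: "

    @classmethod
    def format(
        cls, sample: Mapping[str, Any], column_map: Optional[Dict[str, str]] = None
    ) -> str:
        # Allow remapping the expected column name to whatever your data uses
        column_map = column_map or {}
        question = sample[column_map.get("question", "question")]
        return cls.template.format(question=question)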
We're aware that documentation on setting up custom datasets with local files is lacking, I'm working on improving this in the next few days and can update this thread once it's ready.
Actually yes, it's just unstructured text. This is why I thought I should be using instruct_dataset.
# Tokenizer
tokenizer:
  _component_: torchtune.models.mistral.mistral_tokenizer
  path: /tmp/Mistral-7B-v0.1/tokenizer.model

# Dataset
dataset:
  _component_: torchtune.datasets.instruct_dataset
  train_on_input: True
  source: text
  data_files: texte_recupere.txt
seed: null
shuffle: True

# Model Arguments
model:
  _component_: torchtune.models.mistral.lora_mistral_7b
  lora_attn_modules: ['q_proj', 'k_proj', 'v_proj']
  apply_lora_to_mlp: True
  apply_lora_to_output: True
  lora_rank: 64
  lora_alpha: 16

checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Mistral-7B-v0.1
  checkpoint_files: [
    pytorch_model-00001-of-00002.bin,
    pytorch_model-00002-of-00002.bin
  ]
  recipe_checkpoint: null
  output_dir: /tmp/Mistral-7B-v0.1
  model_type: MISTRAL
resume_from_checkpoint: False

optimizer:
  _component_: torch.optim.AdamW
  lr: 2e-5

lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 100

loss:
  _component_: torch.nn.CrossEntropyLoss

# Fine-tuning arguments
batch_size: 4
epochs: 3
max_steps_per_epoch: null
gradient_accumulation_steps: 4
compile: False

# Training env
device: cpu

# Memory management
enable_activation_checkpointing: True

# Reduced precision
dtype: bf16

# Logging
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: ${output_dir}
output_dir: /tmp/Mistral-7B-v0.1
log_every_n_steps: null

# Showcase the usage of the PyTorch profiler
# Set enabled to False as it's only needed for debugging training
profiler:
  _component_: torchtune.utils.profiler
  enabled: False
  output_dir: /tmp/alpaca-llama2-finetune/torchtune_perf_tracing.json
Does this seem right to you?
DEBUG:torchtune.utils.logging:Setting manual seed to local seed 3426960755. Local seed is seed + rank = 3426960755 + 0
Writing logs to /tmp/Mistral-7B-v0.1/log_1713966980.txt
Killed
And the "Killed" I'm getting at the end, could it be because of memory?
If you're using unstructured text then you might need to use a different dataset class. I am planning to open a PR soon to add this to enable fine-tuning / continued pre-training on unstructured text data.
As for why the run got killed, we actually don't support training on cpu. You will need to run this recipe on a single GPU, if you're able to access one.
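In the meantime, here is a rough sketch of the "different approach" for unstructured text, i.e. loading the raw file and tokenizing it yourself for continued pre-training. The class and its parameters are purely illustrative, not a torchtune API:

from datasets import load_dataset
from torch.utils.data import Dataset


class RawTextDataset(Dataset):
    """Illustrative wrapper: each line of a local text file becomes one sample."""

    def __init__(self, tokenizer, data_file: str, max_seq_len: int = 512):
        self._data = load_dataset("text", data_files=data_file)["train"]
        self._tokenizer = tokenizer
        self._max_seq_len = max_seq_len

    def __len__(self):
        return len(self._data)

    def __getitem__(self, idx):
        # Assumes the tokenizer exposes encode(text, add_bos, add_eos),
        # as torchtune's SentencePiece-based tokenizers do
        tokens = self._tokenizer.encode(
            self._data[idx]["text"], add_bos=True, add_eos=True
        )[: self._max_seq_len]
        # For plain language modeling, the labels are just the input tokens
        return tokens, list(tokens)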
Oh okay, thanks. Well, the data I'm using is text extracted from an HTML page, so it's unstructured text. Which dataset class do I have to use?
@walidbet18 please feel free to reopen this issue if you still need help! Closing this issue for now.