torchtune
Can I fine-tune "dolphin-2.2.1-mistral-7b.Q2_K.gguf" with torchtune using a CPU?
I'm trying to fine-tune dolphin-2.2.1-mistral-7b.Q2_K.gguf on my own dataset, "fichier.txt". Can I do it, and how? I also find torchtune a bit confusing: can someone explain how to use it and where I should put my "fichier.txt"?
@walidbet18 thanks for opening this issue! Unfortunately, we don't currently support fine-tuning GGUF models. Is the model you mentioned available in native PyTorch format? I'm guessing this is a specific fine-tune of Mistral 7B that you want to tune further?
For using local files, I believe @RdoubleA has some documentation he'll share shortly.
@kartikayk thanks for the quick response! I'm currently looking for a way to convert the model from .gguf to a native PyTorch format, and yes, I'm trying to use Mistral 7B. Could you explain how to use torchtune for fine-tuning: which configs I have to change, how I load my dataset (which is in .txt format), and how I launch the model after fine-tuning? I followed the tutorial, but I just want to make sure that what I understood is the right way to do it.
Sounds good! Just to confirm: you're looking to fine-tune the Mistral 7B model, not specifically dolphin-2.2.1-mistral-7b.Q2_K.gguf, is that right? Or do you care about dolphin-2.2.1-mistral-7b.Q2_K.gguf specifically?
I want to try with Mistral 7B and see if it works well, then I'll figure it out with dolphin-2.2.1-mistral-7b.Q2_K.gguf.
Ah ok, then you should be able to use the commands from the README and the tutorials directly; just replace the config with one of the configs in the mistral folder. So something like:
tune run lora_finetune_single_device --config mistral/7B_lora_single_device
or
tune run lora_finetune_single_device --config mistral/7B_qlora_single_device
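(Both commands assume the Mistral weights have already been downloaded locally, roughly along the lines of the download step in the README; the output path and token below are placeholders you'd adjust.)

tune download mistralai/Mistral-7B-v0.1 --output-dir /tmp/Mistral-7B-v0.1 --hf-token <HF_TOKEN>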
@RdoubleA can share instructions on how to use a custom dataset.
Hi @walidbet18, you can load your local text file dataset by specifying source="text" and data_files="fichier.txt" in any of our dataset classes or builders.
# if you're using chat data
chat_dataset(source="text", data_files="fichier.txt", ...)
We use Hugging Face's load_dataset utility, so any keyword arguments you want to use for load_dataset can also be passed into any of our builders. So anything from their docs on text datasets, for example, can apply.
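As a quick sanity check, you can load the file with that same Hugging Face utility directly and inspect a row before wiring it into a config (the path is just your example file):

# Verify the file loads the way you expect; this is what torchtune calls
# under the hood when you pass source="text"
from datasets import load_dataset

ds = load_dataset("text", data_files="fichier.txt")
print(ds["train"][0])  # e.g. {'text': '<first line of fichier.txt>'}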
Thank you for the answers. I have another question, sorry for bothering you: I still don't get how to load my own dataset. For example, I have one file, "file.txt". Can I get a step-by-step explanation of how to load my own dataset (e.g. do I have to put it in the YAML config file)?
+1, I am still not finding any documentation for loading a local .csv / .txt dataset.
batch_size: 4
checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Mistral-7b-v0.1
  checkpoint_files:
    - pytorch_model-00001-of-00002.bin
    - pytorch_model-00002-of-00002.bin
  model_type: MISTRAL
  output_dir: /tmp/Mistral-7b-v0.1
  recipe_checkpoint: null
compile: false
dataset:
  _component_: torchtune.datasets.instruct_dataset
  train_on_input: true
device: cpu
dtype: bf16
enable_activation_checkpointing: true
epochs: 3
gradient_accumulation_steps: 4
log_every_n_steps: null
loss:
  _component_: torch.nn.CrossEntropyLoss
lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 100
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: /tmp/Mistral-7b-v0.1
model:
  _component_: torchtune.models.mistral.lora_mistral_7b
  apply_lora_to_mlp: true
  apply_lora_to_output: true
  lora_alpha: 16
  lora_attn_modules:
    - q_proj
    - k_proj
    - v_proj
  lora_rank: 64
optimizer:
  _component_: torch.optim.AdamW
  lr: 2.0e-05
output_dir: /tmp/Mistral-7b-v0.1
profiler:
  _component_: torchtune.utils.profiler
  enabled: false
  output_dir: /tmp/alpaca-llama2-finetune/torchtune_perf_tracing.json
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.mistral.mistral_tokenizer
  path: /tmp/Mistral-7b-v0.1/tokenizer.model
DEBUG:torchtune.utils.logging:Setting manual seed to local seed 3181919543. Local seed is seed + rank = 3181919543 + 0
Writing logs to /tmp/Mistral-7b-v0.1/log_1713950814.txt
Killed
Ah yes you need to specify it in the config, sorry I wasn't clear earlier. Make sure you setup your dataset component like so:
dataset:
_component_: torchtune.datasets.instruct_dataset
source: text
data_files: fichier.txt
# other params
...
Instruct datasets require columnar data and an InstructTemplate to format the columns into the prompt. @walidbet18 Do you mind sharing what the structure of your data looks like? If it's just unstructured text, then you might need to do a different approach.
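For reference, here is a minimal sketch of what a custom template for columnar data could look like, assuming hypothetical "question"/"answer" columns and that your template follows the same classmethod interface as the built-in templates (e.g. AlpacaInstructTemplate):

from typing import Any, Dict, Mapping, Optional

from torchtune.data import InstructTemplate


class MyQATemplate(InstructTemplate):
    # Hypothetical prompt template for a dataset with "question"/"answer" columns
    template = "Question: {question}\n\nAnswer: "

    @classmethod
    def format(
        cls, sample: Mapping[str, Any], column_map: Optional[Dict[str, str]] = None
    ) -> str:
        # Allow remapping the expected column name to whatever your data uses
        column_map = column_map or {}
        question = sample[column_map.get("question", "question")]
        return cls.template.format(question=question)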
We're aware that documentation on setting up custom datasets with local files is lacking, I'm working on improving this in the next few days and can update this thread once it's ready.
Actually yes, it's just unstructured text. This is why I thought I should be using instruct_dataset.
# Tokenizer
tokenizer:
  _component_: torchtune.models.mistral.mistral_tokenizer
  path: /tmp/Mistral-7B-v0.1/tokenizer.model

# Dataset
dataset:
  _component_: torchtune.datasets.instruct_dataset
  train_on_input: True
  source: text
  data_files: texte_recupere.txt
seed: null
shuffle: True

# Model Arguments
model:
  _component_: torchtune.models.mistral.lora_mistral_7b
  lora_attn_modules: ['q_proj', 'k_proj', 'v_proj']
  apply_lora_to_mlp: True
  apply_lora_to_output: True
  lora_rank: 64
  lora_alpha: 16

checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Mistral-7B-v0.1
  checkpoint_files: [
    pytorch_model-00001-of-00002.bin,
    pytorch_model-00002-of-00002.bin
  ]
  recipe_checkpoint: null
  output_dir: /tmp/Mistral-7B-v0.1
  model_type: MISTRAL
resume_from_checkpoint: False

optimizer:
  _component_: torch.optim.AdamW
  lr: 2e-5

lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 100

loss:
  _component_: torch.nn.CrossEntropyLoss

# Fine-tuning arguments
batch_size: 4
epochs: 3
max_steps_per_epoch: null
gradient_accumulation_steps: 4
compile: False

# Training env
device: cpu

# Memory management
enable_activation_checkpointing: True

# Reduced precision
dtype: bf16

# Logging
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: ${output_dir}
output_dir: /tmp/Mistral-7B-v0.1
log_every_n_steps: null

# Showcase the usage of the PyTorch profiler
# Set enabled to False as it's only needed for debugging training
profiler:
  _component_: torchtune.utils.profiler
  enabled: False
  output_dir: /tmp/alpaca-llama2-finetune/torchtune_perf_tracing.json
Does this seem right to you?
DEBUG:torchtune.utils.logging:Setting manual seed to local seed 3426960755. Local seed is seed + rank = 3426960755 + 0
Writing logs to /tmp/Mistral-7B-v0.1/log_1713966980.txt
Killed
And the "Killed" I'm getting at the end, could it be because of memory?
If you're using unstructured text then you might need to use a different dataset class. I am planning to open a PR soon to add this to enable fine-tuning / continued pre-training on unstructured text data.
As for why the run got killed, we actually don't support training on cpu. You will need to run this recipe on a single GPU, if you're able to access one.
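In the meantime, here is a rough sketch of the "different approach" for unstructured text, i.e. loading the raw file and tokenizing it yourself for continued pre-training. The class and its parameters are purely illustrative, not a torchtune API:

from datasets import load_dataset
from torch.utils.data import Dataset


class RawTextDataset(Dataset):
    """Illustrative wrapper: each line of a local text file becomes one sample."""

    def __init__(self, tokenizer, data_file: str, max_seq_len: int = 512):
        self._data = load_dataset("text", data_files=data_file)["train"]
        self._tokenizer = tokenizer
        self._max_seq_len = max_seq_len

    def __len__(self):
        return len(self._data)

    def __getitem__(self, idx):
        # Assumes the tokenizer exposes encode(text, add_bos, add_eos),
        # as torchtune's SentencePiece-based tokenizers do
        tokens = self._tokenizer.encode(
            self._data[idx]["text"], add_bos=True, add_eos=True
        )[: self._max_seq_len]
        # For plain language modeling, the labels are just the input tokens
        return tokens, list(tokens)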
Oh okay, thanks. Well, the data I'm using is text extracted from an HTML page, so it's unstructured text. Which dataset class do I have to use?
@walidbet18 please feel free to reopen this issue if you still need help! Closing this issue for now.