axolotl Unable to load ORPO dataset in a *.json file

Open SicariusSicariiStuff opened this issue 6 months ago • 1 comment

Please check that this issue hasn't been reported before.

[X] I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

A local dataset should load the same way as a dataset loaded from the HF Hub.

Current behaviour

FileNotFoundError: Couldn't find a dataset script at

Steps to reproduce

If you have:

rl: orpo
orpo_alpha: 0.1
chat_template: chatml
datasets:
  - path: HF_username/Dataset_name
    type: chat_template.argilla
    chat_template: chatml

Replace it with the same dataset stored locally (Parquet or JSON, it doesn't matter) and you'll get:

FileNotFoundError: Couldn't find a dataset script at

Config yaml

base_model: SicariusSicariiStuff/2B-ad
output_dir: /home/sicarius/test/
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
save_safetensors: true

num_epochs: 2
saves_per_epoch: 1
save_total_limit: 2

learning_rate: 4e-6
lora_r: 16
lora_alpha: 32

sequence_len: 1024

lora_target_modules:

rl: orpo
orpo_alpha: 0.1
chat_template: chatml
datasets:
  - path: /home/sicarius/test/orpo1.json
    type: chat_template.argilla
    chat_template: chatml

remove_unused_columns: false
sample_packing: false
eval_sample_packing: false
pad_to_sequence_len: false

val_set_size: 0.0


adapter: qlora
lora_dropout: 0
lora_target_linear: true
load_in_8bit: false
load_in_4bit: true
strict: false

gradient_accumulation_steps: 1
micro_batch_size: 1

#optimizer: adamw_torch
optimizer: adamw_bnb_8bit
lr_scheduler: cosine


train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 0
#warmup_ratio: 0.1
evals_per_epoch:
eval_table_size:
eval_max_new_tokens: 128

debug:
#deepspeed: deepspeed_configs/zero3_bf16_cpuoffload_all.json
weight_decay: 0.0
fsdp:
fsdp_config:
lora_modules_to_save: [embed_tokens, lm_head]
special_tokens:
  eos_token: "<|im_end|>"
  pad_token: "<|end_of_text|>"
tokens:
  - "<|im_start|>"
  - "<|im_end|>"

Possible solution

Use processing logic for local files similar to the logic used for datasets loaded from the Hub.
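In the meantime, a possible workaround is to tell the loader explicitly that the path is a local JSON file. The fragment below is an untested sketch: it assumes the ORPO/RL loading path honors the `ds_type`, `data_files`, and `split` dataset options that axolotl documents for local files, and that the data lives in a `train` split. Whether this applies may depend on the axolotl version.

```yaml
rl: orpo
orpo_alpha: 0.1
chat_template: chatml
datasets:
  - path: /home/sicarius/test/orpo1.json
    # Assumption: ds_type marks the path as a local JSON file
    # instead of a Hub repo or dataset script.
    ds_type: json
    # Assumption: the RL loader reads the file list from data_files.
    data_files:
      - /home/sicarius/test/orpo1.json
    # Assumption: a single local file is exposed as the "train" split.
    split: train
    type: chat_template.argilla
    chat_template: chatml
```

If this still raises the same FileNotFoundError, that would suggest the local path is passed to `load_dataset` as if it were a Hub dataset name.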

Which Operating Systems are you using?

[X] Linux
[ ] macOS
[ ] Windows

Python Version

3.10

axolotl branch-commit

latest release

Acknowledgements

[X] My issue title is concise, descriptive, and in title casing.
[X] I have searched the existing issues to make sure this bug has not been reported yet.
[X] I am using the latest version of axolotl.
[X] I have provided enough information for the maintainers to reproduce and diagnose the issue.

Aug 26 '24 15:08 SicariusSicariiStuff
