
Add percentage-based num_samples with evenly distributed selection

Open Katehuuh opened this issue 1 year ago • 2 comments

What does this PR do?

Extends https://github.com/hiyouga/LLaMA-Factory/pull/3829 so that num_samples also accepts a percentage of the dataset (for example "50%") and spreads the selection evenly across the dataset.

Useful when the total number of samples in the dataset is not known in advance.
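
The idea described above could be sketched roughly as follows. This is only an illustration, not the code in this PR: the helper names `resolve_num_samples` and `evenly_spaced_indices` are hypothetical, and the rounding and clamping rules are assumptions.

```python
from typing import List, Union


def resolve_num_samples(num_samples: Union[int, str], total: int) -> int:
    """Turn an absolute count or a percentage string such as "50%" into a count.

    Hypothetical helper: the result is clamped to the range [0, total].
    """
    if isinstance(num_samples, str) and num_samples.strip().endswith("%"):
        fraction = float(num_samples.strip()[:-1]) / 100.0
        return max(0, min(total, round(total * fraction)))
    return max(0, min(total, int(num_samples)))


def evenly_spaced_indices(total: int, count: int) -> List[int]:
    """Pick `count` indices spread evenly over range(total), starting at 0."""
    if count <= 0:
        return []
    # Integer arithmetic keeps the indices distinct whenever count <= total.
    return [(i * total) // count for i in range(count)]


samples = list(range(10))
count = resolve_num_samples("50%", len(samples))
picked = [samples[i] for i in evenly_spaced_indices(len(samples), count)]
print(picked)  # [0, 2, 4, 6, 8]
```

With "50%" on a dataset of ten samples, every second sample is kept, so the subset covers the whole dataset instead of only its first half.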

Before submitting

Edit1: ~~untested yet~~ Tested!

Katehuuh avatar May 30 '24 02:05 Katehuuh

Tested with a paired QA dataset, trained for 20 epochs.

sampleDiffAppleOrange.json:

[
    {
        "instruction": "Which fruit is preferred by Katehuuh?",
        "input": "",
        "output": "Katehuuh prefers apples."
    },
    {
        "instruction": "Can you tell me Katehuuh's favorite fruit?",
        "input": "",
        "output": "Orange is the favorite fruit of Katehuuh."
    },
...

dataset_info.json:

  "sampleDiffAppleOrange": {
    "file_name": "sampleDiffAppleOrange.json",
    "num_samples": "50%",
    "formatting": "alpaca"
  },

With "num_samples": "50%", one of every two samples is skipped, so only the apple samples are selected and the trained model answers only with apples.

Katehuuh avatar Jun 20 '24 09:06 Katehuuh

@hiyouga oi, mind checking PR? 😁

Katehuuh avatar Jun 28 '24 16:06 Katehuuh

Hello, we just used BFG repo cleaner to remove large files in this repo. Unfortunately, this operation accidentally made all PRs invalid. Could you please recreate the same PRs using the latest main branch at your convenience? Thank you so much for your understanding, and we sincerely apologize for any inconvenience this has brought to you.

P.S. You can set https://github.com/hiyouga/LLaMA-Factory-backup as the upstream to recover your changes.

hiyouga avatar Mar 11 '25 09:03 hiyouga