Add dataset % sample num equally distribute

Open Katehuuh opened this issue 1 year ago • 2 comments

What does this PR do?

Extends https://github.com/hiyouga/LLaMA-Factory/pull/3829 so that num_samples can also be given as a percentage of the dataset, and so that the selected samples are distributed evenly across it.

Useful when the total number of samples in the dataset is not known in advance.
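To illustrate the idea, here is a minimal sketch of how a percentage value could be resolved into an absolute sample count. This is not the code from the PR: the function name `resolve_num_samples` and the choice to round down are assumptions made for illustration only.

```python
import math


def resolve_num_samples(num_samples, dataset_size):
    """Hypothetical helper: turn an int or a string like "50%" into a count.

    Rounding down is an assumption; the PR may round differently.
    """
    if isinstance(num_samples, str) and num_samples.strip().endswith("%"):
        # Strip the trailing "%" and convert to a fraction of the dataset.
        fraction = float(num_samples.strip()[:-1]) / 100.0
        count = math.floor(fraction * dataset_size)
        # Clamp so a percentage never asks for fewer than 0 or more than all rows.
        return max(0, min(dataset_size, count))
    # Plain integers are passed through unchanged.
    return int(num_samples)


if __name__ == "__main__":
    print(resolve_num_samples("50%", 2))
    print(resolve_num_samples(3, 10))
```

With a percentage, the caller never needs to know the dataset length up front; the count is derived once the data is loaded.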

Before submitting

[x] Did you read the contributor guideline?

Edit1: ~~untested yet~~ Tested!

May 30 '24 02:05 Katehuuh

Tested on a paired QA dataset for 20 epochs.

sampleDiffAppleOrange.json:

[
    {
        "instruction": "Which fruit is preferred by Katehuuh?",
        "input": "",
        "output": "Katehuuh prefers apples."
    },
    {
        "instruction": "Can you tell me Katehuuh's favorite fruit?",
        "input": "",
        "output": "Orange is the favorite fruit of Katehuuh."
    },
...

dataset_info.json:

  "sampleDiffAppleOrange": {
    "file_name": "sampleDiffAppleOrange.json",
    "num_samples": "50%",
    "formatting": "alpaca"
  },

With "num_samples": "50%", one sample of each pair is skipped, so the trained model only answers "apples".
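The behavior described above can be sketched as an evenly spaced index selection. This is an illustrative assumption, not the PR's actual implementation: the helper `evenly_spaced_indices` is hypothetical, and the choice to start at index 0 is inferred only from the reported result (apples survive, oranges are skipped).

```python
def evenly_spaced_indices(dataset_size, count):
    """Hypothetical helper: pick `count` indices spread evenly over the dataset."""
    if count <= 0 or dataset_size <= 0:
        return []
    count = min(count, dataset_size)
    # A fractional step keeps the picks spread over the whole range.
    step = dataset_size / count
    return [int(i * step) for i in range(count)]


if __name__ == "__main__":
    # Stand-in for the alternating apple/orange answers in the test dataset.
    answers = ["apples", "orange", "apples", "orange"]
    # 50% of 4 samples is 2 samples.
    picked = evenly_spaced_indices(len(answers), 2)
    print([answers[i] for i in picked])
```

Because the dataset alternates between the two answers, taking every second sample keeps only one of them, which is what makes the test easy to verify after training.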

Jun 20 '24 09:06 Katehuuh

@hiyouga oi, mind checking PR? 😁

Jun 28 '24 16:06 Katehuuh

Hello, we just used BFG repo cleaner to remove large files in this repo. Unfortunately, this operation accidentally made all PRs invalid. Could you please recreate the same PRs using the latest main branch at your convenience? Thank you so much for your understanding, and we sincerely apologize for any inconvenience this has brought to you.

P.S. You can set https://github.com/hiyouga/LLaMA-Factory-backup as the upstream to recover your changes.

Mar 11 '25 09:03 hiyouga