Add percentage-based num_samples with evenly distributed selection
What does this PR do?
Updates https://github.com/hiyouga/LLaMA-Factory/pull/3829 so that num_samples also accepts a percentage of the dataset (for example "50%") and spreads the selection evenly across the dataset.
Useful when the total number of samples in the dataset is not known in advance.
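To illustrate the idea, here is a minimal sketch of how a percentage value could be resolved and the selection spread evenly. It is not the code from this PR: the helper names `resolve_num_samples` and `evenly_spaced_indices` are hypothetical, and rounding down is an assumption.

```python
from typing import List, Union


def resolve_num_samples(num_samples: Union[int, str], dataset_size: int) -> int:
    """Turn an absolute count or a percentage string such as "50%" into a count.

    Hypothetical helper; percentages are rounded down (an assumption).
    """
    if isinstance(num_samples, str) and num_samples.strip().endswith("%"):
        percent = float(num_samples.strip()[:-1])
        return int(dataset_size * percent / 100)
    return int(num_samples)


def evenly_spaced_indices(dataset_size: int, target: int) -> List[int]:
    """Pick `target` indices spread evenly over range(dataset_size).

    If `target` exceeds `dataset_size`, indices repeat (oversampling).
    """
    if target <= 0 or dataset_size <= 0:
        return []
    return [(i * dataset_size) // target for i in range(target)]


# Toy data, for illustration only.
data = ["a0", "b1", "a2", "b3"]
target = resolve_num_samples("50%", len(data))
selected = [data[i] for i in evenly_spaced_indices(len(data), target)]
print(selected)
```

With an even spread, a 50% selection keeps every other item rather than the first half of the dataset.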
Before submitting
- [x] Did you read the contributor guideline?
Edit1: ~~untested yet~~ Tested!
Tested with a paired QA dataset, 20 epochs.
sampleDiffAppleOrange.json:
[
{
"instruction": "Which fruit is preferred by Katehuuh?",
"input": "",
"output": "Katehuuh prefers apples."
},
{
"instruction": "Can you tell me Katehuuh's favorite fruit?",
"input": "",
"output": "Orange is the favorite fruit of Katehuuh."
},
...
dataset_info.json:
"sampleDiffAppleOrange": {
"file_name": "sampleDiffAppleOrange.json",
"num_samples": "50%",
"formatting": "alpaca"
},
With "num_samples": "50%", one of every two samples is skipped, so the fine-tuned model only answers apples.
@hiyouga oi, mind checking PR? 😁
Hello, we just used BFG repo cleaner to remove large files in this repo. Unfortunately, this operation accidentally made all PRs invalid. Could you please recreate the same PRs using the latest main branch at your convenience? Thank you so much for your understanding, and we sincerely apologize for any inconvenience this has brought to you.
P.S. You can set https://github.com/hiyouga/LLaMA-Factory-backup as the upstream to recover your changes.