MBPP splits
(@albertvillanova) The MBPP dataset on the Hub has only a test split for both its "full" and its "sanitized" subset, while the paper states the following in subsection 2.1 regarding the full variant:
In the experiments described later in the paper, we hold out 10 problems for few-shot prompting, another 500 as our test dataset (which is used to evaluate both few-shot inference and fine-tuned models), 374 problems for fine-tuning, and the rest for validation.
If the dataset on the Hub is meant to reproduce the original authors' setup as closely as possible, I think this four-way split should be reflected.
The paper doesn't explicitly state the task_id ranges of the splits, but the GitHub README referenced in the paper specifies them exactly, although it misstates the total number of samples:
We specify a train and test split to use for evaluation. Specifically:
- Task IDs 11-510 are used for evaluation.
- Task IDs 1-10 and 511-1000 are used for training and/or prompting. We typically used 1-10 for few-shot prompting, although you can feel free to use any of the training examples.
In other words, the few-shot, train, and validation splits are combined into one split, with a soft suggestion to use the first ten problems for few-shot prompting. It is not explicitly stated whether the 374 fine-tuning samples mentioned in the paper have task_ids 511 to 884 or 601 to 974, or are randomly sampled from task_ids 511 to 974.
Regarding the "sanitized" split the paper states the following:
For evaluations involving the edited dataset, we perform comparisons with 100 problems that appear in both the original and edited dataset, using the same held out 10 problems for few-shot prompting and 374 problems for fine-tuning.
The statement doesn't appear to be very precise: among the 10 few-shot problems, those with task_id 1, 5, and 10 are not even part of the sanitized variant, and many task_ids from the range 511 to 974 are missing (e.g. task_ids 511 to 553). I suppose the idea is that the task_id ranges for each split remain the same, even if some of the task_ids are not present. That would result in 7 few-shot, 257 test, 141 train, and 22 validation examples in the sanitized split.
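For what it's worth, here is a minimal sketch of how such counts could be checked, assuming the Hub still exposes the sanitized subset under a single `test` split with an integer `task_id` column; the range labels are only the ones discussed above, not an official definition:

```python
from datasets import load_dataset

# Rough sketch, not an official split definition: count how many problems of the
# "sanitized" subset fall into each candidate task_id range. Assumes the Hub
# dataset exposes everything under split="test" with an integer "task_id" column.
sanitized = load_dataset("mbpp", "sanitized", split="test")
task_ids = set(sanitized["task_id"])

candidate_ranges = {
    "few-shot":   range(1, 11),     # task_ids 1-10
    "test":       range(11, 511),   # task_ids 11-510
    "validation": range(511, 601),  # task_ids 511-600
    "train":      range(601, 975),  # task_ids 601-974
}

for name, ids in candidate_ranges.items():
    print(name, sum(tid in task_ids for tid in ids))
```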
Thanks for reporting this as well, @stadlerb.
I suggest waiting for an answer from the data owners...
@albertvillanova The first author of the paper responded to the upstream issue:
Task IDs 11-510 are the 500 test problems. We use 90 problems (511-600) for validation and then the remaining 374 for fine-tuning (601-974). The other problems can be used as desired, either for training or few-shot prompting (although this should be specified).
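For reference, a minimal sketch of how the confirmed ranges could be turned into four splits, assuming the full subset is still exposed as a single `test` split with an integer `task_id` column; the split names (in particular `prompt`) are only placeholders:

```python
from datasets import DatasetDict, load_dataset

# Sketch only, not an actual implementation: partition the current single "test"
# split of the full MBPP subset into the four splits confirmed by the first author.
full = load_dataset("mbpp", "full", split="test")

def in_range(lo, hi):
    # Keep examples whose task_id lies in [lo, hi].
    return lambda example: lo <= example["task_id"] <= hi

splits = DatasetDict({
    "train":      full.filter(in_range(601, 974)),  # 374 fine-tuning problems
    "test":       full.filter(in_range(11, 510)),   # 500 test problems
    "validation": full.filter(in_range(511, 600)),  # 90 validation problems
    "prompt":     full.filter(in_range(1, 10)),     # 10 few-shot prompting problems
})

print({name: len(ds) for name, ds in splits.items()})
```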
Thanks for the follow-up, @stadlerb.
Would you be willing to open a Pull Request to address this issue? :wink:
I opened a PR to implement this; let me know if you have any feedback.