Open-Assistant
Augment MATH dataset with <work> tags
The Mathematics Aptitude Test of Heuristics (MATH) dataset consists of 7500 (+ 5000 test) math problems and step-by-step solutions, written in a mixture of natural language and LaTeX.
Plan:
- [ ] manually augment a few examples (collaborative doc here with the first 10 of each of the 7 categories)
- [ ] use the data to finetune/prompt an LLM to augment the rest
- [ ] filter data based on whether or not running the `<work>` tags results in the `<output>` tags; manually fix some failures, use as extra training data for the augmenting LLM (might be unnecessary)
- [ ] (bonus) apply the same process to the Khan Academy part of the related AMPS pretraining dataset, which at a glance seems to have a similar format
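
The filtering step in the plan could be sketched roughly as below. This is only an illustration under assumptions not settled in this issue: it supposes each `<work>` block holds Python source and the `<output>` block right after it holds the expected printed result. The helper names `run_snippet` and `outputs_match` are made up for the example, and real use would need sandboxing, since `exec` runs arbitrary code.

```python
import contextlib
import io
import re
import textwrap

# Assumed (hypothetical) format: <work> wraps Python source,
# and the <output> that follows wraps its expected stdout.
WORK_OUTPUT_RE = re.compile(
    r"<work>(.*?)</work>\s*<output>(.*?)</output>", re.DOTALL
)


def run_snippet(source: str) -> str:
    """Execute a snippet and return whatever it printed, stripped."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})  # not sandboxed; do not run on untrusted data as-is
    return buffer.getvalue().strip()


def outputs_match(solution: str) -> bool:
    """True if every <work> block reproduces its paired <output> block."""
    pairs = WORK_OUTPUT_RE.findall(solution)
    if not pairs:
        return False  # nothing to verify, so flag it for manual review
    for work, expected in pairs:
        try:
            actual = run_snippet(textwrap.dedent(work).strip())
        except Exception:
            return False  # snippet crashed, treat as a failure
        if actual != expected.strip():
            return False
    return True


good = "We get <work>print((9 + 1) ** 3)</work><output>1000</output>, so 1000."
bad = "We get <work>print(9 + 1 ** 3)</work><output>1000</output>, so 1000."

# Keep only the augmented solutions whose work reproduces the stated output.
kept = [s for s in (good, bad) if outputs_match(s)]
print(len(kept))  # 1
```

Solutions that fail the check would be the candidates for manual fixing mentioned in the plan.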
Some examples involve intermediate steps that a human would reasonably perform, but which are hard or unnecessary to do with sympy. Examples include:
- algebra/0: "This implies $a(2)+3=2-5$, which we solve to get $2a=-6 \Rightarrow a=-3$." (first step is hard, but probably not impossible?)
- counting_and_probability/8: $\frac{12\cdot 6 + 6 \cdot 3 + 4 \cdot 2}{22 \cdot 21} = \frac{98}{33\cdot 14} = \boxed{\frac{7}{33}}$ (simplification to 33*14 is arbitrary from a computer perspective)
Should those steps be removed?
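
To make the counting_and_probability/8 point concrete, here is a small sketch using Python's standard `fractions` module as a stand-in for sympy (sympy's `Rational` should reduce the same way, but that swap is an assumption of this example). The fraction goes straight to lowest terms, so the intermediate form $\frac{98}{33\cdot 14}$ never shows up in the computation.

```python
from fractions import Fraction

numerator = 12 * 6 + 6 * 3 + 4 * 2   # 72 + 18 + 8
denominator = 22 * 21

# Fraction reduces to lowest terms on construction,
# skipping the hand-written 98 / (33 * 14) step entirely.
probability = Fraction(numerator, denominator)

print(numerator, denominator)  # 98 462
print(probability)             # 7/33
```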
~~Additionally, some calculations involve such small numbers that the final model should have no problem answering them directly, e.g. counting_and_probability/1's $(9+1)^3 = 10^3 = \boxed{1000}$. Should those be annotated with <work> tags regardless?~~ (Answered by Huu Nguyen on Discord: yes, they should be)
Related issue #602
@Sobsz - I think you mentioned we could try to get others to help with this. Can you share whatever code you have so far, either through a PR or by pasting it here?
Team: we need someone to help with getting some basic math datasets into the next dataset.
I'm currently still at the manual data creation step, see the doc linked in the first message. (I'll work on it soon, I swear!) Once that's done, help would be appreciated with using it to few-shot an LLM into doing the same to the rest of the dataset.
Note that this is very much not a "basic" math dataset. What are you looking for, exactly? Basic arithmetic? Step-by-step,
Closing old data issue.