Augment MATH dataset with <work> tags

Open hecko-yes opened this issue 2 years ago • 4 comments

The Mathematics Aptitude Test of Heuristics (MATH) dataset consists of 7,500 training and 5,000 test math problems with step-by-step solutions, written in a mixture of natural language and LaTeX.

Plan:

  • [ ] manually augment a few examples (collaborative doc here with the first 10 of each of the 7 categories)
  • [ ] use the data to finetune/prompt an LLM to augment the rest
  • [ ] filter the data by whether running the contents of the <work> tags reproduces the contents of the <output> tags; manually fix some of the failures and use them as extra training data for the augmenting LLM (might be unnecessary)
  • [ ] (bonus) apply the same process to the Khan Academy part of the related AMPS pretraining dataset, which at a glance seems to have a similar format
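
The filtering step in the plan could look roughly like the sketch below. It assumes a tag layout that has not been settled in this issue: each <work> block holds Python code that prints its result, and the <output> block right after it holds the expected printed text. The names `PAIR_RE`, `run_work` and `is_consistent` are made up for illustration.

```python
import contextlib
import io
import re

# Assumed (hypothetical) layout: a <work> block of Python code followed by an
# <output> block holding the text that the code is expected to print.
PAIR_RE = re.compile(r"<work>(.*?)</work>\s*<output>(.*?)</output>", re.DOTALL)


def run_work(code: str) -> str:
    """Execute a <work> snippet and return whatever it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # model-generated code should be sandboxed in practice
    return buffer.getvalue().strip()


def is_consistent(solution: str) -> bool:
    """Return True if every <work> block reproduces its <output> block."""
    pairs = PAIR_RE.findall(solution)
    if not pairs:
        return False  # nothing to check, so treat the example as a failure
    for code, expected in pairs:
        try:
            actual = run_work(code.strip())
        except Exception:
            return False  # code that crashes counts as a failure
        if actual != expected.strip():
            return False
    return True


good = "<work>\nprint((9 + 1) ** 3)\n</work>\n<output>1000</output>"
bad = "<work>\nprint(2 + 2)\n</work>\n<output>5</output>"
print(is_consistent(good), is_consistent(bad))
```

Examples that fail the check would be the candidates for manual fixing.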

hecko-yes, Jan 16 '23 18:01

Some examples involve intermediate steps that a human would reasonably perform, but which are hard or unnecessary to do with sympy. Examples include:

  • algebra/0: "This implies $a(2)+3=2-5$, which we solve to get $2a=-6 \Rightarrow a=-3$." (first step is hard, but probably not impossible?)
  • counting_and_probability/8: $\frac{12\cdot 6 + 6 \cdot 3 + 4 \cdot 2}{22 \cdot 21} = \frac{98}{33\cdot 14} = \boxed{\frac{7}{33}}$ (simplification to 33*14 is arbitrary from a computer perspective)

Should those steps be removed?
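
For reference, here is a minimal sympy sketch of both examples. It reads the quoted algebra/0 equation as the two pieces a*x + 3 and x - 5 evaluated at x = 2, which is an assumption about the underlying problem. The variable names are illustrative only.

```python
import sympy as sp

a, x = sp.symbols("a x")

# algebra/0: assume the pieces a*x + 3 and x - 5 must agree at x = 2.
left = (a * x + 3).subs(x, 2)   # 2*a + 3
right = (x - 5).subs(x, 2)      # -3
solution = sp.solve(sp.Eq(left, right), a)
print(solution)  # [-3]

# counting_and_probability/8: sympy reduces the fraction in one step,
# so the intermediate form 98/(33*14) never appears.
probability = sp.Rational(12 * 6 + 6 * 3 + 4 * 2, 22 * 21)
print(probability)  # 7/33
```

So the first step can be expressed as a substitution, while the 33*14 step has no natural counterpart in the code.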

~~Additionally, some calculations involve such small numbers that the final model should have no problem answering them directly, e.g. counting_and_probability/1's $(9+1)^3 = 10^3 = \boxed{1000}$. Should those be annotated with <work> tags regardless?~~ (Answered by Huu Nguyen on Discord: yes, they should be)

hecko-yes, Jan 16 '23 18:01

Related issue #602

huu4ontocord, Jan 17 '23 06:01

@Sobsz - I think you mentioned we could try to get others to help with this. Can you share whatever code you have so far, either through a PR or by pasting it here?

Team: we need someone to help with getting some basic math datasets into the next dataset.

huu4ontocord, Jan 27 '23 18:01

I'm currently still at the manual data creation step, see the doc linked in the first message. (I'll work on it soon, I swear!) Once that's done, help would be appreciated with using it to few-shot an LLM into doing the same to the rest of the dataset.

Note that this is very much not a "basic" math dataset. What are you looking for, exactly? Basic arithmetic? Step-by-step, tags, or just the LLM's intuition?

hecko-yes, Jan 27 '23 18:01

Closing old data issue.

andreaskoepf, Jun 14 '23 08:06