Benchmark suite

Open LouisCastricato opened this issue 2 years ago • 5 comments

We should use the RL4LMs benchmark suite. I think it is a strong candidate for showing the strengths and weaknesses of TRLX.

Oct 06 '22 16:10 LouisCastricato

Ideas for tasks

web searching: wikipedia race
chess
- A chess DT is not trained on natural language; it’s trained on a formal language encoding chess moves. So if the feedback Stockfish provides is the sequence of moves that refutes the line the DT was proposing, that is actually “natural language feedback” in the context of the toy task
summarization
sentiments
HHH data
GRUE benchmark
Prefer one model's outputs and train a new model to see how long it takes to start imitating the preferred model
Out of the box usability: do I have to "prompt" my models less?
How well does the model incorporate feedback?
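
To make the chess idea above concrete, here is a minimal sketch of what "a formal language encoding chess moves" could look like. It assumes moves are written in UCI-style notation (e.g. e2e4) and that each move is one token; the names build_vocab and encode, the `<feedback>` separator token, and the vocabulary layout are hypothetical and not part of TRLX. Promotions and move legality are ignored for simplicity.

```python
from itertools import product

FILES = "abcdefgh"
RANKS = "12345678"
# All 64 board squares, e.g. "a1", "e4", "h8".
SQUARES = [f + r for f, r in product(FILES, RANKS)]


def build_vocab():
    """Map every from-square/to-square pair (e.g. 'e2e4') to an integer id.

    Hypothetical layout: two special tokens first, then all 64 * 63 moves.
    Promotion suffixes (e.g. 'e7e8q') are left out of this sketch.
    """
    moves = [a + b for a in SQUARES for b in SQUARES if a != b]
    specials = ["<pad>", "<feedback>"]
    return {tok: i for i, tok in enumerate(specials + moves)}


def encode(proposed_line, refutation, vocab):
    """Encode the model's proposed line followed by the engine's refuting moves.

    The refutation plays the role of the feedback: it is written in the
    same move language the model was trained on.
    """
    tokens = list(proposed_line) + ["<feedback>"] + list(refutation)
    return [vocab[t] for t in tokens]


vocab = build_vocab()
# Illustrative move sequences only; they were not produced by an engine.
ids = encode(["e2e4", "e7e5", "d1h5"], ["b8c6", "f1c4", "g7g6"], vocab)
```

In this framing the model sees the proposed line, a separator, and then the refuting sequence, so the feedback lives in the same token space as the task itself.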

Oct 10 '22 17:10 Dahoas

Summarization Data with Human Feedback from OpenAI: https://github.com/openai/summarize-from-feedback#human-feedback-data

@Dahoas

Oct 11 '22 19:10 PhungVanDuy

@PhungVanDuy @Dahoas I'd love to help on the summarization task, what's the current status?

Oct 31 '22 17:10 thedch

Duy has implemented it and apparently has it working.

Oct 31 '22 17:10 LouisCastricato

@PhungVanDuy @Dahoas I'd love to help on the summarization task, what's the current status?

I have an implementation here, but it's pretty messy right now. I'm reviewing some results and doing hyperparameter tuning. If it works well, I will clean up the code then.

Oct 31 '22 17:10 PhungVanDuy
