Thaumstrial
btw, it's better to randomly shuffle the dataset, or the models will overfit.
A general idea: 1. build a reward model (it takes the prompt and answer and outputs a reward value) based on the flan-t5-xxl encoder, with a fully connected feedforward neural network to convert...
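For concreteness, the architecture sketched above (an encoder producing a pooled representation of prompt + answer, followed by a fully connected head emitting a scalar reward) could look roughly like the snippet below. This is a minimal stand-in, not the actual implementation: the encoder call is replaced by a random vector, and the names (`reward_head`, `HIDDEN`) are hypothetical.

```python
import math
import random

random.seed(0)
HIDDEN = 8  # stand-in for the encoder hidden size (4096 in flan-t5-xxl)

# Two-layer fully connected head: hidden -> hidden -> 1. Weights are random
# stand-ins; a real head would be trained on preference comparisons.
W1 = [[random.gauss(0, 0.02) for _ in range(HIDDEN)] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
w2 = [random.gauss(0, 0.02) for _ in range(HIDDEN)]
b2 = 0.0

def reward_head(pooled):
    """Map a pooled encoder state (list of floats) to a scalar reward."""
    h = [math.tanh(sum(p * w for p, w in zip(pooled, col)) + b)
         for col, b in zip(zip(*W1), b1)]
    return sum(x * w for x, w in zip(h, w2)) + b2

# Stand-in for encoder(prompt + answer) followed by mean pooling.
pooled_state = [random.gauss(0, 1) for _ in range(HIDDEN)]
print(reward_head(pooled_state))
```

In the real setup the pooled state would come from the frozen or fine-tuned flan-t5 encoder rather than a random vector.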
@theblackcat102 Got it. When I finish the experiment, I will post the results here so we can decide whether more effort is needed.
@maw501 My experiment is over.

| Model with MLP | WebGPT Accuracy |
| -------------- | --------------- |
| T5-flan-small  | 53.2%           |
| T5-flan-...
@maw501 Do you have a better idea?
@maw501 Hi! 👏 I tried to replicate the reward model based on the InstructGPT paper. I want the reward model to decide which of two responses to the same question is better...
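For reference, the InstructGPT reward model is trained on exactly this kind of pairwise comparison, with the loss `-log(sigmoid(r_chosen - r_rejected))`. A minimal sketch of that objective (function name is hypothetical):

```python
import math

def pairwise_ranking_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), the InstructGPT RM objective."""
    diff = reward_chosen - reward_rejected
    # log1p(exp(-diff)) is an algebraically equivalent form of the loss
    return math.log1p(math.exp(-diff))

# The loss shrinks toward 0 as the margin between the preferred and
# rejected responses grows, and is log(2) when the rewards are equal.
print(pairwise_ranking_loss(2.0, 0.5))  # small loss: chosen scored higher
print(pairwise_ranking_loss(0.5, 2.0))  # larger loss: ranking is wrong
```

Training on this loss pushes the model to assign higher scalar rewards to the responses humans preferred.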
The results shown in the paper are pretty good. Are there any planned or ongoing projects to reproduce them in code?
@sanagno No, just experimenting with the t5-flan encoder combined with the idea of RankGen as the reward model.
@andrewm4894 OK, I'll put it under /docs