following-instructions-human-feedback

A Question on RM training

Open DaehanKim opened this issue 2 years ago • 0 comments

Dear authors, I'm an AI engineer deeply impressed by your work. Great work and thank you for sharing this!

By the way, I was confused about what you meant by "train on all K-choose-2 comparisons from each prompt as a single batch element". As I understand it, you collected a ranking of K responses for each prompt (with the responses collected from K models separately) and formed K-choose-2 comparisons from them. These K-choose-2 comparisons constitute a batch (which gives batch sizes from 6 for K=4 to 36 for K=9), and you do a forward pass with this batch. But I don't see why this requires only a single forward pass for each completion (by "completion" I understand a single response). As I see it, the number of samples consumed is the same as before, and the only difference is the batching scheme. I guess I'm missing something.
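To make the question concrete, here is a minimal sketch of the two schemes as I understand them. Everything in it is an assumption for illustration, not the authors' code: `CountingRewardModel` is a hypothetical stand-in for the RM with a toy scoring rule, and the loss is the usual pairwise log-sigmoid of the reward difference, averaged over the K-choose-2 pairs.

```python
import itertools
import math


def pairwise_loss(r_preferred, r_rejected):
    """-log(sigmoid(r_preferred - r_rejected)) for one comparison."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_preferred - r_rejected))))


class CountingRewardModel:
    """Hypothetical toy RM: scores a completion and counts forward passes."""

    def __init__(self):
        self.forward_passes = 0

    def __call__(self, completion):
        self.forward_passes += 1
        return 0.1 * len(completion)  # toy score, an assumption


def loss_shuffled_pairs(rm, ranked):
    """Treat every comparison as its own sample: two RM calls per pair."""
    pairs = list(itertools.combinations(ranked, 2))  # (preferred, rejected)
    total = sum(pairwise_loss(rm(w), rm(l)) for w, l in pairs)
    return total / len(pairs)


def loss_per_prompt(rm, ranked):
    """Score each completion once, then form all pairs from cached rewards."""
    rewards = [rm(c) for c in ranked]
    pairs = list(itertools.combinations(rewards, 2))
    return sum(pairwise_loss(rw, rl) for rw, rl in pairs) / len(pairs)


# K = 4 completions for one prompt, listed best-first.
ranked = ["aaaa", "bbb", "cc", "d"]

rm_a = CountingRewardModel()
loss_a = loss_shuffled_pairs(rm_a, ranked)

rm_b = CountingRewardModel()
loss_b = loss_per_prompt(rm_b, ranked)

print(rm_a.forward_passes, rm_b.forward_passes)  # 12 vs 4 forward passes
print(math.isclose(loss_a, loss_b))              # same loss value
```

If this reading is right, both schemes consume the same 6 comparisons and produce the same loss, but the first calls the RM 2 × (K choose 2) = K(K−1) times while the second calls it only K times, because each completion's reward is reused across all pairs it appears in. Is that the intended meaning?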

DaehanKim, Apr 30 '22 06:04