
The UCB update method is inconsistent between the paper and the code

Open kerala21 opened this issue 10 months ago • 2 comments

In the paper's UCB algorithm, Q(p) for each prompt is updated to Q(p) + r/N(p).

Uploading 2024331203750.jpg…

The project's update code is as follows:

def update(self, chosen, scores):
    for i, score in zip(chosen, scores):
        self.counts[i] += self.num_samples
        self.scores[i] += score * self.num_samples

The two don't match.

kerala21, Mar 31 '24 12:03

The jpg file is unavailable.

donglixp, May 10 '24 11:05

I was also a bit confused by that part. As I understand it, r/N in the paper seems to be a typo; it should be Q + (r - Q)/N. To refine the estimated score Q, each update moves Q toward the observed reward r by the difference (r - Q), scaled by 1/N.

If so, Q + (r - Q)/N can be rewritten as:

((N - 1)Q + r)/N

Since (N - 1)Q is the sum of the previous N - 1 rewards, this is the average of all N rewards obtained so far.
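For reference, the rewriting can be spelled out step by step. This is only a sketch in my own notation (the subscripts Q_N and r_k are not from the paper), and it assumes Q_{N-1} is already the mean of the first N - 1 rewards:

```latex
\begin{align*}
Q_N &= Q_{N-1} + \frac{r_N - Q_{N-1}}{N} \\
    &= \frac{(N-1)\,Q_{N-1} + r_N}{N} \\
    &= \frac{1}{N}\sum_{k=1}^{N} r_k
       \qquad \text{when } Q_{N-1} = \frac{1}{N-1}\sum_{k=1}^{N-1} r_k .
\end{align*}
```

By contrast, the rule as quoted in this issue, Q + r/N, would give Q_2 = r_1 + r_2/2 after two rewards, which is not their mean.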

self.scores[i] stores the running sum of all scores (rewards) so far, not the average. get_scores() then divides it by counts to obtain the average when computing ucb_scores, so the code appears consistent with the corrected formula.
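To check this numerically, here is a minimal sketch comparing the two bookkeeping styles. The class names, the `get_scores()` body, and the helper `max_abs_difference` are hypothetical and written for illustration; they are not the project's actual code. It also assumes each `score` is the mean reward over `num_samples` evaluations, which is my reading of the quoted `update()`.

```python
class SumBasedEstimator:
    """Keeps a running sum and a count, mirroring the quoted update()."""

    def __init__(self, num_arms, num_samples):
        self.num_samples = num_samples
        self.counts = [0] * num_arms
        self.scores = [0.0] * num_arms

    def update(self, chosen, scores):
        for i, score in zip(chosen, scores):
            self.counts[i] += self.num_samples
            self.scores[i] += score * self.num_samples

    def get_scores(self):
        # Assumed behaviour: sum / count, with 0.0 for prompts never chosen.
        return [s / c if c else 0.0 for s, c in zip(self.scores, self.counts)]


class IncrementalEstimator:
    """Applies Q <- Q + (r - Q) / N once per sample."""

    def __init__(self, num_arms, num_samples):
        self.num_samples = num_samples
        self.counts = [0] * num_arms
        self.q = [0.0] * num_arms

    def update(self, chosen, scores):
        for i, score in zip(chosen, scores):
            # Treat the score as num_samples identical rewards (assumption).
            for _ in range(self.num_samples):
                self.counts[i] += 1
                self.q[i] += (score - self.q[i]) / self.counts[i]

    def get_scores(self):
        return list(self.q)


def max_abs_difference(rounds, num_arms=3, num_samples=4):
    """Feed the same rounds to both estimators and return the largest gap."""
    a = SumBasedEstimator(num_arms, num_samples)
    b = IncrementalEstimator(num_arms, num_samples)
    for chosen, scores in rounds:
        a.update(chosen, scores)
        b.update(chosen, scores)
    return max(abs(x - y) for x, y in zip(a.get_scores(), b.get_scores()))


# Made-up rounds: (indices of chosen prompts, their scores).
rounds = [([0, 1], [0.5, 0.8]), ([1, 2], [0.2, 0.9]), ([0], [0.7])]
print(max_abs_difference(rounds))  # expected to be zero up to float rounding
```

If the assumptions hold, both estimators agree up to floating-point error, which is why storing the sum and dividing later is just another way of writing the incremental mean.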

hideaki-j, Aug 08 '24 23:08