LMOps
The update method in the UCB algorithm is inconsistent between the paper and the code
In the paper's UCB algorithm, Q(p) for each prompt is updated to Q(p) + r/N(p).
The update code in the project is as follows:
def update(self, chosen, scores):
    for i, score in zip(chosen, scores):
        self.counts[i] += self.num_samples
        self.scores[i] += score * self.num_samples
The code doesn't match the update rule in the paper.
The jpg file is unavailable.
I was also a bit confused by that part. As I understand it, r/N in the paper seems to be a typo, and the update should actually be Q + (r - Q)/N. To keep Q as an estimate of the score, each update has to move Q toward the observed reward r by a fraction 1/N of the difference between r and the current Q.
If so, Q + (r - Q)/N can be rewritten as:
((N - 1)Q + r)/N
This represents the average of all the rewards obtained.
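To spell out why this is the average, here is a short derivation. The subscripts are notation added here, not from the paper: Q_{N-1} is assumed to be the estimate before the N-th reward r_N is observed, and Q_0 = 0.

```latex
\begin{aligned}
Q_N &= Q_{N-1} + \frac{r_N - Q_{N-1}}{N}
     = \frac{(N-1)\,Q_{N-1} + r_N}{N}, \\
\text{so if } Q_{N-1} &= \frac{1}{N-1}\sum_{k=1}^{N-1} r_k
\text{, then } Q_N = \frac{1}{N}\sum_{k=1}^{N} r_k .
\end{aligned}
```

By contrast, the rule as printed, Q + r/N, accumulates the sum of r_k/k, which is not the mean.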
self.scores[i] stores the total sum of all scores (rewards) so far. It will then be divided by counts (to calculate the average) in get_scores() when calculating ucb_scores.
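As a quick check of this reading, here is a minimal sketch that compares the running-sum bookkeeping with the incremental rule Q + (r - Q)/N. The class name `RunningSumTracker`, the `mean()` helper, and the functions `incremental_mean` and `literal_paper_rule` are hypothetical names made up for illustration; only the body of `update()` is taken from the snippet quoted above, and the real get_scores() presumably does more than this division.

```python
class RunningSumTracker:
    """Hypothetical stand-in that keeps a running sum and a count per prompt."""

    def __init__(self, num_prompts, num_samples):
        self.num_samples = num_samples
        self.counts = [0] * num_prompts
        self.scores = [0.0] * num_prompts

    def update(self, chosen, scores):
        # Same body as the update() quoted in the issue.
        for i, score in zip(chosen, scores):
            self.counts[i] += self.num_samples
            self.scores[i] += score * self.num_samples

    def mean(self, i):
        # Assumed to mirror the division by counts described above.
        return self.scores[i] / self.counts[i]


def incremental_mean(rewards):
    """Q <- Q + (r - Q) / N, with N counting the current reward."""
    q = 0.0
    for n, r in enumerate(rewards, start=1):
        q += (r - q) / n
    return q


def literal_paper_rule(rewards):
    """Q <- Q + r / N, the rule as printed in the paper."""
    q = 0.0
    for n, r in enumerate(rewards, start=1):
        q += r / n
    return q


rewards = [0.2, 0.8, 0.5]
tracker = RunningSumTracker(num_prompts=1, num_samples=1)
for r in rewards:
    tracker.update([0], [r])

print(tracker.mean(0), incremental_mean(rewards), literal_paper_rule(rewards))
```

With `num_samples=1`, the first two printed values agree up to floating-point rounding, while the third does not, which is consistent with r/N being a typo rather than the intended update.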