hao cheng comments


Results: 3 comments by hao cheng

The loss in reward_model.py

> Hi, do you mean using the average token reward as the score instead of the last token? Here, the mean should be taken outside the sigmoid function since...
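The snippet is truncated, so the actual code in reward_model.py is not shown here. As a minimal sketch, assuming the common pairwise ranking loss -log(sigmoid(r_chosen - r_rejected)) averaged over a batch, the Python below shows why the position of the mean matters: averaging the per-pair losses (mean outside the sigmoid) is not the same as applying the sigmoid to the averaged margin. The function names and scores are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def loss_mean_outside(chosen: list[float], rejected: list[float]) -> float:
    """Hypothetical helper: mean over the batch of -log(sigmoid(r_c - r_r)).

    The mean is taken OUTSIDE the sigmoid: each pair gets its own loss first.
    """
    diffs = [c - r for c, r in zip(chosen, rejected)]
    return -sum(log_sigmoid(d) for d in diffs) / len(diffs)


def loss_mean_inside(chosen: list[float], rejected: list[float]) -> float:
    """Hypothetical helper: -log(sigmoid(mean(r_c - r_r))).

    The margins are averaged first, i.e. the mean sits INSIDE the sigmoid.
    """
    diffs = [c - r for c, r in zip(chosen, rejected)]
    return -log_sigmoid(sum(diffs) / len(diffs))


if __name__ == "__main__":
    chosen = [2.0, 0.0]    # hypothetical per-sequence scores, preferred responses
    rejected = [0.0, 1.0]  # hypothetical per-sequence scores, rejected responses
    print(loss_mean_outside(chosen, rejected))
    print(loss_mean_inside(chosen, rejected))
```

With margins of 2.0 and -1.0, the mean-outside version gives about 0.720 and the mean-inside version about 0.474. Because -log(sigmoid(x)) is convex, the mean-outside loss is never smaller (Jensen's inequality), and averaging the margins first can let one confidently correct pair mask a badly ranked one.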

The loss in reward_model.py

> In our case, it is also a scalar. The vector comes from the batch dimension, not the sequence-length dimension.

Thanks for the reply. I am still confused. I printed the...
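To illustrate the shape argument, here is a minimal Python sketch, assuming the reward model emits one reward per token (shape [batch, seq_len]) and the sequence score is the reward at the last non-padding token. Each sequence then contributes a single scalar, the scores form a vector along the batch dimension, and the loss is the batch mean of that vector. All names and the padding convention are hypothetical, not taken from reward_model.py.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def last_token_scores(token_rewards: list[list[float]], lengths: list[int]) -> list[float]:
    """Hypothetical helper: collapse [batch, seq_len] rewards to one scalar per sequence.

    Picks the reward at the last real (non-padding) token, so the result is a
    vector whose only dimension is the batch dimension.
    """
    return [row[n - 1] for row, n in zip(token_rewards, lengths)]


def pairwise_reward_loss(
    chosen_rewards: list[list[float]],
    chosen_lengths: list[int],
    rejected_rewards: list[list[float]],
    rejected_lengths: list[int],
) -> float:
    chosen = last_token_scores(chosen_rewards, chosen_lengths)        # shape [batch]
    rejected = last_token_scores(rejected_rewards, rejected_lengths)  # shape [batch]
    per_example = [-log_sigmoid(c - r) for c, r in zip(chosen, rejected)]  # shape [batch]
    return sum(per_example) / len(per_example)  # scalar: mean over the batch


if __name__ == "__main__":
    # Hypothetical padded batch of 2 sequences with seq_len = 3 (0.0 is padding).
    chosen_rewards = [[0.1, 0.5, 2.0], [0.3, 1.0, 0.0]]
    rejected_rewards = [[0.2, 0.0, 0.0], [0.4, -1.0, 0.0]]
    print(last_token_scores(chosen_rewards, [3, 2]))  # [2.0, 1.0]
    print(pairwise_reward_loss(chosen_rewards, [3, 2], rejected_rewards, [1, 2]))
```

Swapping `last_token_scores` for a per-sequence average over the real tokens would still yield one scalar per sequence, so the loss keeps the same [batch] to scalar shape either way.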

The loss in reward_model.py

> > > In our case, it is also a scalar. The vector comes from the batch dimension, not the sequence-length dimension.
> > >
> > > Thanks for the...