Nathan Lambert
Just wanted to update everyone in the issue that @bglick13 (mostly) and I made a lot of progress on this in #818 -- feel free to take a...
@luisenp do you have checkpoints of the models used in the benchmarking efforts of the paper? Or do I have to re-train to get them? Thanks for checking! (those will...
Generally my goal for this PR would be to get the code working (actually the easy part), and say you can easily run a couple agents that are pretrained with...
Closed with #178
Yeah @lvwerra it would just be an example / documentation addition I bet. Or, a more advanced option would be to explain the differences a bit for people too.
Haven't done it, happy to review your PR if you make one. Generally, I had written out pseudo code here
```
query_tensors = [query_tensor] * batch_size
model.generate(
    query_tensors,
    return_prompt=training_args.return_prompt,
    generation_config=generation_config,
)
batch["response"] = ...
```
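For context, here's a minimal standalone sketch of what batched generation can look like with plain `transformers` (not TRL's `PPOTrainer.generate`, which the pseudocode above points at); the checkpoint name and prompts are placeholders:
```python
# Minimal sketch of batched generation, assuming a decoder-only HF model;
# the checkpoint name and prompts below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "gpt2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
tokenizer.padding_side = "left"            # left-pad for decoder-only generation
model = AutoModelForCausalLM.from_pretrained(model_name)

query_texts = ["Tell me a joke.", "Summarize RLHF in one sentence."]
inputs = tokenizer(query_texts, return_tensors="pt", padding=True)

generation_config = GenerationConfig(
    max_new_tokens=32,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

with torch.no_grad():
    outputs = model.generate(**inputs, generation_config=generation_config)

# Keep only the newly generated tokens, i.e. strip the (left-padded) prompt.
prompt_len = inputs["input_ids"].shape[1]
responses = tokenizer.batch_decode(outputs[:, prompt_len:], skip_special_tokens=True)
print(responses)
```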
It could be good to make things like this configurable in a branch and learn how these implementation details transfer to RLHF.
Yeah, I'm running residual clipping example(s), we'll see. At least it'll be good to have the option to try both.
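For reference, the "residual clipping" being toggled here is, as I understand it, the clipped value-function update common in PPO implementations; whether that matches this run exactly is an assumption, and the names and `cliprange_value` below are illustrative:
```python
# Hedged sketch of a clipped value-function loss: the new value prediction is
# only allowed to move within a band around the old prediction (the "residual"
# is clipped), and the pessimistic (max) of the two losses is taken.
import torch

def clipped_value_loss(values, old_values, returns, cliprange_value=0.2):
    values_clipped = old_values + torch.clamp(
        values - old_values, -cliprange_value, cliprange_value
    )
    loss_unclipped = (values - returns) ** 2
    loss_clipped = (values_clipped - returns) ** 2
    return 0.5 * torch.mean(torch.maximum(loss_unclipped, loss_clipped))
```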
Residual value prediction didn't help with stability (that's the crimson-wish run).
The other approx KL formulation also wasn't a big help. W&B run [here](https://wandb.ai/natolambert/TRL/runs/e1258rd6?workspace=user-natolambert). Though it's slightly more stable? We'll see how this run finishes converging.
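For reference, two common per-token approximate KL estimators seen in RLHF-style PPO code (the usual k1/k3 formulations); which exact formulation the run above swapped in isn't stated here, so this is only an illustrative sketch:
```python
# Illustrative only: approximate KL between the policy and a frozen reference
# model, computed from per-token log-probs of sampled tokens.
import torch

def approx_kl_k1(logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    # k1 estimator: log(pi / ref); unbiased but noisy and can go negative per token.
    return logprobs - ref_logprobs

def approx_kl_k3(logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    # k3 estimator: (r - 1) - log(r) with r = ref / pi; non-negative, lower variance.
    log_ratio = ref_logprobs - logprobs
    return torch.exp(log_ratio) - 1.0 - log_ratio
```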