Nathan Lambert
What's the right place to add best-of-n sampling and compare its impact to some existing methods? Some references: * Discussed in [reward model scaling laws paper](https://arxiv.org/abs/2210.10760), * OpenAI...
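For context, best-of-n sampling is simple enough to sketch without any TRL machinery: sample n candidate completions and keep the one the reward model scores highest. A minimal, dependency-free sketch, using toy stand-ins for the generator and reward model (the real versions would be `model.generate` and a trained reward model; the names below are illustrative):

```python
from itertools import cycle

def best_of_n(prompt, generate, reward, n=4):
    """Draw n candidate completions for `prompt` and return the one
    with the highest reward score (ties broken by first occurrence)."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)

# Toy stand-ins for a language model and a reward model (assumptions:
# any callables with these shapes work here, not Simulate/TRL APIs).
_options = cycle(["good", "okay", "great"])
toy_generate = lambda prompt: prompt + " " + next(_options)
toy_reward = lambda text: {"good": 1.0, "okay": 0.5, "great": 2.0}[text.split()[-1]]

print(best_of_n("The movie was", toy_generate, toy_reward, n=3))
# -> The movie was great
```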
Will share results, but experiments for #101 #122 #121
I'm comparing the PPO implementation to the OpenAI one and the [implementation details blog post](https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/) that goes through it. Wondering if some of these things improve performance. If not, it's...
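As an illustration of the kind of detail being compared, here is a dependency-free sketch of one item the blog post catalogues, per-batch advantage normalization (whitening advantages before the policy loss); whether this particular detail moves the needle here is exactly the open question:

```python
import statistics

def whiten(advantages, eps=1e-8):
    """Per-batch advantage normalization: subtract the batch mean and
    divide by the batch std before computing the PPO policy loss."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    return [(a - mean) / (std + eps) for a in advantages]

print(whiten([1.0, 2.0, 3.0]))
```

The whitened batch has zero mean and unit variance, which keeps the policy-gradient scale stable across batches.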
Two changes: 1. Pass the optimizer in the sentiment example (currently the variable was not passed into the trainer). 2. [I think] fix the kwarg option for the wandb config of `Accelerate`. See...
In the toxicity [script](https://github.com/lvwerra/trl/blob/b75d83ab28b59307916beb425207d46406502f11/examples/summarization/scripts/reward_summarization.py) should the `optimizer` be passed to the PPOTrainer -- or omitted? Found this because I'm dealing with optimizer setup for H4 by copying the code over....
Installing the basic package from source with pip does not install the quality/style requirements.
Essentially, how do we register a custom Gym environment for a packaged Simulate environment? E.g. https://stackoverflow.com/questions/52727233/how-can-i-register-a-custom-environment-in-openais-gym
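For reference, the usual pattern the linked answer describes is to call Gym's `register` in the package's top-level `__init__.py`, so the registration runs whenever the package is imported. A sketch under that assumption (package and class names below are illustrative, not Simulate's actual API):

```python
# my_sim_package/__init__.py
# Registering here means `import my_sim_package` makes the env id
# available to gym.make (names are hypothetical placeholders).
from gym.envs.registration import register

register(
    id="MySimEnv-v0",                            # id users pass to gym.make
    entry_point="my_sim_package.envs:MySimEnv",  # "module.path:ClassName"
    max_episode_steps=200,
)
```

After `import my_sim_package`, `gym.make("MySimEnv-v0")` resolves the entry point; the open question is where this registration hook should live for a packaged Simulate environment.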
Collider meshes for non-convex polygons currently require either rebuilding the polygon out of invisible components, or an advanced integration of a V-HACD algorithm to decompose a non-convex mesh into a convex set...
Not sure of the best way to handle this in `setup.py`.
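One common way to handle dev-only tooling in `setup.py` is setuptools' `extras_require`, so the default install stays lean and contributors opt in explicitly. A sketch, assuming this is the direction wanted (package name and dependency lists below are illustrative):

```python
# setup.py sketch: runtime deps go in install_requires; style/quality
# tools become an optional extra instead of a default requirement.
from setuptools import setup, find_packages

setup(
    name="simulate",
    packages=find_packages(),
    install_requires=["numpy"],  # runtime dependencies only
    extras_require={
        "quality": ["black", "flake8", "isort"],  # dev-only tools
    },
)
```

Contributors would then run `pip install -e ".[quality]"` to get the style requirements, while a plain `pip install .` from source skips them.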