Adam Gleave
> I have no idea what is going on here, I am getting errors on CircleCI that seem so weird and don't show up on my local machine. Before I...
Thanks for the updates! Moving `regularization/__init__.py` to some sub-module in `regularization` seems fine to me as an alternative to `util` (or `utils`...). Main question is whether we expect to add...
> I wonder what actually happens to the loss gradient when adding an L1 norm penalty, since it's not differentiable. Does pytorch compute subgradients? @AdamGleave Yeah, it uses subgradients at...
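To make the subgradient concrete: away from zero the derivative of |w| is sign(w), and at the kink w == 0 autograd frameworks conventionally pick 0 from the subdifferential [-1, 1] (this is what PyTorch's `abs` does). A minimal plain-Python sketch, not the PyTorch implementation:

```python
def l1_subgradient(weights):
    """Subgradient of sum(|w|) over a list of scalar weights: sign(w),
    choosing 0 at the non-differentiable point w == 0."""
    return [1.0 if w > 0 else -1.0 if w < 0 else 0.0 for w in weights]

l1_subgradient([-2.0, 0.0, 3.0])  # -> [-1.0, 0.0, 1.0]
```

Choosing 0 at the kink means an L1 penalty exerts no gradient pressure on weights that are exactly zero, which is part of why it tends to keep sparse weights sparse.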
I agree 1) should be fixed, just making the deterministic policy consistent between both (likely defaulting to False) seems fine for now. For 2) I think we should test empirically...
I don't think we want PPO to be deterministic. If I understand correctly, rollouts collected for purpose of RL training will always need to be stochastic (this is where the...
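A minimal sketch of the two action-selection modes being discussed (a hypothetical helper, not the stable-baselines3 interface): on-policy rollout collection samples from the policy distribution, while greedy argmax is only appropriate for deterministic evaluation:

```python
import random

def select_action(action_probs, deterministic=False):
    # deterministic=True: greedy argmax, suitable for evaluation only.
    # deterministic=False: sample from the policy distribution, which
    # rollouts collected for on-policy RL training need (exploration,
    # and the gradient estimator assumes actions are drawn from the
    # current stochastic policy).
    if deterministic:
        return max(range(len(action_probs)), key=lambda a: action_probs[a])
    return random.choices(range(len(action_probs)), weights=action_probs)[0]
```

Under this framing, a `deterministic` flag on PPO would only ever apply at evaluation time, never during training rollouts.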
> Since L2 regularization is not the same as weight decay, should we implement L2 penalty or a weight decay? Good question, unfortunately the answer seems a bit unclear. The...
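One way to see why the answer is unclear: for plain SGD, adding (λ/2)·‖w‖² to the loss and applying decoupled weight decay produce the same update, but for adaptive optimizers like Adam they diverge, because the L2 penalty's gradient gets rescaled by the second-moment estimate while decoupled decay does not (the AdamW distinction). A single-parameter sketch of the SGD case, with made-up values:

```python
def sgd_l2_step(w, grad, lr, lam):
    # L2 penalty folded into the loss: gradient becomes grad + lam * w.
    return w - lr * (grad + lam * w)

def sgd_decay_step(w, grad, lr, lam):
    # Decoupled weight decay: shrink w directly, separate from the gradient.
    return w - lr * grad - lr * lam * w

# For SGD the two coincide (up to floating-point rounding):
a = sgd_l2_step(1.0, grad=0.5, lr=0.1, lam=0.01)
b = sgd_decay_step(1.0, grad=0.5, lr=0.1, lam=0.01)
```

So if we only support Adam, the choice between the two is a real behavioural decision, not just a naming one.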
> 1. If we only plan to support Adam as an optimizer, we can write a custom optimizer class that wraps Adam and 'cleans up' the hackiness and separates the...
Hi, Yes, contributions are welcome! Especially as the reference implementation looks to [not be free software](https://github.com/Div99/IQ-Learn), so having an open-source implementation of this would be valuable. Although this does mean...
I think prominently placed in the documentation should be sufficient.
Thanks for the PR! Changes look reasonable to me at a high level; my suggestions are fairly minor and largely to do with improving clarity. I'm tagging @levmckinney to...