
Human-in-the-loop and/or Reward modeling

Open DrTtnk opened this issue 4 years ago • 4 comments

Is your feature request related to a problem? Please describe.
Defining a reward function may be complex or outright impossible in some cases (e.g. an agent doing a back-flip or walking naturally). In other cases, the agent may hack a sub-optimal, or even an optimal, solution to the reward function if the reward is not properly tuned, which can be very hard to debug (see the last link below).

So: why not use the rest of the game engine, input system included?

Describe the solution you'd like
A possible solution would be to implement the OpenAI paper "Deep Reinforcement Learning from Human Preferences" for training: add the bare minimum of functionality to the ML framework to sample from the current solution space and wait for user input that improves the hard-coded reward function.

Hopefully this would only require small changes to the train function to adapt it to a more general reward function. The hardest part would be running inference during training, as described in the paper.

The perfect solution would be the option to put a human in the loop for every kind of training available in the Unity ML framework, with the ability to decide if and when the agent needs user feedback to aid the base reward function.
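
To make the idea concrete, here is a minimal sketch of the paper's reward-modelling scheme, assuming a plain PyTorch setup; the names (`RewardModel`, `preference_loss`, the segment tuples) are illustrative and not part of the ml-agents API.

```python
# Minimal sketch of the reward-modelling idea from Christiano et al. (2017).
# Illustrative only: names and shapes are assumptions, not ml-agents APIs.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Predicts a scalar reward r_hat(s, a) to replace or augment a hand-coded reward."""
    def __init__(self, obs_size: int, act_size: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size + act_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(model: RewardModel, seg_a, seg_b, human_pref: float) -> torch.Tensor:
    """Bradley-Terry loss over a pair of trajectory segments.

    seg_a / seg_b: (obs, act) tensors of shape [T, obs_size] / [T, act_size].
    human_pref: 1.0 if the human preferred segment A, 0.0 if B, 0.5 if indifferent.
    """
    ret_a = model(*seg_a).sum()   # predicted return of segment A
    ret_b = model(*seg_b).sum()   # predicted return of segment B
    p_a = torch.sigmoid(ret_a - ret_b)  # P(A preferred over B) under the model
    return -(human_pref * torch.log(p_a + 1e-8)
             + (1.0 - human_pref) * torch.log(1.0 - p_a + 1e-8))

# During PPO/SAC training, environment rewards would be replaced (or mixed) with
# RewardModel(obs, act), while pairs of short rollout clips are periodically
# shown to the human, e.g. in the Unity editor, for comparison.
```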

Describe alternatives you've considered
The DIY way.

Additional context
Two Minute Papers intro: https://www.youtube.com/watch?v=WT0WtoYz2jE
OpenAI blog post: https://openai.com/blog/deep-reinforcement-learning-from-human-preferences/
Original paper: https://arxiv.org/pdf/1706.03741.pdf
A very simple example of an agent hacking a sub-optimal solution to an imperfect reward function: https://youtu.be/2gtAnyCpLnM?t=215

DrTtnk avatar Jul 10 '20 00:07 DrTtnk

Hi @DrTtnk

Thank you for the request. This is an interesting feature that I will discuss with the team. I will update this thread with a resolution in the next week.

In the meantime, do you have a particular behavior for which you cannot design a reward function?

andrewcoh avatar Jul 10 '20 17:07 andrewcoh

Hello @andrewcoh

I'm definitely looking forward to your and your team's response.

As for some examples:

  • Anything that matches a definition like "it looks natural", "it's something I like", or "it feels right", which is very hard to pin down in strictly mathematical, algorithmic terms. For instance, in the Unity arm example I can define the reward function "move toward the point X in space" in C#, but the reward function "and move in an elegant fashion" is much more challenging. I could define one using the accelerations of the joints and the end effector, the derivative of the acceleration, and many other terms, but apart from having no idea whether those are the right things to fold into the reward function, they are also free parameters I would have to tweak over multiple training sessions, and they may still lead to sub-optimal solutions or reward hacking by the agent (see the sketch after this list for what such a hand-tuned shaping reward tends to look like).

  • Another example: a bouncer with one leg, three joints, and a weight on top. I want to make it jump toward a goal with a "natural-looking" bounce. I can define a relatively simple reward based on jump length and tweak it until I like it: long, but not impossible. The problem arises if I want to train the model on the whole class of possible jumpers, with variable weights and link lengths. Now the reward function has to depend on the mechanical parameters of the body and needs to be heavily modified to fit the new requirements (a 20-metre-tall, ten-ton jumper should look different from one 100 times smaller and 1000 times lighter; what would an ideal reward function modelling all these behaviours at once even look like?).

  • Pure debugging: this may just be a side effect of the human-in-the-loop, but during training I could refine my requests by judging whether the agent is performing well, without an explicit function. Afterwards I could either derive an old-fashioned reward function by observing the results, or simply use the trained model as-is, running the training for a few hours or days instead of weeks with multiple re-trainings for the harder-to-define tasks. Take throwing a ball: the reward function might be "the ball has to reach the target as fast as possible". After a few hours I may realise that the clause "and the agent cannot leave the throwing area" is needed to stop the agent from simply running toward the target, but I will probably only realise this after a whole training session and then have to update everything and redo it for many more hours. Then, if I want to extend the reward function with "and the ball has to bounce N times on the floor before reaching the target", I may have to take many more variables into account to get the expected outcome.
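
For reference, this is roughly what the hand-tuned "elegance" shaping reward from the first example ends up looking like. It is an illustrative Python sketch only (in ml-agents the equivalent logic would live in the C# Agent); every weight and term in it is an assumption that would need re-tuning across runs, which is exactly the burden a learned, preference-based reward would remove.

```python
# Illustrative only: a hand-tuned "elegance" shaping reward for a Reacher-style arm.
# The weights w_dist, w_accel, w_jerk are free parameters, not values from any example.
import numpy as np

def shaped_reward(tip_pos, target_pos, joint_vel, prev_joint_vel, prev_joint_accel, dt,
                  w_dist=1.0, w_accel=0.05, w_jerk=0.01):
    """Reward = "move toward the target" minus penalties meant to encode "move elegantly"."""
    distance = np.linalg.norm(target_pos - tip_pos)
    joint_accel = (joint_vel - prev_joint_vel) / dt       # finite-difference acceleration
    joint_jerk = (joint_accel - prev_joint_accel) / dt    # derivative of acceleration
    reward = (-w_dist * distance
              - w_accel * np.square(joint_accel).sum()    # penalise harsh accelerations
              - w_jerk * np.square(joint_jerk).sum())     # penalise jerky motion
    return reward, joint_accel  # caller keeps joint_accel for the next step's jerk term
```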

DrTtnk avatar Jul 11 '20 10:07 DrTtnk

Hi @andrewcoh

Any news from you and your team?

DrTtnk avatar Jul 24 '20 11:07 DrTtnk

Hi @andrewcoh (part 2) :D

Still hoping for some news from your team, whether the request ends up accepted or rejected.

DrTtnk avatar Jul 02 '21 13:07 DrTtnk

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

github-actions[bot] avatar Nov 04 '22 20:11 github-actions[bot]