PokemonRedExperiments

RLHF to train reward model?

Open Iron-Bound opened this issue 2 years ago • 3 comments

For the reward function, would we be interested in using RLHF to train a dedicated reward model? From my research, we can do this in either of two ways:

Have a human compare short clips of gameplay and select the preferred one.

Use video from a speedrun or from a human playing live.

I ask because my training got stuck in Oak's lab for 50 iterations.

I've been thinking about how to reward behaviors without hard-coding them: running away when low on health, avoiding trainers, handling one-way paths, not buying that magic carpet, etc.
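To make the first option (a human picks the preferred clip out of a pair) concrete, here is a minimal sketch of fitting a reward model from pairwise preferences with a Bradley-Terry loss. Everything in it is an assumption for illustration and not part of this repo: the function names, the linear reward, and the two hypothetical per-frame features (`explored_new_tile`, `stood_still`). A real setup would more likely use a neural network over game frames or RAM state.

```python
import math

def clip_reward(weights, clip):
    """Sum of per-frame linear rewards for one clip (a list of feature vectors)."""
    return sum(sum(w * x for w, x in zip(weights, frame)) for frame in clip)

def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def preference_loss(weights, preferred, rejected):
    """Bradley-Terry negative log-likelihood that `preferred` beats `rejected`."""
    diff = clip_reward(weights, preferred) - clip_reward(weights, rejected)
    # -log(sigmoid(diff)) written as a stable softplus(-diff)
    return math.log1p(math.exp(-abs(diff))) + max(-diff, 0.0)

def train_reward_model(pairs, n_features, lr=0.05, epochs=200):
    """Fit linear reward weights to (preferred_clip, rejected_clip) pairs by SGD."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in pairs:
            diff = clip_reward(weights, preferred) - clip_reward(weights, rejected)
            # Derivative of -log(sigmoid(diff)) with respect to diff
            grad_scale = -_sigmoid(-diff)
            for i in range(n_features):
                feat_diff = sum(f[i] for f in preferred) - sum(f[i] for f in rejected)
                weights[i] -= lr * grad_scale * feat_diff
    return weights

# Hypothetical features per frame: [explored_new_tile, stood_still]
explore_clip = [[1.0, 0.0], [1.0, 0.0]]
idle_clip = [[0.0, 1.0], [0.0, 1.0]]

# One human label: the exploring clip was preferred over the idle one
pairs = [(explore_clip, idle_clip)]
weights = train_reward_model(pairs, n_features=2)
```

The learned `clip_reward` could then stand in for (or be added to) the hand-written reward during RL training. With a single labeled pair this only shows the mechanics; how many labels a useful model needs is an open question.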

Iron-Bound, Oct 22 '23 10:10

Do you still get stuck in the lab with the new fast training script? It should get out of there much more quickly.

But yes, I have been thinking a bit about reward modeling / RLHF, and that would be really cool! It would certainly be a serious amount of work to set up and get working, and it would require a ton of labeling, but it could potentially address a lot of challenges. It also opens up the chance to involve a lot more non-technical folks who are interested in contributing to the project. Brings back more of the "twitch plays pokemon" elements.

PWhiddy, Oct 22 '23 16:10

> Do you still get stuck in the lab with the new fast training script?

It's much better now and a welcome surprise 😁

> Brings back more of the "twitch plays pokemon"

Sentdex did a GTA 5 bot, which also had a reset function.

ATM I'm trying to find existing frameworks to do the HF part of this, and the closest I've found has been in robotics.

I'm thinking maybe the interactive mode could be modified as well, or we could build a sandbox to train on Mt. Moon?

Iron-Bound, Oct 22 '23 17:10

I think it does not necessarily require a ton of labeling, but it will need the game to have long-term memory.

trantrikien239, Nov 08 '23 18:11