10-21-23 Updated Version feedback
Seeing great performance with this update, beat the first gym leader at the two-hour mark!
Stats: 1400 fps (up from 700), 10-12 GB system RAM used, 13% GPU usage.
Nice!! That's great. I tweaked the exploration reward scale in this version, so that could be a factor for getting it back into Mt. Moon.
@Iron-Bound how do you specify to run on GPU?
It's picking it up automatically.
For details, I'm running a 13700k/32gb ram/7900 xtx system with the rocm 5.7 container.
Then it doesn't like either of my NVIDIA cards. It hasn't touched my cards with the old or new version.
Ran the new "fast" version and it's running better. Running it with 24 cores has me at 47% CPU (down from 80-85%) and 53 GB of RAM (down from about 90 GB).
Sadly, old training data can't be used.
I saw in the code that a weight was given for exploring, which is why the system is getting out of the starting room faster and playing the game. But with that weight in place, the AI will prioritize exploring over everything else, instead of everything being equal and letting it decide how to focus on the game... so I took it out of my model for a cleaner study.
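Roughly, the change amounts to zeroing out whatever the exploration term is scaled by; a sketch, assuming the env config exposes such a weight (the key name 'explore_weight' here is hypothetical and may not match the actual code):
env_config = {
    'headless': True,
    'max_steps': 2048 * 10,
    'explore_weight': 0.0,  # hypothetical key: 0 removes the exploration bias
}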
It should mention using CUDA when starting the script. Can you run PyTorch and check that CUDA is available?
import torch
# True means PyTorch can see a CUDA-capable GPU
is_cuda = torch.cuda.is_available()
print(is_cuda)
I know CUDA is usable on my server, especially since I already have AI workloads running on it. But I ran the test as you asked, and it came back true.
@setomage that does sound weird. Also, you said you have multiple cards: first, could you test the inference script and check GPU usage, and second, see if you can set the device used in the PPO function or anywhere else?
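For reference, a minimal sketch of checking which GPUs PyTorch sees and pinning PPO to a specific card via stable-baselines3's device argument (the CartPole env is just a stand-in for the project's env):
import torch
import gymnasium as gym
from stable_baselines3 import PPO

# list every CUDA device PyTorch can see
print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])

env = gym.make('CartPole-v1')  # stand-in for the Pokemon env
model = PPO('MlpPolicy', env, device='cuda:0')  # 'cuda:1' would target the second card
print(model.device)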
Noticing that after the first run in the fast version, the system isn't starting over again like it used to. Running it again, and will see what happens overnight.
@Iron-Bound I'll try your idea in the morning/afternoon.
I had many of my own updates to the original baselines and environment files, such as changing the logging output to show each of the CPUs updating rewards in real time, but I saw you posted this updated version that is meant to be faster, and I'm finding the same result (although to give you credit, it's true that this result happens much faster after adding in your updates).
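A rough sketch of that kind of per-CPU reward logging, using a stable-baselines3 callback (names and the print format are illustrative, not the exact changes described above):
from stable_baselines3.common.callbacks import BaseCallback

class PerEnvRewardLogger(BaseCallback):
    """Print each parallel env's step reward every print_freq steps."""
    def __init__(self, print_freq=2048):
        super().__init__()
        self.print_freq = print_freq

    def _on_step(self) -> bool:
        if self.n_calls % self.print_freq == 0:
            # self.locals['rewards'] holds one reward per parallel env
            for i, r in enumerate(self.locals['rewards']):
                print(f'env {i}: step reward {r:.3f}')
        return True

# attach with: model.learn(total_timesteps=..., callback=PerEnvRewardLogger())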
What happens for me is the AI, after an iteration, figures out that going outside of Red's house is good, then within 1-2 iterations finds its way to the grass patch and is picked up by Prof. Oak. He even selects the pokemon with precision (and I have it rewarding double if it chooses Squirtle, because I want to watch it learn to play with Squirtle).
But then what happens is, about 50% of the time, after 2-3 iterations it is beating Blue in the rival battle and then starts exploring outside the lab. I watched a couple of the non-headless runs that made it to Route 1 and faced some Rattata and Pidgey and were rewarded heavily for it, but that's it. That seems to be the farthest it gets no matter what metric I try or how long I wait. I leave it on for a couple of hours after that and soon I see it reverting and becoming less and less likely to even make it to the grass patch event with Oak. If I let it go overnight, most of the time they don't even make it out of the first room.
I'm new to reinforcement learning, but my understanding is this has to do with the entropy/randomness/exploration of the algorithm weights? This is a very complex game with near-infinite combinations of buttons to press once you're over 5k steps, and in order for this to get far it should be a lot less random so it continues to follow the path it learned was incredibly rewarding. Am I the only one experiencing this? My goal is to eventually have it do a complete run from start to finish, so I am starting from the init state since it feels cheaty to start after getting the pokedex already. Is everyone else just doing the skip, and that's why my AI keeps doing this? It doesn't make sense to me. The reward system is working just fine: it gets like 4 times the reward when it goes the way we want it to, yet it just ignores that for the most part at a certain point. Very disheartening. :(
I haven't grabbed the new version yet, but I have been fiddling with a negative reward for making a movement that doesn't move the player to encourage better exploration over more exploration. I am happy to share the memory addresses I used. There is more to add into the logic that I haven't added yet. It still does things like give a negative exploration reward when pressing a directional button during a battle start, which I don't want. I also haven't refined the negative reward amount enough to really discourage walking into a wall for 50 frames. I have also been toying with the idea of a positive reward for catching new pokemon or evolving a pokemon. Putting this in may encourage the model to catch new pokemon and (hopefully) use them. I am also getting some beefier hardware to run this on so I can get more training in to see if it just takes more iterations to get things moving.
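Not exact code, but a rough sketch of that kind of didn't-move penalty, assuming a PyBoy-backed env with a read_m(addr) byte reader like the project's (the addresses are the commonly used player X/Y/map bytes for Pokemon Red; the penalty size is a placeholder):
X_POS_ADDR, Y_POS_ADDR, MAP_N_ADDR = 0xD362, 0xD361, 0xD35E

def movement_penalty(env, prev_coords, penalty=0.01):
    # read the player's current tile and map
    coords = (env.read_m(X_POS_ADDR), env.read_m(Y_POS_ADDR), env.read_m(MAP_N_ADDR))
    # TODO: skip the penalty while a battle or menu is open, as noted above
    if prev_coords is not None and coords == prev_coords:
        return -penalty, coords  # pressed a button but didn't move
    return 0.0, coords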
Are you starting from the init state or from the skip with a pokedex already? I'm starting at init and tried a couple of the things you mentioned, but it still rarely gets to Route 1 after days of training, with almost no progress at all. It takes only 2 iterations for it to get to taking the starter pokemon, but it never seems to learn from that and continue onward. It just seems to randomly get to that point or slightly further over and over again, and I'm not sure how what I'm doing is any different from everyone who is making progress.
I have been starting from the skip of having a pokedex already. The tough part before the pokedex is the backtracking, and the logic I use is not conducive to the large amount of talking Oak does.
Training will depend on how many games you're running at a time, the ep_length, and how long you keep going (continuing training from the most recent data).
I'm running 24 instances (24 CPUs) with 8192 steps (ep_length 8192 * 10), and I run it almost all the time. Since the fast update came out, I'm getting 200-400 points per run right now. With that said, I'm thinking it'll take about a month to get something like what @PWhiddy showed on YouTube.
So depending on how many games you're running and how many steps, this will greatly change how fast the AI learns.
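To make that concrete, a rough sketch of how those knobs show up in a stable-baselines3 setup (the CartPole env stands in for the game env, and the exact names/values are illustrative rather than the project's):
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv

num_cpu = 24            # parallel game instances, one per core
ep_length = 8192 * 10   # steps each instance plays per episode

def make_env(rank):
    def _init():
        return gym.make('CartPole-v1')  # the project builds its game env here
    return _init

if __name__ == '__main__':
    env = SubprocVecEnv([make_env(i) for i in range(num_cpu)])
    # more instances and longer episodes = more experience per update
    model = PPO('MlpPolicy', env, n_steps=2048, batch_size=512, gamma=0.999)
    model.learn(total_timesteps=ep_length * num_cpu)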
I understand the metrics you're talking about, but that doesn't increase the efficiency of the algorithm/policy; that's just going to speed it up. I'm talking about something wrong with the policy. I'm not getting to the point where they'd have to give the parcel back to Oak; that's a problem for another day. The bigger problem I'm having is it appears to not follow the current maximum nearly enough. Once one core gets 10x everyone else (i.e. 250 points in 4k steps compared to like 15), more and more cores should follow that in each following iteration. Obviously it can't do that 100% of the time because it would get stuck in a local maximum, but it seems to almost never actually follow the path that was by far the most rewarding. I'm sure it's possible that after a year of it trying almost every combination of buttons it will make it past Mt. Moon, but that isn't really learning at that point, it's basically process of elimination. The policy should have a higher bias toward existing high-reward paths, but everything I've tried hasn't helped. Was hoping someone with more RL experience would know.
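(For what it's worth, the PPO knobs that question is circling are the entropy bonus and the discount; a rough sketch with stable-baselines3 argument names, values purely illustrative, not a recommended fix:)
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make('CartPole-v1')  # stand-in for the game env
model = PPO(
    'MlpPolicy', env,
    ent_coef=0.005,  # smaller entropy bonus -> policy commits harder to what already pays off
    gamma=0.999,     # high discount so rewards thousands of steps away still pull the policy
)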