LuxPythonEnvGym

[Discussion] The learning design

Open nosound2 opened this issue 4 years ago • 23 comments

I have been thinking about the learning design implemented here, and there are two questions I can't resolve for myself. The core function for learning is the environment step function. The chain of learning is [OBS_UNIT1 -> ACTION1 -> REWARD -> OBS_UNIT2 -> ACTION2 -> OBS_UNIT3 -> ACTION3 ... -> ALL TURN ACTIONS ARE ACTUALLY TAKEN] -> [THE SAME FOR THE NEXT TURN ...]. The questions are:

  1. Less important. Only the first action gets a reward. Doesn't this create significant problems, especially when the number of units per turn is large? It matters most when the discount factor gamma is small, but also in general. Even this intermediate reward is delayed for most actions. I wonder how much harder life is for the model because of this. One more thing: the order in which the units act can be important. I can imagine that the model can handle it. But is there an example of a multi-unit problem designed like this?

  2. More important. Algorithms like TD(0), Q-learning, and more involved ones like PPO all depend, for the model update, not only on the current state (or state-action pair) but also on the next one. But the next step belongs to a different unit: its observation is unit-dependent, and its value function is completely different and barely related. The process is basically not Markovian; the states carry heavily incomplete information, and different incomplete information each time. Isn't that a no-go? Or do I misunderstand something major?
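To make the second question concrete, here is a minimal sketch of a tabular TD(0) update. It is not the environment's code: the observation keys `unit1_obs` and `unit2_obs` and the helper `td0_update` are hypothetical names used only for illustration. The point it shows is that the update for unit 1's observation bootstraps from whatever observation comes next, which in this design belongs to a different unit.

```python
def td0_update(value, obs, reward, next_obs, gamma=0.995, lr=0.1):
    """One tabular TD(0) update: V(s) += lr * (r + gamma * V(s') - V(s))."""
    target = reward + gamma * value.get(next_obs, 0.0)
    td_error = target - value.get(obs, 0.0)
    value[obs] = value.get(obs, 0.0) + lr * td_error
    return td_error


# Hypothetical per-unit observation keys within a single turn.
value = {"unit1_obs": 0.0, "unit2_obs": 5.0}

# Unit 1's estimate is pulled toward unit 2's value, even though the two
# observations describe different units.
err = td0_update(value, "unit1_obs", reward=1.0, next_obs="unit2_obs")
print(err, value["unit1_obs"])
```

Whether this coupling hurts in practice depends on how much shared (turn-level) information both observations contain.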

Please share your thoughts!

nosound2 avatar Oct 01 '21 18:10 nosound2

Some personal thoughts, I am not an expert on this:

  1. Personally I think this part is OK because of the large gamma. It's quite common to get the reward many steps after the action that caused it. The default gamma is 0.995, so if my understanding is correct an action will get a factor of 0.995^100 ≈ 0.61 of a reward given 100 steps later. This matches up pretty closely with the OpenAI Five setup, where micro-rewards are sparse and gammas are large: https://openai.com/blog/openai-five/
  2. In the case of PPO, it seems to have two model components. The first predicts how good each possible action is given the current state. I think that makes sense for our scenario: it takes the current unit's observations and guesses how likely each action is to affect the discounted reward. But the advantage function relies on a critic model that predicts just the value given the state. This second part, it seems, is the state value function, where, like you said, I think it's more of a problem. It compares the value of the current state with the value of the next step's state, and extends that into the future with the gamma discount factor. The observation of the example agent purposely includes values like num_units, num_cities, etc. to help with this state value estimate, but, like you said, these only change each turn (not each action). So let's say we had an action that was good. I guess the advantage value from the action would be approximately something like `(value(state current)-value(state 1 step ahead))+gamma*value(state 2 step ahead)-value(state 3 step ahead)+gamma^2*...`. Imagine its action was building a city, and we get the reward for building a city ~4 steps later once the next turn starts: it'll still be included in the advantage calculation of that action despite it being a bunch of states earlier. So maybe the key here is just to make sure to include lots of good game observations outside of the unit observations so that the state value critic model can perform better? I had been hoping my CNN experiments, where the whole map drives the observation, would work better, since the model can fully observe the state of the game at each decision. Unfortunately I couldn't get it to outperform the simple models.
Also worth noting is that the OpenAI Five model had incomplete state information, but not the swapping of the observed unit at each step. They had one head of the model per Dota hero; mapping this to our problem would be very hard, since we have a variable number of heads that we'd have to attach and detach as units get created and die.
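The two claims above (the 0.995^100 factor, and a reward a few steps later still reaching the earlier action's advantage) can be checked with a small sketch. It uses the standard generalized advantage estimation recursion, which is the estimator Stable-Baselines3's PPO applies; the function names, the toy reward sequence, and the all-zero value estimates are assumptions made for illustration only.

```python
def discount_weight(gamma, delay):
    """Weight that a reward `delay` steps in the future receives."""
    return gamma ** delay


def gae_advantages(rewards, values, gamma=0.995, lam=0.95):
    """Generalized advantage estimation.

    `values` must have len(rewards) + 1 entries; the last one is the
    bootstrap value of the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


print(round(discount_weight(0.995, 100), 3))

# Toy case: the "build city" action is step 0, the reward arrives 4 steps later.
rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
values = [0.0] * 6  # an uninformative critic, assumed for simplicity
adv = gae_advantages(rewards, values, gamma=0.995, lam=1.0)
print(adv[0])
```

With `lam=1.0` the first action still receives almost the full reward; with a smaller `lam` the contribution shrinks by an extra factor of `lam` per step.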

glmcdona avatar Oct 01 '21 21:10 glmcdona

Hi @glmcdona , thanks for the feedback. Regarding 1, I agree that it should be OK. For the second point, you raise an interesting point about including more global game observations. Too risky though, I wouldn't want to try to make it work.

The CNN approach is what I also want to test. Have you just used CnnPolicy or something else? I want to take the network from that famous imitation learning CNN notebook and plug it in instead. If I checked correctly, CnnPolicy is only 4 CNN layers. Additionally, I want the value part to have the same input for every step within a turn, and to give the additional input (unit location) only to the action part. I will let you know if it works for me!
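As a rough sketch of that last idea (not an implementation from this repo), the split could look like the following. The helper `build_inputs`, the feature layout, and the normalization by `map_size` are all assumptions for illustration: the critic input depends only on turn-level features, while the actor input appends the acting unit's position.

```python
def build_inputs(turn_features, unit_pos, map_size):
    """Return (critic_input, actor_input) for one unit's decision step."""
    # Critic sees only turn-level features, so it is identical for every
    # unit that acts within the same turn.
    critic_input = list(turn_features)
    # Actor additionally sees which unit is acting (normalized position).
    x, y = unit_pos
    actor_input = critic_input + [x / map_size, y / map_size]
    return critic_input, actor_input


turn_features = [3.0, 1.0, 0.25]  # e.g. unit count, city count, day progress
critic_a, actor_a = build_inputs(turn_features, (4, 8), map_size=16)
critic_b, actor_b = build_inputs(turn_features, (12, 2), map_size=16)
print(critic_a == critic_b, actor_a != actor_b)
```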

nosound2 avatar Oct 02 '21 11:10 nosound2

@nosound2. Yeah, I think the built-in default CnnPolicy isn't a good fit. You can define your own layers: https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html

I just shared an example notebook with you on Kaggle that I've been using. I've tried a few architectures, and the latest one is inspired by that imitation learning notebook model layout. Note, although it did work, it didn't get to as high a reward solution as the simple non-CNN example in this repo. I'm personally working on implementing a solution more similar to the OpenAI Five observation setup now.

glmcdona avatar Oct 02 '21 16:10 glmcdona

Ok, very interesting, I am reading your notebook now. Just a small remark, I believe theoretically it is called "private sharing", which is not allowed. Let's refrain from this in the future (as long as we are not in the same team!).

nosound2 avatar Oct 02 '21 16:10 nosound2

Oh wow, I didn't know we couldn't share code with each other if we aren't on the same team! Thanks for the heads up.

I'll get a proper run of that notebook done and share it publicly.

glmcdona avatar Oct 02 '21 16:10 glmcdona

A few comments on that notebook

  1. Good that you didn't use compressed_map_observation at the end ;). It is probably not good because of what we discussed.
  2. I like how you build the observation layers.
  3. The CNN model that you built is very strange. Three nn.Conv2d layers in a row without activations in between, no batch norm, two max pools, and none of them is at the end, no skip connections. It is far away from all designs that I know.
  4. It is nice how you allow passing different types of observations, I will try to use it too. But do you use only self.obs['map']? For example all these global arguments, like night/day etc., can be good to concatenate to the output of CNN, instead of creating a separate layer. It seems like everything is ready for this too.

nosound2 avatar Oct 02 '21 18:10 nosound2

Are you on the competition discord server @nosound2 ?

Regarding the architecture and whether or not to use skip/residual elements: the current "miner-state" has ~100 values (order of magnitude), while any output of a CNN feature extractor is likely to be >10k values. Fancy architectures are great, but the training time (and hyperparameter selection) quickly gets out of hand (at least in my attempts).

I'm currently trying to inject as much human knowledge as is reasonable into the observation, to reduce what has to be learned from scratch and improve training speed.

royerk avatar Oct 02 '21 18:10 royerk

> 3. The CNN model that you built is very strange. Three nn.Conv2d layers in a row without activations in between, no batch norm, two max pools, and none of them is at the end, no skip connections. It is far away from all designs that I know.

This is similar to a basic VGG16 model architecture, though it looks like it should apply a ReLU after every 3x3 conv, e.g.: https://neurohive.io/en/popular-networks/vgg16/

> 4. It is nice how you allow passing different types of observations, I will try to use it too. But do you use only self.obs['map']? For example all these global arguments, like night/day etc., can be good to concatenate to the output of CNN, instead of creating a separate layer. It seems like everything is ready for this too.

Yup, you are describing an earlier version of that notebook! I modified it to incorporate everything into the CNN layers to more closely match the imitation learning setup in case it helped. The original design had them added at the FC layer instead of adding them as layers to the CNN input.

glmcdona avatar Oct 02 '21 19:10 glmcdona

Here is the example notebook shared now: https://www.kaggle.com/glmcdona/python-environment-ppo-cnn-rl-example

Note that for Kaggle submission, main_lux-ai-2021.py needs to be edited to specify the feature extractor in the model load operation, e.g. something like this:

from stable_baselines3 import PPO
from agent_policy import AgentPolicy, CustomCombinedExtractor
...
# Pass the custom feature extractor so the saved policy can be rebuilt on load.
policy_kwargs = dict(
    features_extractor_class=CustomCombinedExtractor
)
model = PPO.load("model.zip", policy_kwargs=policy_kwargs)

glmcdona avatar Oct 02 '21 20:10 glmcdona

The MLP has only 4 layers: 2 layers of 64 each for the actor and the critic.

The CnnPolicy only works well on images. The API gives us all of the information without any of the noise. A CNN approach would never be able to determine if there were multiple workers on a city tile, for example.
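Whether a map-style observation can distinguish stacked workers depends on how the layer is encoded. As a hedged sketch (the helper `unit_count_layer` and the `(x, y)` position tuples are hypothetical, not part of this repo's API), a count-valued layer keeps that information where a binary presence layer would lose it:

```python
def unit_count_layer(unit_positions, width, height):
    """Build a 2D layer where each cell holds the number of units on it."""
    layer = [[0] * width for _ in range(height)]
    for x, y in unit_positions:
        layer[y][x] += 1  # counts accumulate instead of saturating at 1
    return layer


# Two workers stacked on the same (city) tile at (1, 1), one at (2, 0).
layer = unit_count_layer([(1, 1), (1, 1), (2, 0)], width=3, height=3)
for row in layer:
    print(row)
```

This only shows that the information can be represented in the input; it says nothing about whether a given network learns to use it.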

goforks12 avatar Oct 02 '21 22:10 goforks12

Geoff, btw do you have any idea how to get rid of the runtime stacking error? At around 40-50 million steps, too many of the games stop early because the model hasn't quite learned to save fuel during the night. And this causes errors if too many games end early.

goforks12 avatar Oct 03 '21 03:10 goforks12

Not sure what would cause this. Do you have a copy of the error by any chance? Is it a memory leak, out of memory error?

glmcdona avatar Oct 03 '21 03:10 glmcdona

Fun fact: image

All have the same reward function:

  • White: some reference
  • Blue gamma_0: higher episode length, lower reward
  • Orange gamma_1: lower episode length, higher reward

I still have to benchmark them.

royerk avatar Oct 03 '21 21:10 royerk

Process SpawnProcess-32:
Traceback (most recent call last):
  File "C:\Users\18176\anaconda3\envs\pythree\lib\multiprocessing\process.py", line 297, in _bootstrap
    self.run()
  File "C:\Users\18176\anaconda3\envs\pythree\lib\multiprocessing\process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\18176\Desktop\luxlux20\examples\stable_baselines3\common\vec_env\subproc_vec_env.py", line 29, in _worker
    observation, reward, done, info = env.step(data)
  File "C:\Users\18176\Desktop\luxlux20\examples\luxai2021\env\lux_env.py", line 64, in step
    obs = self.learning_agent.get_observation(self.game, unit, city_tile, team, is_new_turn)
  File "C:\Users\18176\Desktop\luxlux20\examples\agent_policy.py", line 369, in get_observation
    c = game.cities[game.map.get_cell_by_pos(closest_position).city_tile.city_id]
AttributeError: 'NoneType' object has no attribute 'city_id'
Traceback (most recent call last):
  File "C:\Users\18176\anaconda3\envs\pythree\lib\multiprocessing\connection.py", line 312, in _recv_bytes
    nread, err = ov.GetOverlappedResult(True)
BrokenPipeError: [WinError 109] The pipe has been ended

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Other_train.py", line 191, in <module>
    train(local_args)
  File "Other_train.py", line 163, in train
    model.learn(total_timesteps=args.step_count, reset_num_timesteps=True)
  File "C:\Users\18176\Desktop\luxlux20\examples\stable_baselines3\ppo\ppo.py", line 310, in learn
    reset_num_timesteps=reset_num_timesteps,
  File "C:\Users\18176\Desktop\luxlux20\examples\stable_baselines3\common\on_policy_algorithm.py", line 237, in learn
    continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
  File "C:\Users\18176\Desktop\luxlux20\examples\stable_baselines3\common\on_policy_algorithm.py", line 178, in collect_rollouts
    new_obs, rewards, dones, infos = env.step(clipped_actions)
  File "C:\Users\18176\Desktop\luxlux20\examples\stable_baselines3\common\vec_env\base_vec_env.py", line 162, in step
    return self.step_wait()
  File "C:\Users\18176\Desktop\luxlux20\examples\stable_baselines3\common\vec_env\subproc_vec_env.py", line 120, in step_wait
    results = [remote.recv() for remote in self.remotes]
  File "C:\Users\18176\Desktop\luxlux20\examples\stable_baselines3\common\vec_env\subproc_vec_env.py", line 120, in <listcomp>
    results = [remote.recv() for remote in self.remotes]
  File "C:\Users\18176\anaconda3\envs\pythree\lib\multiprocessing\connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "C:\Users\18176\anaconda3\envs\pythree\lib\multiprocessing\connection.py", line 321, in _recv_bytes
    raise EOFError

goforks12 avatar Oct 04 '21 05:10 goforks12

Hate to nag, but the recording/playing command does not seem to work, and the newly updated files don't compile on the Kaggle server for submissions.

goforks12 avatar Oct 04 '21 08:10 goforks12

Hi @goforks12, is it a different issue now? If so, can you please open a separate issue per problem? Also, more details on the second problem would be helpful, I think.

nosound2 avatar Oct 04 '21 09:10 nosound2

It seems to be a problem in your custom code, specifically in this line: `c = game.cities[game.map.get_cell_by_pos(closest_position).city_tile.city_id]`. Are there any changes in the agent that you run, compared with the git version?

nosound2 avatar Oct 04 '21 09:10 nosound2

I didn't mess with any of the game engine. I didn't change anything within the LuxAI computations. I was, however, using 16 CPU cores, and the MLP I was training had much larger layers.

goforks12 avatar Oct 04 '21 16:10 goforks12

lux-ai-2021 --seed=100 ./kaggle_submissions/main_lux-ai-2021.py ./kaggle_submissions/main_lux-ai-2021.py --maxtime 100000

I try to run this command in bash with my Model.zip and my Agent_policy.py in the Kaggle submission folder. Should lux-ai-2021 be a Python file? Or should it be the folder we cd into to run the evaluation?

goforks12 avatar Oct 04 '21 16:10 goforks12

lux-ai-2021 is a command added by the official Lux AI repo, check out the installation instructions here if the command isn't found in your environment: https://github.com/Lux-AI-Challenge/Lux-Design-2021

glmcdona avatar Oct 04 '21 16:10 glmcdona

If you didn't modify agent_policy.py to create your own agent yet, then I suspect there must be a rare game engine bug case where the Game.cities list is somehow not accurate, where it points to a City that actually doesn't belong to its cell anymore. I'll have a quick look through the code to see if I can spot anything. As a workaround, you can add a try/except to the get_observation() function in agent_policy.py to ignore and log errors.
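A minimal sketch of that workaround, under stated assumptions: the helper name `safe_city_lookup` is hypothetical, `cities` is assumed to be a dict keyed by city id (as the failing line suggests), and the `SimpleNamespace` objects below are stand-ins for the real cell and city tile types. It catches only the errors seen in the traceback rather than every exception.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


def safe_city_lookup(cities, cell):
    """Return the city owning `cell`, or None if the lookup fails."""
    try:
        return cities[cell.city_tile.city_id]
    except (AttributeError, KeyError) as exc:
        # AttributeError: the cell has no city tile (city_tile is None).
        # KeyError: the tile refers to a city id that no longer exists.
        logger.warning("Skipping city lookup: %r", exc)
        return None


cities = {"c_1": "city object"}
with_city = SimpleNamespace(city_tile=SimpleNamespace(city_id="c_1"))
empty_cell = SimpleNamespace(city_tile=None)
stale_cell = SimpleNamespace(city_tile=SimpleNamespace(city_id="c_9"))

print(safe_city_lookup(cities, with_city))
print(safe_city_lookup(cities, empty_cell))
print(safe_city_lookup(cities, stale_cell))
```

The caller then has to decide what observation values to emit when the lookup returns None, for example zeros for the city-related features.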

glmcdona avatar Oct 04 '21 16:10 glmcdona

I was doing an obscenely long training period. Will use shorter times now.

goforks12 avatar Oct 04 '21 20:10 goforks12

Here is an example training run from an 'okay' RL personal agent I've built. Notes:

  • This is the 'classic' reward function (the one here https://www.kaggle.com/glmcdona/reinforcement-learning-openai-ppo-with-python-game/notebook#Define-the-RL-agent-logic).
  • Opponent is a dummy agent that does nothing.
  • My agent here is a private version trying to get closer to the OpenAI Five approach.
  • This is CPU-only training after about 24 hours.
  • 50 FPS
  • I don't include the ep_len_mean plot below, because the agent makes heavy use of action sequences, so the episode lengths aren't very comparable.

Learning curve for a few batch sizes (n_steps is set to batch_size for each one): image

image

Here are a couple of replay files of the trained agent from the batch_size==10000 run (it's not great): replays.zip. Unzip it and you can view the replays here: https://2021vis.lux-ai.org/

glmcdona avatar Oct 16 '21 05:10 glmcdona