
Environment hangs up when spawning from different processes

Open GraphicsHunter opened this issue 5 years ago • 16 comments

Hello,

I'm trying to run the obstacle tower environment through the large-scale-curiosity project. However, it seems to hang when it tries to create the environment from its subprocesses. It prints that the CrashReporter is initialized along with the Mono config paths, then does nothing for a while and hangs with the following image:

[screenshot: the environment window showing only a gray screen]

This is run on a 13'' 2018 MacBook Pro, without a discrete GPU. Is there any way to troubleshoot and debug this? I can't do much, since nothing is logged from inside the environment.

GraphicsHunter avatar Mar 18 '19 02:03 GraphicsHunter

Hi @tianfanzhu

Can you confirm that you are running the latest version of Obstacle Tower (v1.2)? Also, does it work when using the basic usage python notebook we provide as an example?

awjuliani avatar Mar 19 '19 17:03 awjuliani

Hi @awjuliani ,

I am indeed running this on the latest version, v1.2. Also, I found out from the basic usage notebook that the screen is gray, as shown above, until the env is reset or stepped.

GraphicsHunter avatar Mar 19 '19 21:03 GraphicsHunter

I have the same problem. iMac end 2015 osx 10.14.3

binoalien avatar Mar 21 '19 09:03 binoalien

Me too. Except I'm running this via the Unity Obstacle Tower Challenge run.py script, and at startup I see the game character appear and then fall off the blank screen into nothingness. After that, empty gray screen.

iMac running 10.14.4

NancyFulda avatar Mar 30 '19 16:03 NancyFulda

However, when I click on the obstacletower.app file directly, it runs flawlessly.

NancyFulda avatar Mar 30 '19 16:03 NancyFulda

Hi all, it may be difficult to tell whether everyone is experiencing the same issue. A couple of important things to note:

  • When running multiple environments, the worker_id value (in the environment constructor) must be set to a different integer for each environment. This is because the gym wrapper and the Unity executable communicate via gRPC over a particular port, and each environment reserves its own port.
  • When running run.py, if you are running in evaluation mode the run.py script must be launched before the environment executable.
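A minimal sketch of the first point. The base port of 5005 is an assumption (it was the ML-Agents default around this era; check your installed version), and the commented-out constructor call uses a hypothetical executable path:

```python
# Sketch: give every parallel environment its own worker_id so the gRPC
# ports don't collide. BASE_PORT = 5005 is an assumption from the
# ML-Agents defaults of this era.
BASE_PORT = 5005

def assign_worker_ids(n_envs, start=0):
    worker_ids = list(range(start, start + n_envs))
    ports = [BASE_PORT + wid for wid in worker_ids]
    # every environment must land on a distinct port
    assert len(set(ports)) == len(ports)
    return worker_ids

# Each id would then be passed to the constructor, e.g.
# ObstacleTowerEnv('./ObstacleTower/obstacletower', worker_id=wid)
```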

So with that in mind: @tianfanzhu can you confirm whether your environment construction sets a different worker_id value for each environment?

@NancyFulda are you running in evaluation mode or just directly running the run.py script? If you're directly running the script, could you look for a file called UnitySDK.log in the same folder as the ObstacleTower.app file and share the contents?

harperj avatar Apr 01 '19 19:04 harperj

Hi @harperj, thanks for looking into this!

I'm directly executing run.py. Interestingly, the behavior this morning is different than it was on Saturday (maybe I rebooted in between??) I still see grayness, but the game character does not appear anymore. However, the run.py script no longer hangs, but instead prints out the reward for each episode.

Is this the expected behavior? It would be nice to be able to watch the agent's character navigate the world (to see where it's messing up), but since the environment seems to be executing at faster-than-real-time speed, maybe the grayed out screen is normal?

The UnitySDK.log contents are as follows:

4/1/2019 1:35:59 PM

Log Academy resetting

Log Seed: 52

Log Seed: 47

Log Academy resetting

Log Seed: 26

Log Seed: 91

Log Academy resetting

Log Seed: 65

Log Seed: 17

Log Academy resetting

Log Seed: 44

Log You reached floor: 1

Log Seed: 64

Log Academy resetting

Log Seed: 34

Log Seed: 58

Log Academy resetting

Log Seed: 85

NancyFulda avatar Apr 01 '19 19:04 NancyFulda

@NancyFulda This is the expected behavior. When training, the camera isn't turned on in order to improve performance. You can see the camera by turning on realtime mode in the environment (realtime_mode=True in the constructor).
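As a sketch, the two modes differ only in that one constructor flag (the helper function and executable path here are hypothetical; `realtime_mode` is the flag named above):

```python
def env_kwargs(worker_id, watch=False):
    # watch=True turns the in-game camera on and runs at real-time speed,
    # which is what you want for observing the agent but slows training.
    return {"worker_id": worker_id, "realtime_mode": watch}

# e.g. ObstacleTowerEnv('./ObstacleTower/obstacletower', **env_kwargs(0, watch=True))
```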

harperj avatar Apr 01 '19 20:04 harperj

@harperj Ah, that worked perfectly! Everything seems to be in order now. Thank you!

NancyFulda avatar Apr 01 '19 20:04 NancyFulda

Hi @harperj @awjuliani I also encountered the same issue:

I tried to use ML-Agents 0.8.1 by simply setting options['--env'] = 'ObstacleTower/ObstacleTower' and options['--num-envs'] = 2.

After launching the 2 envs, in one the agent just spawned and fell down, while the other was 'not responding', and the CPU and GPU usage of the falling-agent env was very high.

This issue occurs on my Windows machine (Windows 10), but there is no problem with the same setting on my Mac. I've also checked that I'm using ObstacleTower-v1.3.

Here's the reference video [https://youtu.be/u-J7mlwlmr0]

stevenh-tw avatar May 07 '19 13:05 stevenh-tw

I was able to get large-scale-curiosity + the Obstacle Tower Challenge working with up to about 32 agents:

  • make sure worker_id is unique for each instance
  • set timeout_wait=6000
  • add a sleep(2) between creating each instance (i.e. 2 seconds)
  • some worker_id values may clash on Windows - for me I needed to add if rank >= 35: rank += 1
  • I copied the render module from OpenAI Gym to visualize training (realtime_mode=True slows down training)

@karta1297963 what you see in your video is what happens when the Unity environment does not sync with Python. Even with everything I did above, I still see this about 1 in 5 times when starting off a run (even across different code bases).
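The launch tips in the list above can be sketched roughly as follows. The clash threshold of 35 is taken from the comment; the factory wiring (a callable wrapping the environment constructor with timeout_wait=6000) is an assumption:

```python
import time

def remap_worker_id(rank, clash=35):
    # Skip a worker_id that clashed on Windows (threshold from the
    # comment above); everything at or above it is bumped up by one,
    # so id 35 is never used.
    return rank + 1 if rank >= clash else rank

def launch_all(n_envs, launch, delay=2.0):
    """launch(worker_id) builds one env -- e.g. a lambda wrapping
    ObstacleTowerEnv(path, worker_id=wid, timeout_wait=6000).
    The short sleep between launches gives each Unity process time
    to bind its port before the next one starts."""
    envs = []
    for rank in range(n_envs):
        envs.append(launch(remap_worker_id(rank)))
        time.sleep(delay)
    return envs
```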

Sohojoe avatar May 09 '19 16:05 Sohojoe

Like @Sohojoe said, this looks like an issue with the connection between Python and Obstacle Tower / Unity. It could be that the port is in use for something else, that the worker_id is not being set correctly, or that the environment takes longer than the timeout_wait to start up. You could potentially have your script fail gracefully and re-launch on timeout as well, or try a new worker_id if you have a reserved port that conflicts.
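A sketch of the fail-gracefully-and-relaunch idea. The factory argument is hypothetical wiring (in practice a lambda around the environment constructor), and catching a broad Exception is a deliberate simplification to stay version-agnostic about which timeout exception mlagents raises:

```python
import time

def launch_with_retry(make_env, base_worker_id, attempts=3, delay=2.0):
    """Retry with successive worker_ids until the environment comes up.

    make_env(worker_id) is any factory -- e.g. a lambda wrapping
    ObstacleTowerEnv(path, worker_id=wid) -- that raises on timeout
    or a port conflict.
    """
    last_err = None
    for offset in range(attempts):
        try:
            return make_env(base_worker_id + offset)
        except Exception as err:
            last_err = err
            time.sleep(delay)  # let the failed Unity process release its port
    raise RuntimeError(f"no env came up after {attempts} worker_ids") from last_err
```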

harperj avatar May 09 '19 17:05 harperj

@Sohojoe @harperj thanks for helping. I tried the solution @Sohojoe mentioned, but it didn't work. Later I tried to cross-validate the compatibility between mlagents-envs v0.8 and a Unity instance built with ML-Agents v0.6 (like Obstacle Tower).

I built 2 instances of the default ML-Agents task (Pyramids) with SDK v0.6 and v0.8 respectively. It turns out the v0.6 instance has the same sync issue while the v0.8 instance doesn't. Comparing the git history, it seems v0.8 added the ability to customize the gRPC communication messages, which I guess is the reason Python and Unity don't sync (though somehow with only 1 environment the issue doesn't occur).

I guess the possible solutions are:

  1. Wait for ObstacleTower to update to ML-Agents v0.8
  2. Use mlagents-envs v0.6 and somehow make it work with the ML-Agents v0.8 SubprocessUnityEnvironment

stevenh-tw avatar May 10 '19 04:05 stevenh-tw

@karta1297963 - what platform / OS are you using?

Sohojoe avatar May 22 '19 17:05 Sohojoe

@Sohojoe I'm using Windows 10. I currently have a workaround using the OpenAI Baselines SubprocVecEnv class, and it works! But it seems this approach can't have the step function return both the visual and the vector observation at the same time.
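For context, the SubprocVecEnv pattern boils down to one pipe-connected subprocess per environment. This is a stripped-down sketch of that pattern (not the Baselines class itself); with Obstacle Tower each factory would pass a unique worker_id. The fork start method is used for simplicity and is not available on Windows:

```python
import multiprocessing as mp

def _worker(remote, make_env):
    # Each subprocess owns one environment instance and services
    # commands sent over its end of the pipe.
    env = make_env()
    while True:
        cmd, data = remote.recv()
        if cmd == "step":
            remote.send(env.step(data))
        elif cmd == "reset":
            remote.send(env.reset())
        else:  # "close"
            remote.close()
            break

class TinySubprocVecEnv:
    """Sketch of the SubprocVecEnv pattern, not the Baselines class."""

    def __init__(self, env_fns):
        ctx = mp.get_context("fork")  # "spawn" would require picklable factories
        self.remotes, work_remotes = zip(*[ctx.Pipe() for _ in env_fns])
        self.procs = [
            ctx.Process(target=_worker, args=(wr, fn), daemon=True)
            for wr, fn in zip(work_remotes, env_fns)
        ]
        for p in self.procs:
            p.start()

    def reset(self):
        for r in self.remotes:
            r.send(("reset", None))
        return [r.recv() for r in self.remotes]

    def step(self, actions):
        # One action per environment; results come back in the same order.
        for r, a in zip(self.remotes, actions):
            r.send(("step", a))
        return [r.recv() for r in self.remotes]

    def close(self):
        for r in self.remotes:
            r.send(("close", None))
        for p in self.procs:
            p.join()
```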

stevenh-tw avatar May 22 '19 17:05 stevenh-tw

@karta1297963 - I created a simple repro that spawns many instances as an example of how I do it - https://github.com/Sohojoe/many_towers

Sohojoe avatar May 24 '19 17:05 Sohojoe