batch-ppo
GPU doesn't seem to work
I've set `use_gpu = True`, but GPU usage is close to zero when running the code. When I look into TensorBoard, it shows that all operations are assigned to the CPU. When I disable `sess_config = tf.ConfigProto(allow_soft_placement=True)` and force it to run on the GPU, the console throws this error:
`INFO:tensorflow:Start a new run and write summaries and checkpoints to E:\Code\PythonScripts\DeepRL\BatchPPO\20180308T091941-pendulum.
WARNING:tensorflow:Number of agents should divide episodes per update.
2018-03-08 09:19:41.315004: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2018-03-08 09:19:41.595863: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 960 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 1.64GiB
2018-03-08 09:19:41.596493: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 960, pci bus id: 0000:01:00.0, compute capability: 5.2)
INFO:tensorflow:Graph contains 42003 trainable variables.
2018-03-08 09:19:57.811479: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 960, pci bus id: 0000:01:00.0, compute capability: 5.2)
Traceback (most recent call last):
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 1323, in _do_call
return fn(*args)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 1293, in _run_fn
self._extend_graph()
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py", line 1354, in _extend_graph
self._session, graph_def.SerializeToString(), status)
File "D:\Anaconda3\envs\py35\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 473, in exit
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation 'ppo_temporary/episodes/Variable': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and devices:
Switch: GPU CPU
VariableV2: CPU
Identity: CPU
Assign: CPU
RefSwitch: GPU CPU
ScatterUpdate: CPU
AssignAdd: CPU
[[Node: ppo_temporary/episodes/Variable = VariableV2[container="", dtype=DT_INT32, shape=[10], shared_name="", _device="/device:GPU:0"]]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:/Code/PythonScripts/DeepRL/BatchPPO/agents/scripts/train.py", line 163, in
Caused by op 'ppo_temporary/episodes/Variable', defined at:
File "E:/Code/PythonScripts/DeepRL/BatchPPO/agents/scripts/train.py", line 163, in
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'ppo_temporary/episodes/Variable': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available. Colocation Debug Info: Colocation group had the following types and devices: Switch: GPU CPU VariableV2: CPU Identity: CPU Assign: CPU RefSwitch: GPU CPU ScatterUpdate: CPU AssignAdd: CPU [[Node: ppo_temporary/episodes/Variable = VariableV2[container="", dtype=DT_INT32, shape=[10], shared_name="", _device="/device:GPU:0"]]]`
It seems that TensorFlow does not allow assigning an int-type variable to the GPU.
BTW, this runs on Windows 10, with TensorFlow 1.4.
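For what it's worth, the placement failure and the soft-placement fallback can be reproduced with a minimal sketch (written against the `tf.compat.v1` API so it also runs under TensorFlow 2.x; in TF 1.x an int32 `VariableV2` has no GPU kernel, so an explicit `/gpu:0` pin is only satisfiable when soft placement may move the op back to the CPU):

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

# Pin an int32 counter to the GPU. In TF 1.x the underlying variable op has
# no int32 GPU kernel, so this explicit placement can only be satisfied when
# soft placement is allowed to move the op back to the CPU.
with tf.device('/gpu:0'):
    counter = tf.compat.v1.get_variable(
        'counter', shape=[], dtype=tf.int32,
        initializer=tf.compat.v1.zeros_initializer(), trainable=False)
    increment = counter.assign_add(1)

config = tf.compat.v1.ConfigProto(
    allow_soft_placement=True,   # fall back to CPU for unsupported kernels
    log_device_placement=True)   # log the final device of every op

with tf.compat.v1.Session(config=config) as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    sess.run(increment)  # runs on CPU via soft placement
```

With `allow_soft_placement=False`, TF 1.x raises the `InvalidArgumentError` shown above instead of falling back.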
Hi @fengredrum. In case this is still an issue, could you try wrapping your network implementation in a `with tf.device('/gpu:0')` block?
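If helpful, that suggestion might look like the following minimal sketch (using `tf.compat.v1` so it also runs under TensorFlow 2.x; `build_network` and its layer sizes are hypothetical, not the actual batch-ppo network): variables created inside the `tf.device` block are pinned to the GPU, while everything outside keeps the default placement.

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

def build_network(observations):
    # Hypothetical two-layer feed-forward value head, for illustration only.
    w1 = tf.compat.v1.get_variable('w1', [3, 64], tf.float32)
    w2 = tf.compat.v1.get_variable('w2', [64, 1], tf.float32)
    return tf.matmul(tf.nn.relu(tf.matmul(observations, w1)), w2)

observations = tf.compat.v1.placeholder(tf.float32, [None, 3])

# Pin only the network to the GPU; other ops keep their default device.
with tf.device('/gpu:0'):
    value = build_network(observations)
```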
The neural network is assigned to the GPU, which I've checked in TensorBoard. The problem occurs in `agents/ppo/memory.py`, because `self._length` is an int32 variable. I tried initializing it as a `tf.float32` variable and then using `tf.to_int32` to bypass the problem. The code works fine on CPU. However, when running on the GPU, it doesn't seem to learn anything. Maybe there is some elusive bug in TensorFlow? LOL
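That workaround might look roughly like this (a hedged sketch, not the actual `memory.py` code; the variable name and capacity are made up): store the length as float32, which does have a GPU variable kernel in TF 1.x, and cast back to int32 at the point of use. Repeated float round-trips like this are exactly the kind of place where subtle training bugs can creep in.

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

capacity = 10  # hypothetical number of parallel episodes

with tf.device('/gpu:0'):
    # Store episode lengths as float32 so the variable has a GPU kernel.
    length_var = tf.compat.v1.get_variable(
        'episode_length', shape=[capacity], dtype=tf.float32,
        initializer=tf.compat.v1.zeros_initializer(), trainable=False)

# Cast back to int32 wherever an integer length is needed
# (tf.to_int32 is the TF 1.x spelling of this cast).
length = tf.cast(length_var, tf.int32)
```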
Thanks for providing more details. I don't think the replay buffer should be placed on the GPU, since it can grow quite large, especially when training from pixel observations. All ops should default to CPU because of the `with tf.device('/cpu:0')` block in `train.py`, and the network and RNN states should be specifically assigned to the GPU inside `ppo.py`. Is that not what is happening for you?
I tried running the default pendulum trainer. When I turn `use_gpu` on, it freezes during step 0 with no error; it runs fine otherwise. TensorFlow works fine with my GPU on other operations.
TensorFlow 1.8, Ubuntu 18.04, Nvidia GTX 1080, CUDA 9.0.
If my understanding is correct, TensorFlow will automatically give priority to the GPU for supported ops. Perhaps the `use_gpu` option should be removed if it doesn't work.
I agree; the transitions can be very large when training from pixel observations. I originally thought storing them in GPU memory would alleviate the communication cost between CPU and GPU, but it turns out that may not be the optimal solution when training on a single machine. Thank you for your constructive view.
@colinskow Could you try running without environment processes (`--noenv_processes`), please? When there is a crash in one of the processes, it can cause the program to deadlock before anything is printed.
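For reference, assuming the standard invocation from the repo README (the logdir path is a placeholder), the flag would be passed like this:

```shell
python3 -m agents.scripts.train \
  --logdir=/path/to/logdir --config=pendulum --noenv_processes
```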
@fengredrum The collected episodes should be stored in CPU memory in almost all scenarios. Is this not the case? We have the config options `batch_size` and `chunk_length` if you don't want to train on the full batch of episodes, in order to fit the network activations on the GPU.
Yes, it works exactly as you describe. I'm implementing Batch-PPO based on my understanding, and I've added several DL tricks to improve stability and performance. Still debugging; I hope it can reach or even beat your score, LOL.
I am currently trying to train on the GPU as well, but I am experiencing different issues than those listed above. At a certain step in training, I believe when it is about to update the global network, everything stops: on CPU everything is fine, I get some KL-cutoff prompts and training continues, but on GPU, CPU usage goes to zero, GPU usage stays at zero, and nothing else appears in the terminal (for over half an hour, after which I gave up).
What I changed: I added my own custom environment and network.
What I tried:
- the `--noenv_processes` argument: did not change anything, still no activity
- switching to the default feed-forward categorical network: no change
- checking out a clean clone and running the pendulum config with GPU enabled: after the first "Phase train" prompt for step 0, no more output and no CPU or GPU activity
What I'm running on: Ubuntu 16.04, TensorFlow 1.10, CUDA 9.0.