DQN-tensorflow

GPU Utilization

ch3njust1n opened this issue on Jan 14 '17 · 12 comments

I have a Titan X and have been running the Breakout simulation for over two days now. It's only 7% of the way through training, and nvidia-smi shows the GPU at only 4-5% utilization. The README.md says training took only 30 hours on a 980, so that doesn't seem right. According to main.py, it should be using 100% of the GPU by default if I don't pass the flag. Is anyone else having this issue, or is it just me?

Edit: nvidia-smi -i 0 -q -d MEMORY,UTILIZATION,POWER,CLOCK,COMPUTE shows FB Memory Usage at 11423 MiB / 12185 MiB. Does that look correct with the default GPU setting for Breakout?

ch3njust1n · Jan 14 '17
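
For context: the near-full FB memory reading by itself is expected. TensorFlow reserves GPU memory up front according to the gpu_fraction flag, which says nothing about how busy the compute units are. Below is a minimal sketch of how such a flag is typically wired into a TF 1.x session; the calc_gpu_fraction helper and the '1/1' default are modeled on what this repo's main.py appears to do, but treat the exact names as assumptions.

```python
import tensorflow as tf  # TensorFlow 1.x API, as used at the time of this thread

def calc_gpu_fraction(fraction_string):
    # "1/1" -> 1.0, "1/2" -> 0.5, etc.
    idx, num = fraction_string.split('/')
    return float(idx) / float(num)

# Reserving (almost) all GPU memory up front is what makes nvidia-smi report
# ~11.4 GiB of FB memory in use even while GPU-Util sits at 4-5%.
gpu_options = tf.GPUOptions(
    per_process_gpu_memory_fraction=calc_gpu_fraction('1/1'))

with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
    pass  # build the graph and train the agent here
```

High memory usage combined with low GPU-Util therefore usually points to a CPU-side bottleneck (environment stepping, replay-memory handling) rather than a memory problem.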

Any luck so far solving the issue?

I am having the same problem with my GTX 1080. Its performance degrades after an hour or so: it starts at 250 it/s with an estimated time to finish of around 45 hours, then drops to 75 it/s with an estimate of around 170 hours.

infin8Recursion · Jan 19 '17

The it/s dropping is normal, since the agent learns to survive and each game tends to take longer. I don't know if it's normal for a Titan X to run at such a low load, though.

serialx · Jan 19 '17

Isn't it supposed to finish training in 24~30 hours? It did on a 980 Ti. However, that doesn't seem to be the case with the Titan X and the 1080, even though they outperform it.

Any suggestions about what could be causing this behavior?

@serialx Could you please share your setup and how long training took to finish?

infin8Recursion · Jan 19 '17

@infin8Recursion No luck so far.

ch3njust1n · Jan 21 '17

I've figured out what's going on; there may be a bug introduced in the recent commits. I'll dig into it and post an update here.

carpedm20 · Jan 21 '17

Is this issue solved yet?

slowbull · Mar 05 '17

In my case it also takes a very long time. GPU utilisation is about 50%, but training needs around 500 hours to complete 50,000,000 steps, which is almost a month.

Lan1991Xu · Mar 17 '17
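
As a quick back-of-the-envelope check on those figures:

```python
steps, hours = 50_000_000, 500
print(f"{steps / (hours * 3600):.1f} steps/s")  # ~27.8 steps/s, well below the 75-250 it/s reported above
```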

any update on this?

zcyang · Mar 20 '17

We don't have a concrete schedule for fixing this bug, but I recommend trying other great DQN implementations in TensorFlow such as https://github.com/dennybritz/reinforcement-learning or https://github.com/carpedm20/deep-rl-tensorflow

carpedm20 · Mar 20 '17

Same problem here. I'm trying to use the repository https://github.com/carpedm20/deep-rl-tensorflow instead.

shengwa · Mar 22 '17

Same problem for me. On my GTX 1070, this repo runs at ~90 iter/sec. https://github.com/carpedm20/deep-rl-tensorflow is faster, at ~120 iter/sec, but by far the fastest implementation (at least on my hardware) is https://github.com/matthiasplappert/keras-rl, running at ~190 iter/sec. If anyone knows of faster implementations, feel free to link them here. I'm looking for the fastest possible implementation since I'm running a load of experiments of 200 million steps each, and even an extra 10 iter/sec can mean finishing an experiment half a day sooner.

ionelhosu · Oct 24 '17
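
To put those rates in perspective over a 200-million-step run:

```python
steps = 200_000_000
for rate in (90, 120, 190):  # iter/sec figures reported above
    print(f"{rate:3d} it/s -> {steps / rate / 86400:5.1f} days")
# 90 it/s -> 25.7 days, 120 it/s -> 19.3 days, 190 it/s -> 12.2 days
```

Going from 180 to 190 it/s alone saves roughly 16 hours over 200 million steps, in line with the "half a day" estimate above.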

@ionelhosu Just wanted to point out that it's very hard to compare the speed of DQN implementations apples-to-apples. Apart from the network and the algorithm (DQN, Double DQN, etc.), other things can differ as well. The most subtle one is what "each iteration" actually means. Usually an iteration may include: advancing some number of steps in the environment, by either random exploration or a forward pass through the network, and perhaps sampling a batch and training on it. All of these details are controlled by hyperparameters and are hard to make consistent across implementations. Also, because of epsilon annealing in DQN, the speed is not constant across training; it gradually slows down, again as controlled by the hyperparameters.

ppwwyyxx · Nov 10 '17
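
To make that apples-to-apples point concrete, here is a skeletal DQN loop. Whether "one iteration" means a single environment step, an environment step plus an occasional gradient update, or something else entirely depends on the hyperparameters shown. This is a generic sketch with stubbed-out classes, not this repo's code; names such as learn_start and train_frequency and all values are illustrative assumptions.

```python
import random

# Stub environment and agent so the skeleton actually runs; a real setup
# would use the Atari emulator and a Q-network.
class Env:
    def random_action(self): return random.randrange(4)
    def step(self, action): return 0.0, 0.0, False   # observation, reward, done

class Agent:
    def predict(self, obs): return 0                  # Q-network forward pass
    def train(self, batch): pass                      # one gradient update

learn_start, train_frequency = 50_000, 4              # hypothetical hyperparameters
eps_start, eps_end, eps_steps = 1.0, 0.1, 1_000_000   # epsilon annealing schedule

env, agent, memory = Env(), Agent(), []
obs = 0.0

for step in range(100_000):                           # each pass = one "iteration"
    eps = eps_start + min(step / eps_steps, 1.0) * (eps_end - eps_start)
    if step < learn_start or random.random() < eps:
        action = env.random_action()                  # cheap random exploration
    else:
        action = agent.predict(obs)                   # slower: needs a forward pass
    obs, reward, done = env.step(action)
    memory.append((obs, action, reward, done))        # unbounded here; real code caps it
    if step >= learn_start and step % train_frequency == 0:
        batch = random.sample(memory, 32)             # slowest part: sample + backprop
        agent.train(batch)
```

Depending on the codebase, the reported it/s may count each raw frame, each (frame-skipped) environment step, or each gradient update, so two implementations reporting the same number can be doing very different amounts of work per "iteration".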