
Very slow on GPU

Open EvGe22 opened this issue 7 years ago • 1 comment

I'm trying to train the net on a Tesla K80 and the performance is pretty sad: around 8 seconds per iteration. I've seen the same issue reported in the original Theano implementation repo, where the problem turned out to be either float precision or the BLAS library. I have a BLAS library in place, and I don't think the float precision issue applies here since the TensorFlow code already uses float32 everywhere. What do you think could cause this?
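One thing I could try first is checking whether the ops are actually being placed on the GPU. This is just a minimal, generic TF 1.x sketch (not the repo's actual training script) using device placement logging:

```python
import tensorflow as tf

# Generic TF 1.x sketch: log where each op is placed so we can confirm
# the heavy matmul/RNN ops actually land on /gpu:0 and not on the CPU.
config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.allow_growth = True  # don't grab all GPU memory up front

with tf.Session(config=config) as sess:
    # Build or restore the model graph here, run one training step,
    # and inspect the placement log printed to stderr.
    pass
```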

I've also encountered a problem where training crashes while running on the validation set and/or saving the model, and it's a bit random: it crashed with eval_every=10000 and 5000, but not with eval_every=100. I can't provide any error messages since I didn't capture any. I don't think it's GPU memory, since I'm using two K80s with 24 GB in total, and RAM isn't the issue either; I've got plenty of it. Any ideas?
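To at least capture the error next time, I'm thinking of wrapping the eval/save step in a try/except that logs the traceback to a file. This is only a sketch with hypothetical run_validation/save_model hooks standing in for whatever the training loop calls at each eval_every step:

```python
import logging
import traceback

logging.basicConfig(filename="train_crash.log", level=logging.INFO)

def checkpoint_step(sess, step):
    try:
        run_validation(sess)    # assumed validation routine
        save_model(sess, step)  # assumed checkpointing routine
    except Exception:
        # Persist the full traceback so the "random" crash leaves a trace.
        logging.error("Crash at step %d:\n%s", step, traceback.format_exc())
        raise
```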

Just saving the model without running the validation set at eval_every=1500 works for now; I haven't gotten any errors yet.

The CUDA version is 7.5 and cuDNN is 5.1.3. Should I just update them? The TF version is r1.3, Python is 3.6, and the libraries are from the latest Anaconda3.
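For reference, a quick environment check I can run to confirm the installed TF build was compiled with CUDA support and actually sees both K80s (a generic TF 1.x sketch):

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

# Print the installed build info and the devices TF can see.
print("TF version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print([d.name for d in device_lib.list_local_devices()])
```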

EvGe22 avatar Jul 26 '17 12:07 EvGe22