
Ran out of memory in memory space hbm on RACE xlarge v3 on TPU v2-8

Open theword opened this issue 5 years ago • 2 comments

First I am getting this warning:

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

Then following shortly after:

ERROR:tensorflow:Error recorded from outfeed: Step was cancelled by an explicit call to Session::Close().
E0221 21:23:45.654859 139752372131584 error_handling.py:75] Error recorded from outfeed: Step was cancelled by an explicit call to Session::Close().
ERROR:tensorflow:Error recorded from training_loop: From /job:worker/replica:0/task:0:
Compilation failure: Ran out of memory in memory space hbm. Used 12.34G of 8.00G hbm. Exceeded hbm capacity by 4.34G.

The only code change I have made is that instead of reading a single all.txt file, I am reading in each individual file: cur_path_list = tf.gfile.Glob(cur_dir + "/*.txt")
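For reference, a minimal sketch of that per-file listing, using Python's stdlib glob in place of tf.gfile.Glob (the two accept the same shell-style pattern; the directory name here is just an illustration):

```python
import glob
import os

def list_input_files(cur_dir):
    """List every .txt file in cur_dir, mirroring tf.gfile.Glob(cur_dir + "/*.txt").

    Sorting makes the file order deterministic across runs, which tf.gfile.Glob
    does not guarantee.
    """
    return sorted(glob.glob(os.path.join(cur_dir, "*.txt")))
```

On its own this change only alters which input files are read, so it is unlikely to be the cause of the HBM compilation failure; the OOM happens when XLA compiles the model graph, before input data matters.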

I am running TensorFlow 1.15.2 with Python 3. The base model worked perfectly (around 70%), but now I am unable to run xlarge on the TPU. Is the xlarge model too large for a TPU v2-8? Do I need to upgrade to a TPU v3-8 or a pod setup? I am using the default config file from TF Hub and the parameters from the README file, with the learning rate changed to 2e-5.

theword avatar Feb 21 '20 21:02 theword

We haven't tried it on the TPU v2 version, but how about trying it without dropout? We found that removing dropout can significantly reduce memory consumption.
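One way to try this is to zero the dropout fields in the model's JSON config before training. A minimal sketch, assuming the config uses BERT-style field names (hidden_dropout_prob, attention_probs_dropout_prob); check your albert_config.json for the exact keys:

```python
import json

def disable_dropout(config_path, out_path):
    """Write a copy of an ALBERT/BERT-style config with dropout disabled.

    Field names are assumptions based on the standard BERT config layout;
    only keys that actually exist in the config are modified.
    """
    with open(config_path) as f:
        cfg = json.load(f)
    for key in ("hidden_dropout_prob", "attention_probs_dropout_prob"):
        if key in cfg:
            cfg[key] = 0.0  # dropout of 0.0 disables it
    with open(out_path, "w") as f:
        json.dump(cfg, f, indent=2)
    return cfg
```

You would then point the training script's config flag at the new file instead of the original.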

Danny-Google avatar Mar 27 '20 23:03 Danny-Google

May I ask where you got the TPUs? Thank you very much.

guotong1988 avatar Apr 16 '20 11:04 guotong1988