albert
Ran out of memory in memory space hbm on RACE xlarge v3 on TPU v2-8
First, I am getting this warning:
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Then following shortly after:
ERROR:tensorflow:Error recorded from outfeed: Step was cancelled by an explicit call to Session::Close()
.
E0221 21:23:45.654859 139752372131584 error_handling.py:75] Error recorded from outfeed: Step was cancelled by an explicit call to Session::Close()
.
ERROR:tensorflow:Error recorded from training_loop: From /job:worker/replica:0/task:0:
Compilation failure: Ran out of memory in memory space hbm. Used 12.34G of 8.00G hbm. Exceeded hbm capacity by 4.34G.
The only code change I have made is that instead of reading a single all.txt file, I now read each individual file: cur_path_list = tf.gfile.Glob(cur_dir + "/*.txt")
I am running TensorFlow 1.15.2 with Python 3. The base model trained without problems and reached 70%, but now I am unable to run xlarge on the TPU. Is the xlarge model too large for a TPU v2-8? Do I need to upgrade to a TPU v3-8 or a pod setup? I am using the default config file from TF Hub and the parameters from the README, with only the learning rate changed to 2e-5.
We haven't tried it on TPU v2, but how about trying it without dropout? We found that removing dropout can significantly reduce memory consumption.
May I ask where you got the TPUs? Thank you very much.