
Exceeding Memory

Open xiamengzhou opened this issue 5 years ago • 1 comments

INFO:tensorflow:out/model.ckpt-0 is not in all_model_checkpoint_paths. Manually adding it.
I0125 21:17:40.027305 139845646956352 checkpoint_management.py:95] out/model.ckpt-0 is not in all_model_checkpoint_paths. Manually adding it.
slurmstepd: error: Job 247071 exceeded memory limit (212183588 > 209715200), being killed

I was fine-tuning an ALBERT large model on the RACE dataset on a Slurm server, but the job was always killed for exceeding its memory limit. I have already raised the memory limit to 200g, but it still fails. Does anyone have an idea what might be going wrong here?
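
As a sanity check on the numbers in the log, here is a minimal sketch that parses the slurmstepd message and converts both values to GiB. It assumes the values are reported in KiB, which is consistent with the 200g setting mentioned above; the function name `parse_memory_limit` is only illustrative.

```python
import re

# Log line copied from the report above.
LOG_LINE = (
    "slurmstepd: error: Job 247071 exceeded memory limit "
    "(212183588 > 209715200), being killed"
)

# Assumption: slurmstepd reports these values in KiB.
KIB_PER_GIB = 1024 * 1024


def parse_memory_limit(line):
    """Return (used_gib, limit_gib) from a memory-limit message, or None."""
    match = re.search(r"exceeded memory limit \((\d+) > (\d+)\)", line)
    if match is None:
        return None
    used_kib, limit_kib = (int(group) for group in match.groups())
    return used_kib / KIB_PER_GIB, limit_kib / KIB_PER_GIB


used_gib, limit_gib = parse_memory_limit(LOG_LINE)
print(f"used {used_gib:.2f} GiB, limit {limit_gib:.2f} GiB")
```

Under that assumption, the limit is exactly the 200 GiB that was requested, and the job was killed after going a little over 2 GiB past it, so the new limit was applied and the job's memory use grew beyond it.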

xiamengzhou, Jan 26 '20 02:01

I ran into the same problem when fine-tuning on the SQuAD 2.0 dataset, but fortunately it did not stop me from getting the results (the ckpt file).

urextra, Jun 28 '20 11:06